Data Preprocessing and Augmentation

Duration: 7 min

This module covers the essential steps of data preprocessing and augmentation for neural networks. Preprocessing puts the data into a format suitable for training, while augmentation increases the diversity of the dataset, which tends to produce more robust models that generalize better.

Data Normalization and Standardization

Data normalization and standardization are critical preprocessing steps. Normalization (min-max scaling) rescales each feature to the range [0, 1], while standardization transforms each feature to have a mean of 0 and a standard deviation of 1. Putting features on a comparable scale typically speeds up gradient-based training and can improve the model's performance.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])

# Normalization
scaler_norm = MinMaxScaler()
normalized_data = scaler_norm.fit_transform(data)

# Standardization
scaler_std = StandardScaler()
standardized_data = scaler_std.fit_transform(data)

print('Normalized Data:', normalized_data)
print('Standardized Data:', standardized_data)

Output:

Normalized Data: [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
Standardized Data: [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
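
To make the two formulas concrete, here is a minimal NumPy sketch that reproduces the scaler results by hand on the same sample data. The variable names are illustrative, and it assumes that no column is constant (otherwise the divisions below would divide by zero). Like StandardScaler, np.std uses the population standard deviation by default.

```python
import numpy as np

# Same sample data as in the example above
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)

# Min-max normalization, per column: (x - min) / (max - min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized_manual = (data - col_min) / (col_max - col_min)

# Standardization, per column: (x - mean) / std
# np.std defaults to the population standard deviation (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)
standardized_manual = (data - col_mean) / col_std

print('Normalized (manual):', normalized_manual)
print('Standardized (manual):', standardized_manual)
```

Both statistics are computed per column (axis=0), so each feature is scaled independently of the others.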

Data Augmentation Techniques

Data augmentation creates modified copies of existing samples to increase the dataset's size and variability. Common techniques include rotation, scaling, flipping, and adding noise. These methods help reduce overfitting and improve the model's ability to generalize to new data.

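import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sample image: a batch of one 2x2 single-channel image, shape (1, 2, 2, 1)
image = np.array([[[1], [2]], [[3], [4]]], dtype="float32").reshape((1, 2, 2, 1))

# Data augmentation: random rotations, shifts, and horizontal flips
datagen = ImageDataGenerator(rotation_range=45, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True)

# Generating augmented images
# flow() yields batches indefinitely, so draw a fixed number with next()
# instead of iterating over it in a list comprehension (which never ends)
flow = datagen.flow(image, batch_size=1)
augmented_images = [next(flow) for _ in range(3)]

print('Original Image:', image)
print('Augmented Image:', augmented_images[0][0])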
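
Flipping, rotation, and noise can also be sketched directly in NumPy, which makes it easier to see what each transformation does to the pixel values. This is only an illustration under stated assumptions: the 3x3 image, the noise scale of 0.05, and the fixed seed are hypothetical choices, and the rotation is restricted to 90 degrees so that no interpolation is needed.

```python
import numpy as np

# Hypothetical 3x3 grayscale image with pixel values in [0, 1]
image = np.arange(9, dtype=np.float32).reshape(3, 3) / 8.0

# Fixed seed so the noise is reproducible (illustrative choice)
rng = np.random.default_rng(0)

# Horizontal flip: reverse the column order
flipped = np.fliplr(image)

# Rotation by 90 degrees counterclockwise
rotated = np.rot90(image)

# Additive Gaussian noise, clipped back to the valid pixel range
noise = rng.normal(loc=0.0, scale=0.05, size=image.shape)
noisy = np.clip(image + noise, 0.0, 1.0)

print('Flipped:', flipped)
print('Rotated:', rotated)
print('Noisy:', noisy)
```

Clipping after adding noise keeps the augmented image within the same value range as the original, so it remains a plausible input for the model.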

💡 Tip: When applying data augmentation, ensure that the transformations are realistic and relevant to the problem domain to avoid introducing noise that could degrade model performance.

❓ What is the purpose of data normalization?

A. To increase the dataset size
B. To scale data to a range of [0, 1]
C. To shuffle the dataset
D. To add noise to the data

❓ Which technique is used to prevent overfitting by creating modified copies of data?

A. Data normalization
B. Data standardization
C. Data augmentation
D. Data shuffling
