Machine Learning: Garbage In, Garbage Out

Developer to Data Scientist — Part II

May 9, 2023

I am writing this article, as a student of Data Science and Artificial Intelligence, to share my thoughts and learnings about the importance of training data.

Machine Learning, Deep Learning, Neural Networks, and Artificial Intelligence: these terms are not hokum. They stop being mysterious once you understand that this intelligence is essentially memorization: models are trained to predict correct estimates, controlled via weights and biases that are adjusted through backpropagation, and the same idea is repeated in the form of layers.

I will start with a basic example of a neural network: a CNN for image classification. First, let's clarify a few terms from a bird's-eye view.

What is a Neural Network?

A Neural Network is a type of machine-learning algorithm that consists of layers of interconnected nodes, or neurons, that process and transmit information.

Training a neural network involves adjusting the Weights and Biases of the connections between neurons so that the network can learn to make accurate predictions or classifications. This is typically done using a process called Backpropagation, where the network’s output is compared to the desired output and the error is propagated back through the network to adjust the weights and biases accordingly.
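To make the weight-and-bias adjustment concrete, here is a minimal sketch with a single neuron rather than a full network. With only one layer, backpropagation reduces to the chain rule applied once. The toy data, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

# Toy data for a single neuron: the desired relationship is y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # weight and bias, both starting at zero
learning_rate = 0.05     # assumed step size, picked for this toy problem

for step in range(2000):
    pred = w * x + b                    # forward pass
    error = pred - y                    # compare output to the desired output
    grad_w = 2.0 * np.mean(error * x)   # gradient of mean squared error w.r.t. w
    grad_b = 2.0 * np.mean(error)       # gradient of mean squared error w.r.t. b
    w -= learning_rate * grad_w         # move the weight against its gradient
    b -= learning_rate * grad_b         # move the bias against its gradient

final_loss = float(np.mean((w * x + b - y) ** 2))
print("w =", round(float(w), 3), "b =", round(float(b), 3))
```

In a multi-layer network the same update is applied to every weight and bias, with the error signal passed backwards from layer to layer.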

What is a Convolutional Neural Network, aka CNN?

A CNN is a type of neural network that is commonly used in image and video processing applications.

Convolution is a mathematical operation that slides a filter (aka kernel) across the input data, applying it to one small section at a time to extract a specific feature. The result is called a feature map.

Pooling is used to downsample the feature map and reduce the spatial dimensions of the data. Activation functions introduce non-linearity to the model, allowing it to learn more complex relationships between features.
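The three building blocks above can be sketched in plain NumPy. This is an illustrative sketch, not how Keras implements them: the function names, the 6×6 toy image, and the vertical-edge kernel are all assumptions made for the example. Note that deep learning libraries usually compute cross-correlation (the kernel is not flipped), which is what this sketch does too.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with one small section and sum the result
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive values, replace negatives with zero."""
    return np.maximum(0, x)

def max_pool_2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2            # drop an odd trailing row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Toy image: dark left half, bright right half (a vertical edge).
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)
# Kernel that responds to a dark-to-bright transition from left to right.
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = relu(conv2d_valid(image, kernel))   # shape (4, 4)
pooled = max_pool_2x2(feature_map)                # shape (2, 2)
print(feature_map)
print(pooled)
```

The feature map is largest where the window straddles the edge, and pooling keeps that response while halving each spatial dimension.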

Now, I will share an example of training a Sequential model that predicts image classes.

First, I will train the model on correctly labeled images. After that, we will corrupt the training data, rebuild the model, and see that the model still recognizes the image according to what it learned but reports the wrong label, because the labels it was trained on were wrong.

The complete notebook code, input data, and model files are available here:

GitHub - RitreshGirdhar/CNN-image-classification

github.com

Import libraries

!pip install opencv-python
import numpy as np
import os
from sklearn.metrics import confusion_matrix
import seaborn as sn; sn.set(font_scale=1.4)
from sklearn.utils import shuffle           
import matplotlib.pyplot as plt             
import cv2                                 
import tensorflow as tf                
from tqdm import tqdm

We will classify six kinds of images.

class_names = ['mountain', 'street', 'glacier', 'buildings', 'sea', 'forest']
class_names_label = {class_name:i for i, class_name in enumerate(class_names)}
nb_classes = len(class_names)

IMAGE_SIZE = (150, 150)

Define the data-loading, image-display, and plotting functions

def load_data():
    datasets = ['./CNN-image-classification/input/seg_train', 
                './CNN-image-classification/input/seg_test']
    output = []
    
    # Iterate through training and test sets
    for dataset in datasets:
        images = []
        labels = []
        print("Loading {}".format(dataset))
        # Iterate through each folder corresponding to a category
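        for folder in os.listdir(dataset):
            # Skip anything that is not one of the six class folders (e.g. hidden files)
            if folder not in class_names_label:
                continue
            label = class_names_label[folder]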
            
            # Iterate through each image in our folder
            for file in tqdm(os.listdir(os.path.join(dataset, folder))):
                
                # Get the path name of the image
                img_path = os.path.join(dataset, folder, file)
                
                # Open the image; skip files OpenCV cannot read (imread returns None)
                image = cv2.imread(img_path)
                if image is None:
                    continue
                # Convert OpenCV's BGR channel order to RGB, then resize
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                image = cv2.resize(image, IMAGE_SIZE)
                
                # Append the image and its corresponding label to the output
                images.append(image)
                labels.append(label)
                
        images = np.array(images, dtype = 'float32')
        labels = np.array(labels, dtype = 'int32')   
        
        output.append((images, labels))

    return output


def display_random_image(class_names, images, labels):
    index = np.random.randint(images.shape[0])
    plt.figure()
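    # Pixels are stored as float32 in the 0-255 range; cast to uint8 so imshow renders them
    plt.imshow(images[index].astype('uint8'))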
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.title('Image #{} : '.format(index) + class_names[labels[index]])
    plt.show()



def plot_accuracy_loss(history):
    fig = plt.figure(figsize=(10,5))
    # Plot accuracy
    plt.subplot(221)
    plt.plot(history.history['accuracy'],'bo--', label = "accuracy")
    plt.plot(history.history['val_accuracy'], 'ro--', label = "val_accuracy")
    plt.title("train_accuracy vs val_accuracy")
    plt.ylabel("accuracy")
    plt.xlabel("epochs")
    plt.legend()
    # Plot loss function
    plt.subplot(222)
    plt.plot(history.history['loss'],'bo--', label = "loss")
    plt.plot(history.history['val_loss'], 'ro--', label = "val_loss")
    plt.title("train_loss vs val_loss")
    plt.ylabel("loss")
    plt.xlabel("epochs")
    plt.legend()
    plt.show()

Load Data

(train_images, train_labels), (test_images, test_labels) = load_data()
train_images, train_labels = shuffle(train_images, train_labels, random_state=25)

Print the size of the training and test datasets

n_train = train_labels.shape[0]
n_test = test_labels.shape[0]

print ("Number of training examples: {}".format(n_train))
print ("Number of testing examples: {}".format(n_test))
print ("Each image is of size: {}".format(IMAGE_SIZE))

Plot Training & Test data

import pandas as pd

_, train_counts = np.unique(train_labels, return_counts=True)
_, test_counts = np.unique(test_labels, return_counts=True)
pd.DataFrame({'train': train_counts,
                    'test': test_counts}, 
             index=class_names
            ).plot.bar()
plt.show()

Display Random image

display_random_image(class_names, train_images, train_labels)

Define CNN Model

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation = 'relu', input_shape = (150, 150, 3)), 
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(32, (3, 3), activation = 'relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(6, activation=tf.nn.softmax)
])

The CNN architecture above consists of 7 layers. The first 4 alternate convolution and pooling. They are followed by a Flatten layer and 2 fully connected (Dense) layers, the last of which outputs a softmax probability for each of the 6 classes.

Activation function used: ReLU (Rectified Linear Unit), defined mathematically as ReLU(x) = max(0, x). The output is the larger of zero and the input x.

Convolution filters: 32 filters of size 3×3 in each convolutional layer.
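As a rough check on the architecture, the sketch below works out the feature-map sizes and parameter counts by hand. It assumes the Keras defaults that the model definition leaves implicit (no padding and stride 1 for Conv2D, non-overlapping 2×2 pooling); the helper function names are made up for this example.

```python
def conv_output(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_output(size, pool=2):
    """Spatial size after non-overlapping pooling (rounds down)."""
    return size // pool

def conv_params(in_channels, filters, kernel=3):
    """Each filter has kernel*kernel*in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

size = 150
size = conv_output(size)    # 148 after the first Conv2D
size = pool_output(size)    # 74 after the first MaxPooling2D
size = conv_output(size)    # 72 after the second Conv2D
size = pool_output(size)    # 36 after the second MaxPooling2D
flat = size * size * 32     # 41472 values after Flatten

total_params = (
    conv_params(3, 32)          # first Conv2D on RGB input
    + conv_params(32, 32)       # second Conv2D on 32 feature maps
    + dense_params(flat, 128)   # Dense(128)
    + dense_params(128, 6)      # Dense(6) output layer
)
print(flat, total_params)
```

Almost all of the parameters sit in the first Dense layer, because it connects every one of the 41,472 flattened values to 128 units. Calling `model.summary()` on the actual model is the way to confirm these numbers.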

model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_images, train_labels, batch_size=128, epochs=20, validation_split = 0.2)

Plot Accuracy-Loss

plot_accuracy_loss(history)

Save the Model

model.save('CNN-image-classification/models/correctImageClassifier')

Predict images using the model

predictions = model.predict(test_images)     # Vector of probabilities
pred_labels = np.argmax(predictions, axis = 1) # We take the highest probability

display_random_image(class_names, test_images, pred_labels)
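Looking at one random image at a time says little about overall quality. A possible next step is to compare the predicted labels with the true test labels. The sketch below computes accuracy and a confusion matrix in NumPy; the label arrays are made-up values for illustration, not results from this notebook, and `confusion_counts` is a hypothetical helper. The `confusion_matrix` function imported from scikit-learn earlier produces the same kind of table.

```python
import numpy as np

class_names = ['mountain', 'street', 'glacier', 'buildings', 'sea', 'forest']

def confusion_counts(true_labels, pred_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 at every (true, predicted) pair, counting repeats correctly
    np.add.at(matrix, (true_labels, pred_labels), 1)
    return matrix

# Hypothetical labels for illustration only.
true_labels = np.array([0, 0, 3, 3, 5, 5, 4])
pred_labels = np.array([0, 2, 3, 3, 5, 3, 4])

matrix = confusion_counts(true_labels, pred_labels, len(class_names))
accuracy = float(np.mean(true_labels == pred_labels))

print("accuracy:", round(accuracy, 3))
print(matrix)
```

In the notebook, `test_labels` and the `pred_labels` computed above would take the place of the two toy arrays.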

Let’s Corrupt the training data

Rename the forest directory to buildings and the buildings directory to forest.

Repeat the steps above: reload the data, rebuild the same model, and save it under a different name.
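The rename can be done by hand in a file browser. If you prefer to script it, the sketch below swaps two folder names through a temporary name. The function `swap_class_dirs` and the temporary name `_swap_tmp` are assumptions for this example, and it is demonstrated on a throwaway directory rather than on the real dataset.

```python
import os
import tempfile

def swap_class_dirs(root, first, second):
    """Swap the names of two class folders under root via a temporary name."""
    first_path = os.path.join(root, first)
    second_path = os.path.join(root, second)
    temp_path = os.path.join(root, "_swap_tmp")   # hypothetical temporary name
    os.rename(first_path, temp_path)
    os.rename(second_path, first_path)
    os.rename(temp_path, second_path)

# Demonstration on a throwaway directory with one marker file per class.
with tempfile.TemporaryDirectory() as root:
    for name in ("forest", "buildings"):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, name + ".txt"), "w") as f:
            f.write(name)

    swap_class_dirs(root, "forest", "buildings")

    forest_contents = os.listdir(os.path.join(root, "forest"))
    buildings_contents = os.listdir(os.path.join(root, "buildings"))

print(forest_contents, buildings_contents)
```

After the swap, the folder named forest holds the building files and vice versa, so any loader that derives labels from folder names will assign the wrong label to every image in those two classes.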

model.save('CNN-image-classification/models/incorrectImageClassifier')

Now predict an image of a forest using this incorrect model, which was trained with buildings labeled as forest and forest labeled as buildings.

img_path = "./CNN-image-classification/input/seg_pred/22.jpg"
image = cv2.imread(img_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, IMAGE_SIZE) 
plt.figure()
plt.imshow(image)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.title('Image')
plt.show()

t_images = []
t_images.append(image)
t_images = np.array(t_images, dtype = 'float32')
t_predictions = model.predict(t_images)
pred_labels = np.argmax(t_predictions, axis = 1)
class_names[pred_labels[0]]

The model classifies the forest as buildings because the training data we gave the CNN was mislabeled: we provided building images as forest and forest images as buildings.

This shows how important valid input data is. Otherwise, the supervised and semi-supervised ML models we rely on will not hold up against real-world complexity and challenges.
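The same effect can be reproduced in miniature without the image dataset. The sketch below is not a CNN: it swaps in a nearest-centroid classifier on synthetic 2-D points, which is an assumption made purely to keep the example small. The point clusters, names, and sample are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated toy clusters standing in for image features.
forest_points = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
building_points = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
features = np.vstack([forest_points, building_points])

names = ['forest', 'buildings']
correct_labels = np.array([0] * 50 + [1] * 50)   # 0 = forest, 1 = buildings
swapped_labels = 1 - correct_labels              # the corrupted labelling

def fit_centroids(features, labels):
    """'Training': store the mean point of each labelled class."""
    return np.stack([features[labels == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, points):
    """Assign each point the label of its nearest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

forest_sample = np.array([[0.1, -0.2]])   # clearly a 'forest' point

good = predict(fit_centroids(features, correct_labels), forest_sample)[0]
bad = predict(fit_centroids(features, swapped_labels), forest_sample)[0]

print("trained on correct labels:", names[good])
print("trained on swapped labels:", names[bad])
```

Both models separate the two clusters equally well. The one trained on swapped labels simply attaches the wrong name to each cluster, which is what happens to the CNN here.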

Thanks for reading, and happy learning. Follow me: Ritresh Girdhar

The next article will explain how to build a QA chatbot using Meta's bAbI project and an LSTM RNN.

References

François Chollet, Deep Learning with Python, Second Edition.