NIELIT

NIELIT Ropar

Deep Learning Techniques · DOAI250006


// Practical — 04

Image Classification using
CNN with 2 Convolutional Layers (CIFAR-10)


Aim

To design, train, and evaluate a Convolutional Neural Network (CNN) with exactly two convolutional layers for classifying images. In this iteration, we focus on the color images of the CIFAR-10 dataset to extract spatial hierarchies for robust multiclass prediction.

Prerequisites

Python & NumPy
TensorFlow / Keras API
Convolution Operation
Activation Functions
Pooling Layers
Backpropagation
Softmax & Cross-Entropy
CIFAR-10 / MNIST Dataset

Theory

A Convolutional Neural Network (CNN) is a deep learning architecture designed specifically for grid-like data such as images. Unlike Dense (fully connected) networks, which flatten images into 1D vectors and discard spatial structure, CNNs slide small 2D filter matrices (kernels) over the image to detect local features such as edges, curves, or textures. Because the same kernel weights are shared across every position, this mechanism provides translation equivariance — a feature learned in the top-left corner can also be recognized when it appears in the bottom-right corner.

In this practical, we utilize the CIFAR-10 dataset, which presents a far more complex challenge than handwritten digits (MNIST). CIFAR-10 contains 60,000 color images (50,000 for training and 10,000 for testing) representing 10 classes (airplanes, automobiles, birds, etc.). Because they are color images, the input shape is 3D: 32×32×3 (Height × Width × RGB Channels).

The architecture pairs each Convolutional layer with a MaxPooling2D layer. The Conv2D layer uses filters to extract high-dimensional feature maps, while the MaxPooling2D layer shrinks the spatial dimensions (downsampling) by picking the maximum value in a small 2x2 grid. This drastically reduces the computational load and introduces spatial invariance to minor shifts.

Tracing the data shape through our network:
1. Input: (32, 32, 3)
2. Conv2D (32 filters, 3x3): (30, 30, 32) (Dimensions shrink slightly because we aren't using padding).
3. MaxPool2D (2x2): (15, 15, 32)
4. Conv2D (64 filters, 3x3): (13, 13, 64)
5. MaxPool2D (2x2): (6, 6, 64)
6. Flatten: 6×6×64 = 2304 features.
7. Dense (64) → Dense (10, Softmax).

The final Dense layer uses a Softmax activation to convert the 10 final output values into a normalized probability distribution, where the highest probability dictates the model's prediction. The network is trained by minimizing the Sparse Categorical Cross-Entropy loss using the Adam optimizer.
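The shape trace above follows directly from the standard output-size formulas for unpadded ('valid') convolution and non-overlapping pooling. A quick pure-Python sketch (not part of the original practical) confirms the numbers:

```python
# Trace feature-map sizes through the network using the standard formulas.
def conv_out(size, kernel=3, stride=1):
    # 'valid' (no padding): out = (in - kernel) // stride + 1
    return (size - kernel) // stride + 1

def pool_out(size, pool=2):
    # Non-overlapping pooling divides the dimension by the pool size (floor).
    return size // pool

h = w = 32                        # CIFAR-10 input
h, w = conv_out(h), conv_out(w)   # Conv2D 3x3  -> 30x30
h, w = pool_out(h), pool_out(w)   # MaxPool 2x2 -> 15x15
h, w = conv_out(h), conv_out(w)   # Conv2D 3x3  -> 13x13
h, w = pool_out(h), pool_out(w)   # MaxPool 2x2 -> 6x6

flat = h * w * 64                 # 6 * 6 * 64 feature-map values
print(flat)                       # 2304
```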

Architecture flow:
Input (32×32×3, CIFAR-10 image) → Conv2D (32 filters, 3×3, ReLU) → 30×30×32 → MaxPool (2×2) → 15×15×32 → Conv2D (64 filters, 3×3, ReLU) → 13×13×64 → MaxPool (2×2) → 6×6×64 → Flatten → Dense 64 (ReLU, 2304 → 64) → Dense 10 (Softmax) → Output probabilities

Algorithm / Step-by-Step

  1. Import the tensorflow.keras libraries (datasets, layers, models).
  2. Load the CIFAR-10 dataset using datasets.cifar10.load_data().
  3. Normalize the pixel values from the range [0, 255] to [0.0, 1.0] by dividing the arrays by 255.0.
  4. Initialize a new Sequential() model.
  5. Add the First Convolutional Block:
    • Conv2D with 32 filters, 3x3 kernel size, ReLU activation, and input_shape=(32, 32, 3).
    • MaxPooling2D with a 2x2 pool size.
  6. Add the Second Convolutional Block:
    • Conv2D with 64 filters, 3x3 kernel size, and ReLU activation.
    • MaxPooling2D with a 2x2 pool size.
  7. Add the Classification Head:
    • Flatten() to unroll the 3D feature maps into a 1D vector.
    • Dense layer with 64 units and ReLU activation.
    • Dense output layer with 10 units (for the 10 classes) and Softmax activation.
  8. Compile the model using the 'adam' optimizer, 'sparse_categorical_crossentropy' loss function, and track 'accuracy'.
  9. Fit the model using model.fit() on the training data for 10 epochs, passing the test set for validation.
  10. Evaluate and plot the learning curves to analyze model performance.

Key Code Concepts

Snippet 1 — Data Loading and Preprocessing

from tensorflow.keras import datasets, layers, models

# Load CIFAR-10 — returns numpy arrays
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

print(f'Training images shape: {train_images.shape}')
# Output: Training images shape: (50000, 32, 32, 3)

Unlike MNIST, the CIFAR-10 images are provided natively with 3 color channels (RGB), so their shape is already 4D: (samples, height, width, channels). This means no manual array reshaping with np.newaxis is required prior to feeding them into a Conv2D layer.

Snippet 2 — Building the 2-Layer CNN Architecture

model = models.Sequential()

# First Convolutional Layer
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))

# Second Convolutional Layer
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

# Flatten and Dense layers for classification
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax')) # 10 classes

Notice that the input_shape is explicitly provided only in the very first layer. TensorFlow dynamically calculates the shape of all subsequent tensors automatically. The number of filters increases (32 to 64) as the spatial dimensions decrease, enabling the model to learn more complex feature combinations.

Snippet 3 — Compiling and Training

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

Because the CIFAR-10 labels are provided as integers representing class indices (e.g., 0 for airplane, 1 for automobile) rather than one-hot encoded binary arrays, we use the sparse_categorical_crossentropy loss function. model.fit() returns a History object whose .history attribute is a dictionary of per-epoch metrics, which is useful for plotting.

Expected Output

model.summary(): A printed table displaying the layers, output shapes, and parameter counts. The total number of trainable parameters will be exactly 167,562.

  • Conv1 Parameters: 32 filters * (3x3 * 3 channels) + 32 biases = 896
  • Conv2 Parameters: 64 filters * (3x3 * 32 channels) + 64 biases = 18,496
  • Dense1 Parameters: 64 units * 2304 inputs + 64 biases = 147,520
  • Dense2 Parameters: 10 units * 64 inputs + 10 biases = 650
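The counts above can be sanity-checked with plain arithmetic, using params = filters × (kernel_h × kernel_w × in_channels) + filters for a Conv2D layer and units × inputs + units for a Dense layer:

```python
# Verify the per-layer parameter counts and the 167,562 total (no TensorFlow needed).
conv1  = 32 * (3 * 3 * 3) + 32     # 896
conv2  = 64 * (3 * 3 * 32) + 64    # 18,496
dense1 = 64 * 2304 + 64            # 147,520
dense2 = 10 * 64 + 10              # 650

total = conv1 + conv2 + dense1 + dense2
print(total)                       # 167562
```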

Training log (10 epochs): Because CIFAR-10 contains complex, noisy real-world images, the first-epoch accuracy will only be around 35-40%. By epoch 10, the training accuracy should climb to approximately 70-75%.

Accuracy plot: The plot will show training accuracy rising steadily. You may notice the validation accuracy plateauing slightly lower than the training accuracy, indicating the beginning stages of overfitting (which is normal for this simple, unregularized 2-layer network on CIFAR-10).

Viva Questions & Answers

Q1. What is the mathematical operation performed by a Conv2D layer?
A Conv2D layer performs a discrete 2D cross-correlation (commonly called convolution) between the input feature map and a learnable kernel. For each position in the output, the operation computes the dot product of the kernel and the corresponding local patch of the input image. The kernel slides across the entire input, producing a feature map that highlights where specific visual patterns exist.
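The sliding dot product described in this answer can be shown with a toy numpy sketch (single channel, stride 1, no padding; the image and kernel values are illustrative only):

```python
import numpy as np

img = np.array([[1, 2, 0],
                [0, 1, 3],
                [4, 1, 0]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)

# Slide the 2x2 kernel over every valid position of the 3x3 image:
# each output value is the dot product of the kernel with the local patch.
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = np.sum(img[i:i+2, j:j+2] * kernel)

print(out)
# [[ 0. -1.]
#  [-1.  1.]]
```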
Q2. Why does the first Conv2D layer use 32 filters and the second uses 64?
This is a standard deep learning design principle: deeper layers learn more abstract, compositional features and typically need more filters to represent the increased diversity of patterns. The first layer detects low-level features like colored edges, while the second layer combines these into more complex shapes like wheels or beaks. Doubling filters balances representational capacity with the computational savings gained from pooling layers.
Q3. What does MaxPooling2D(2,2) do and why is it used?
MaxPooling2D(2,2) partitions each feature map into non-overlapping 2×2 windows and takes the maximum value from each window, effectively reducing the spatial dimensions by half (e.g., from 30x30 to 15x15). It serves three purposes: (1) drastically reduces parameters and computational load for subsequent layers, (2) introduces translation invariance (small shifts in the input produce the same pooled output), and (3) acts as a light regularizer by discarding weaker activations.
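The window-maximum operation can be reproduced in a few lines of numpy for a single feature map (TensorFlow's MaxPooling2D applies this per channel and per sample; the values here are illustrative only):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 6, 3, 2],
                 [7, 1, 4, 4]], dtype=float)

# Split into non-overlapping 2x2 windows, then take the max of each window.
h, w = fmap.shape
pooled = fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(pooled)
# [[4. 5.]
#  [7. 4.]]
```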
Q4. Why is 'sparse_categorical_crossentropy' used instead of 'categorical_crossentropy'?
Both functions compute the exact same cross-entropy loss, but they differ in how they expect the target labels to be formatted. sparse_categorical_crossentropy is used when labels are provided as integer indices (e.g., label = 3). categorical_crossentropy requires labels to be explicitly one-hot encoded arrays (e.g., label = [0,0,0,1,0...]). Using the sparse version saves memory by avoiding the manual one-hot encoding step.
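That the two losses agree numerically for a single sample can be checked with numpy (illustrative softmax probabilities; the only difference is the label format):

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.6, 0.1])  # softmax output for one sample

# Sparse label: an integer class index.
sparse_label = 2
sparse_loss = -np.log(probs[sparse_label])

# One-hot label: the same class encoded as a binary vector.
one_hot = np.array([0, 0, 1, 0])
categorical_loss = -np.sum(one_hot * np.log(probs))

print(np.isclose(sparse_loss, categorical_loss))  # True
```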
Q5. How would you modify this code to run on the MNIST dataset instead of CIFAR-10?
Three simple changes are required:
1. Change the loading function to datasets.mnist.load_data().
2. Because MNIST images are 28x28 grayscale arrays without a color-channel dimension, you must reshape them using train_images = train_images[..., tf.newaxis] so they become 4D tensors compatible with Conv2D.
3. Change the input_shape parameter in the first Conv2D layer from (32, 32, 3) to (28, 28, 1).
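The reshape in step 2 above can be previewed with numpy (np.newaxis behaves the same as tf.newaxis here; the tiny 8-image batch is a stand-in for the real (60000, 28, 28) training array):

```python
import numpy as np

# Toy stand-in for the MNIST training array.
batch = np.zeros((8, 28, 28), dtype=np.float32)

# Append a trailing channel axis so Conv2D accepts the tensor.
with_channel = batch[..., np.newaxis]

print(with_channel.shape)  # (8, 28, 28, 1)
```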