Aim
To design, train, and evaluate a Convolutional Neural Network (CNN) with exactly two convolutional layers for image classification. In this practical, we use the color images of the CIFAR-10 dataset, extracting spatial feature hierarchies for robust multiclass prediction.
Prerequisites
Theory
A Convolutional Neural Network (CNN) is a deep learning architecture explicitly designed for grid-like data like images. Unlike Dense (fully connected) networks that flatten images into 1D vectors and destroy spatial awareness, CNNs utilize 2D filter matrices (kernels) that slide over the image to detect local features like edges, curves, or textures. This mechanism ensures translation equivariance — a feature learned in the top-left corner can be recognized if it appears in the bottom-right corner.
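The translation-equivariance property can be demonstrated with a naive NumPy convolution. This is an illustrative sketch (the `conv2d_valid` helper and the blob images are invented for the demo, not part of the practical): the same bright patch placed at two different positions produces the same peak response, just shifted.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid'-padding 2D cross-correlation, as performed by a Conv2D filter."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The same bright 2x2 blob, placed at two different positions
img_a = np.zeros((8, 8)); img_a[1:3, 1:3] = 1.0
img_b = np.zeros((8, 8)); img_b[4:6, 4:6] = 1.0   # identical blob, shifted by (3, 3)

detector = np.ones((2, 2))   # a filter that responds maximally to the blob
resp_a = conv2d_valid(img_a, detector)
resp_b = conv2d_valid(img_b, detector)

# Translation equivariance: the peak response moves with the input pattern
print(np.unravel_index(resp_a.argmax(), resp_a.shape))   # (1, 1)
print(np.unravel_index(resp_b.argmax(), resp_b.shape))   # (4, 4)
```

The filter's weights are shared across all positions, which is why a feature learned in one corner is detected anywhere else it appears.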
In this practical, we utilize the CIFAR-10 dataset, which presents a far more complex challenge than handwritten digits (MNIST). CIFAR-10 contains 60,000 color images representing 10 classes (airplanes, automobiles, birds, etc.). Because they are color images, the input shape is 3D: 32×32×3 (Height × Width × RGB Channels).
The architecture pairs each Convolutional layer with a MaxPooling2D layer. The Conv2D layer uses filters to extract high-dimensional feature maps, while the MaxPooling2D layer shrinks the spatial dimensions (downsampling) by picking the maximum value in a small 2x2 grid. This drastically reduces the computational load and introduces spatial invariance to minor shifts.
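The 2x2 max-pooling operation can be traced by hand on a tiny feature map. A minimal sketch (the `max_pool_2x2` helper is illustrative; in practice `layers.MaxPooling2D` does this for every channel):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single 2D feature map."""
    h, w = x.shape
    # Group the map into non-overlapping 2x2 blocks and keep each block's maximum
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 7, 2],
                 [3, 2, 4, 6]])
print(max_pool_2x2(fmap))
# [[4 5]
#  [3 7]]
```

Each 2x2 block collapses to one number, halving both spatial dimensions while keeping the strongest activation, which is what makes the output insensitive to one-pixel shifts.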
Tracing the data shape through our network:
1. Input: (32, 32, 3)
2. Conv2D (32 filters, 3x3): (30, 30, 32) (Dimensions shrink slightly because we aren't using padding).
3. MaxPool2D (2x2): (15, 15, 32)
4. Conv2D (64 filters, 3x3): (13, 13, 64)
5. MaxPool2D (2x2): (6, 6, 64)
6. Flatten: 6×6×64 = 2304 features.
7. Dense (64) → Dense (10, Softmax).
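The shape trace above follows two simple rules, which can be verified in a few lines (a sketch; the helper names are ours): an unpadded 3x3 convolution shrinks each spatial dimension by 2, and a 2x2 pool halves it (rounding down).

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid'-padding, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of a 2x2 max pool with stride 2 (floor division)."""
    return size // pool

h = 32
h = conv_out(h)   # 30  after Conv2D (32 filters, 3x3)
h = pool_out(h)   # 15  after MaxPool2D (2x2)
h = conv_out(h)   # 13  after Conv2D (64 filters, 3x3)
h = pool_out(h)   # 6   after MaxPool2D (2x2)

flat = h * h * 64
print(h, flat)    # 6 2304
```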
The final Dense layer uses a Softmax activation to convert the 10 final output values into a normalized probability distribution, where the highest probability dictates the model's prediction. The network is trained by minimizing the Sparse Categorical Cross-Entropy loss using the Adam optimizer.
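The Softmax computation itself is short enough to sketch in NumPy (the logits here are made-up values for three classes, purely for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs of the final layer
probs = softmax(logits)

print(round(probs.sum(), 6))   # 1.0 -- a valid probability distribution
print(probs.argmax())          # 0   -- the predicted class index
```

The exponentials make all outputs positive and the division normalizes them to sum to 1, so the largest logit always maps to the largest probability.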
Architecture summary: Input (32×32×3) → Conv2D (32 filters, 3×3, ReLU) → MaxPooling2D (2×2) → Conv2D (64 filters, 3×3, ReLU) → MaxPooling2D (2×2) → Flatten → Dense (64, ReLU) → Dense (10, Softmax).
Algorithm / Step-by-Step
- Import the `tensorflow.keras` libraries (`datasets`, `layers`, `models`).
- Load the CIFAR-10 dataset using `datasets.cifar10.load_data()`.
- Normalize the pixel values from the range [0, 255] to [0.0, 1.0] by dividing the arrays by 255.0.
- Initialize a new `Sequential()` model.
- Add the first convolutional block: `Conv2D` with 32 filters, a 3×3 kernel, ReLU activation, and `input_shape=(32, 32, 3)`, followed by `MaxPooling2D` with a 2×2 pool size.
- Add the second convolutional block: `Conv2D` with 64 filters, a 3×3 kernel, and ReLU activation, followed by `MaxPooling2D` with a 2×2 pool size.
- Add the classification head: `Flatten()` to unroll the 3D feature maps into a 1D vector, a `Dense` layer with 64 units and ReLU activation, and a `Dense` output layer with 10 units (one per class) and Softmax activation.
- Compile the model with the `'adam'` optimizer, the `'sparse_categorical_crossentropy'` loss function, and the `'accuracy'` metric.
- Fit the model with `model.fit()` on the training data for 10 epochs, passing the test set as validation data.
- Evaluate and plot the learning curves to analyze model performance.
Key Code Concepts
Snippet 1 — Data Loading and Preprocessing
```python
from tensorflow.keras import datasets, layers, models

# Load CIFAR-10 — returns numpy arrays
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

print(f'Training images shape: {train_images.shape}')
# Output: Training images shape: (50000, 32, 32, 3)
```
Unlike MNIST, the CIFAR-10 images are provided natively with 3 color channels (RGB), so their shape is already 4D: (samples, height, width, channels). This means no manual array reshaping with np.newaxis is required prior to feeding them into a Conv2D layer.
Snippet 2 — Building the 2-Layer CNN Architecture
```python
model = models.Sequential()

# First Convolutional Layer
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))

# Second Convolutional Layer
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

# Flatten and Dense layers for classification
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))  # 10 classes
```
Notice that the input_shape is explicitly provided only in the very first layer. TensorFlow dynamically calculates the shape of all subsequent tensors automatically. The number of filters increases (32 to 64) as the spatial dimensions decrease, enabling the model to learn more complex feature combinations.
Snippet 3 — Compiling and Training
```python
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels,
                    epochs=10,
                    validation_data=(test_images, test_labels))
```
Because the CIFAR-10 labels are provided as integers representing class indices (e.g., 0 for airplane, 1 for automobile) rather than one-hot encoded binary arrays, we use the sparse_categorical_crossentropy loss function. model.fit() returns a History object whose .history attribute is a dictionary of per-epoch metrics, which is useful for plotting.
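The sparse variant of the loss can be sketched for a single sample to make the "integer label" point concrete (the probabilities below are invented for illustration; Keras averages this quantity over the batch):

```python
import numpy as np

def sparse_categorical_crossentropy(probs, label):
    """Per-sample loss: negative log-probability assigned to the true class index."""
    return -np.log(probs[label])

probs = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output for one sample
label = 0                            # integer class index, NOT a one-hot vector

print(round(sparse_categorical_crossentropy(probs, label), 4))   # 0.3567
```

With one-hot labels you would instead use `categorical_crossentropy`; the computed value is identical, only the label format differs.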
Expected Output
model.summary(): A printed table displaying the layers, output shapes, and parameter counts. The total number of trainable parameters will be exactly 167,562.
- Conv1 Parameters: 32 filters * (3x3 * 3 channels) + 32 biases = 896
- Conv2 Parameters: 64 filters * (3x3 * 32 channels) + 64 biases = 18,496
- Dense1 Parameters: 64 units * 2304 inputs + 64 biases = 147,520
- Dense2 Parameters: 10 units * 64 inputs + 10 biases = 650
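The arithmetic behind these four counts can be checked directly (each layer has `filters × kernel_volume + biases` or `units × inputs + biases` parameters):

```python
conv1 = 32 * (3 * 3 * 3) + 32        # 3x3 kernel over 3 input channels
conv2 = 64 * (3 * 3 * 32) + 64       # 3x3 kernel over 32 input channels
dense1 = 64 * (6 * 6 * 64) + 64      # 2304 flattened inputs into 64 units
dense2 = 10 * 64 + 10                # 64 inputs into 10 output units

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, dense1, dense2)  # 896 18496 147520 650
print(total)                         # 167562
```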
Training log (10 epochs): Because CIFAR-10 contains complex, noisy real-world images, the first-epoch accuracy will be around 35-40%. By epoch 10, the training accuracy should climb to approximately 70-75%.
Accuracy plot: The plot will show training accuracy rising steadily. You may notice the validation accuracy plateauing slightly lower than the training accuracy, indicating the beginning stages of overfitting (which is normal for this simple, unregularized 2-layer network on CIFAR-10).
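The accuracy plot described above can be produced from the History object returned by `model.fit()`. A minimal sketch, assuming matplotlib is installed (the `plot_history` helper name is ours):

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot training vs. validation accuracy from a Keras History object."""
    plt.plot(history.history['accuracy'], label='training accuracy')
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()
```

Called as `plot_history(history)` after training, a widening gap between the two curves is the visual signature of the overfitting described above.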
Viva Questions & Answers
Q. How would you adapt this network to classify the MNIST dataset instead of CIFAR-10?
1. Change the loading function to `datasets.mnist.load_data()`.
2. Because MNIST images are 28×28 grayscale arrays without a color channel dimension, reshape them using `train_images = train_images[..., tf.newaxis]` (and likewise for the test images) so they become 4D tensors compatible with `Conv2D`. Note that this requires `import tensorflow as tf`.
3. Change the `input_shape` parameter in the first `Conv2D` layer from `(32, 32, 3)` to `(28, 28, 1)`.
