NIELIT Ropar

Deep Learning Techniques · DOAI250006

// Practical — 02

Classification of
MNIST Handwritten Digits

Aim

To build and train a fully connected Artificial Neural Network (ANN) using Keras to classify handwritten digits from the MNIST dataset, achieving high accuracy on the test set.

Prerequisites

Neural Networks Basics
Activation Functions
Softmax & Cross-Entropy
Keras Sequential API
MNIST Dataset (P01)
Backpropagation

Theory

A fully connected (Dense) Neural Network is the foundational architecture in deep learning. Each neuron in a layer connects to every neuron in the next layer. The network learns by adjusting weights and biases through backpropagation to minimize a loss function. For MNIST, the task is 10-class classification of digit images.

The architecture typically consists of: (1) an input layer accepting a 784-dimensional flattened vector, (2) one or more hidden layers with ReLU activation, and (3) an output layer with 10 neurons and softmax activation. The ReLU (Rectified Linear Unit) activation f(x) = max(0, x) introduces non-linearity and mitigates vanishing gradient issues compared to sigmoid/tanh.
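The layer sizes above also fix the network's parameter count, which is worth checking against `model.summary()`. A quick sanity check for the 784 → 128 → 64 → 10 architecture described here (each Dense layer has in × out weights plus out biases):

```python
# Parameter count for a Dense layer: (inputs x units) weights + units biases.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

p1 = dense_params(784, 128)  # hidden layer 1
p2 = dense_params(128, 64)   # hidden layer 2
p3 = dense_params(64, 10)    # output layer
total = p1 + p2 + p3

print(p1, p2, p3, total)  # 100480 8256 650 109386
```

The Flatten layer contributes nothing to this total, since it has no learnable parameters.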

The output layer uses softmax, defined as softmax(z_i) = exp(z_i) / Σ exp(z_j), which converts raw logits into a probability distribution summing to 1 over all 10 classes. The predicted class is the index with the highest probability. The categorical cross-entropy loss L = -Σ y_i * log(p_i) quantifies the difference between predicted probabilities and one-hot encoded true labels.
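Both formulas can be checked numerically. A minimal sketch with made-up logits for a 3-class case (the 10-class case is identical):

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up raw scores from the last Dense layer
probs = softmax(logits)

# Categorical cross-entropy against the one-hot label [1, 0, 0]:
y_true = [1, 0, 0]
loss = -sum(y * math.log(p) for y, p in zip(y_true, probs))
print([round(p, 3) for p in probs], round(loss, 3))
```

Note that the probabilities sum to 1 and the loss is small when the true class already has high probability, exactly as the formulas predict.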

Training uses the Adam optimizer, an adaptive learning rate method that maintains per-parameter learning rates based on estimates of first and second moments of gradients. Adam combines the benefits of RMSProp and momentum, making it robust and fast to converge. Training proceeds in epochs, where one epoch represents a full pass over the training set in mini-batches.
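The update rule behind Adam can be sketched for a single parameter. This is an illustrative pure-Python version of the standard formulas with the Keras defaults (learning rate 0.001, β₁ = 0.9, β₂ = 0.999), not the framework implementation:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying moment estimates of the gradient.
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# One step from theta = 0 with gradient 1.0: the update magnitude is ~lr.
theta, m, v = adam_step(0.0, 1.0, m=0.0, v=0.0, t=1)
print(theta)
```

After bias correction the very first step moves the parameter by roughly the learning rate regardless of the gradient's scale, which is part of why Adam is robust to tuning.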

Key metrics include accuracy (fraction of correct predictions) and loss (cross-entropy value). The model is evaluated on the held-out test set to estimate generalization. Plotting training vs. validation curves reveals overfitting (gap between train and val accuracy).

Algorithm / Step-by-Step

  1. Load and normalize MNIST data (divide pixel values by 255.0).
  2. Keep images as (N, 28, 28); the model's Flatten layer reshapes them to (N, 784). (Alternatively, flatten manually and omit the Flatten layer.)
  3. Keep labels as integers 0–9 for sparse_categorical_crossentropy. (One-hot encoding with keras.utils.to_categorical(y, 10) is the equivalent alternative when using categorical_crossentropy.)
  4. Build the Sequential model: Flatten → Dense(128, relu) → Dense(64, relu) → Dense(10, softmax).
  5. Compile with optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'].
  6. Train using model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1).
  7. Evaluate on the test set with model.evaluate(x_test, y_test).
  8. Plot training/validation accuracy and loss curves over epochs.
  9. Display a confusion matrix of test predictions vs. true labels.
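Step 1's preprocessing can be sketched on a small stand-in array. (The real code would call `keras.datasets.mnist.load_data()`, which downloads the dataset; a random uint8 batch of the same shape is used here so the sketch runs offline.)

```python
import numpy as np

# Real loading, commented out to keep the sketch offline:
# (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.random.randint(0, 256, size=(1000, 28, 28), dtype=np.uint8)

# Step 1: scale pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0

# Step 2 (optional when the model starts with a Flatten layer):
x_flat = x_train.reshape(len(x_train), -1)
print(x_train.shape, x_flat.shape)  # (1000, 28, 28) (1000, 784)
```

Normalization matters: unscaled 0–255 inputs produce large activations and gradients, which slows or destabilizes training.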

Key Code Concepts

Snippet 1 — Building the Model

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),       # 784 inputs
    layers.Dense(128, activation='relu'),        # hidden layer 1
    layers.Dense(64,  activation='relu'),        # hidden layer 2
    layers.Dense(10,  activation='softmax')     # output layer
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

Using sparse_categorical_crossentropy avoids the need for one-hot encoding — Keras handles integer labels directly. The Flatten layer converts each 28×28 image to a flat 784-element vector automatically.

Snippet 2 — Training and Evaluation

history = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=128,
    validation_split=0.1,
    verbose=1
)

test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc:.4f}")

The history object stores per-epoch metrics (loss, accuracy, val_loss, val_accuracy), which can be plotted to diagnose overfitting or underfitting.
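The curves in step 8 come straight from `history.history`. The sketch below uses placeholder per-epoch values in place of a real training run; the dictionary keys match what `model.fit` records when `metrics=['accuracy']` and `validation_split` are used.

```python
import matplotlib
matplotlib.use("Agg")          # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Placeholder values standing in for history.history after model.fit(...).
hist = {
    "accuracy":     [0.91, 0.95, 0.96, 0.97, 0.975],
    "val_accuracy": [0.93, 0.95, 0.96, 0.965, 0.968],
    "loss":         [0.30, 0.15, 0.11, 0.09, 0.08],
    "val_loss":     [0.22, 0.15, 0.13, 0.12, 0.11],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(hist["accuracy"], label="train")
ax1.plot(hist["val_accuracy"], label="validation")
ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()
ax2.plot(hist["loss"], label="train")
ax2.plot(hist["val_loss"], label="validation")
ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()
fig.savefig("training_curves.png")
```

With a real `history`, replace `hist` by `history.history`; a widening gap between the two accuracy curves is the visual signature of overfitting.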

Snippet 3 — Confusion Matrix

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Predicted class = argmax over the 10 softmax probabilities per image.
y_pred = np.argmax(model.predict(x_test), axis=1)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

The confusion matrix shows which digits are misclassified most often — e.g., digits 4 and 9 are commonly confused due to their visual similarity.

Expected Output

Training Progress: Per-epoch output showing loss decreasing and accuracy increasing. Expect ~97–98% training accuracy after 10 epochs.

Test Accuracy: Typically ~97–98% on the MNIST test set with a 2-layer Dense network.

Loss/Accuracy Curves: Smooth decreasing loss curves and increasing accuracy for both train and validation sets, with minimal gap indicating good generalization.

Confusion Matrix: A 10×10 heatmap showing most predictions on the diagonal (correct), with minor off-diagonal errors mostly among visually similar digits (3/8, 4/9).

Viva Questions & Answers

Q1. What is the role of the Flatten layer in the MNIST model?
The Flatten layer reshapes the 2D input (28×28) into a 1D vector of 784 elements. Dense layers expect 1D input, so this transformation is necessary. It has no learnable parameters — it simply reorganizes data in memory.
Q2. Why is softmax used in the output layer for MNIST classification?
Softmax converts the raw output scores (logits) from the last Dense layer into a probability distribution over 10 classes. Each output value is in (0,1) and all outputs sum to 1.0, making them interpretable as class probabilities. The predicted class is argmax of this distribution.
Q3. What is the difference between sparse_categorical_crossentropy and categorical_crossentropy?
Both compute the same mathematical loss, but differ in how labels are provided. sparse_categorical_crossentropy expects integer class labels (e.g., 5), while categorical_crossentropy expects one-hot encoded labels (e.g., [0,0,0,0,0,1,0,0,0,0]). Use sparse for integer labels to save memory and preprocessing steps.
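The equivalence in Q3 is easy to verify by hand. A sketch in pure Python, assuming a made-up predicted distribution `p` and true class 5:

```python
import math

# Made-up softmax output over 10 digit classes (sums to 1).
p = [0.01, 0.01, 0.02, 0.02, 0.04, 0.80, 0.03, 0.03, 0.02, 0.02]
label = 5                                                   # integer label, as sparse_* expects
one_hot = [1.0 if i == label else 0.0 for i in range(10)]   # as categorical_* expects

sparse_loss = -math.log(p[label])
categorical_loss = -sum(y * math.log(q) for y, q in zip(one_hot, p))
print(round(sparse_loss, 4), round(categorical_loss, 4))  # identical values
```

The one-hot vector zeroes out every term except the true class, so both forms reduce to -log of the probability assigned to the correct digit.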
Q4. What does the validation_split parameter do during training?
validation_split=0.1 reserves 10% of the training data as a validation set that is not used for weight updates. After each epoch, the model evaluates on this set to track generalization. A growing gap between training and validation accuracy indicates overfitting.
Q5. How does the Adam optimizer differ from standard SGD?
Standard SGD uses a single global learning rate for all parameters. Adam maintains individual adaptive learning rates per parameter, using exponentially decaying averages of past gradients (first moment) and squared gradients (second moment). This makes Adam faster to converge and less sensitive to hyperparameter tuning, especially for problems with sparse gradients.