Aim
To construct a deep Convolutional Neural Network (CNN) for multiclass image classification on the CIFAR-10 dataset, and to quantitatively demonstrate the training performance (speed) difference between a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
Prerequisites
Theory
Training deep Convolutional Neural Networks (CNNs) is a highly computationally intensive task. It involves performing millions of matrix multiplications, convolutions, and gradient updates across many training iterations. While modern Central Processing Units (CPUs) are fast, they typically have a small number of powerful cores (e.g., 8 to 24) optimized for sequential processing and complex branching logic.
In contrast, Graphics Processing Units (GPUs) feature thousands of smaller, specialized cores designed explicitly for parallel processing. Because neural network operations (like matrix dot products) are highly parallelizable, GPUs can perform thousands of these calculations simultaneously. This architectural difference makes GPUs orders of magnitude faster for Deep Learning workloads.
TensorFlow automatically detects available GPUs and defaults to using them. However, for benchmarking purposes, developers can explicitly assign operations to specific hardware using the tf.device() context manager. By running the exact same model, data, and training loop on both '/CPU:0' and '/GPU:0', we can calculate a precise speedup multiplier.
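As a quick sanity check before benchmarking, the following minimal sketch (assuming TensorFlow 2.x) lists the detected hardware and confirms that a tensor created under tf.device() really lives on the requested processor:

import tensorflow as tf

# Show every accelerator TensorFlow can see.
print("GPU(s) detected:", tf.config.list_physical_devices('GPU'))

# Explicit placement: this tensor is created on the CPU regardless of defaults.
with tf.device('/CPU:0'):
    x = tf.random.normal((2, 2))
print(x.device)   # e.g. .../device:CPU:0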
The CNN architecture used in this practical incorporates two components that accelerate convergence and help prevent overfitting: Batch Normalization, which normalizes the activations of the previous layer over each mini-batch, and Dropout, which randomly deactivates a fraction of neurons during training.
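To make the Dropout behaviour concrete, here is a tiny illustrative sketch (not part of the practical's code) showing that a Keras Dropout layer only deactivates activations when called with training=True:

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

# During training, roughly half the activations are zeroed and the survivors
# are scaled by 1 / (1 - rate) so the expected sum stays the same.
print(drop(x, training=True))

# At inference time the layer passes values through unchanged.
print(drop(x, training=False))   # [[1. 1. 1. 1.]]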
Algorithm / Step-by-Step
- Import tensorflow, time, and matplotlib.pyplot.
- Verify the presence of a GPU using tf.config.list_physical_devices('GPU').
- Load the CIFAR-10 dataset and normalize the pixel values of both training and testing images to the [0, 1] range (see the data-loading sketch after this list).
- Define a modular function build_deep_cnn() that returns a freshly compiled CNN model. This ensures both hardware tests start with an identical, untrained architecture.
- Initialize the model to contain blocks of Conv2D, BatchNormalization, MaxPooling2D, and Dropout layers, followed by a Flatten layer and a Dense classification head.
- CPU Training: Wrap the model initialization and the model.fit() call inside a with tf.device('/CPU:0'): context. Use the time module to record the start and end times.
- GPU Training: Wrap the model initialization and the model.fit() call inside a with tf.device('/GPU:0'): context. Record the start and end times.
- Calculate the hardware speedup by dividing the elapsed CPU time by the elapsed GPU time.
- Generate a visual bar chart comparing the training times in seconds.
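A minimal sketch of the data-loading and normalization step referenced above, assuming the standard tf.keras.datasets API:

import tensorflow as tf

# CIFAR-10: 50,000 training and 10,000 test images of shape 32x32x3.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.cifar10.load_data()

# Scale pixel values from [0, 255] to [0, 1].
train_images = train_images.astype('float32') / 255.0
test_images = test_images.astype('float32') / 255.0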
Key Code Concepts
Snippet 1 — Device Context Managers for Benchmarking
import time
import tensorflow as tf

# Record start time
start_time_cpu = time.time()

# Force execution AND model initialization on CPU
with tf.device('/CPU:0'):
    model_cpu = build_deep_cnn()
    history_cpu = model_cpu.fit(train_images, train_labels,
                                epochs=3, batch_size=64, verbose=1)

cpu_time = time.time() - start_time_cpu
The with tf.device() block acts as a strict directive to TensorFlow. It overrides any automatic hardware allocation and forces the underlying C++ backend to allocate memory and execute mathematical operations strictly on the specified processor. The build_deep_cnn() function must be called inside the block so the initial weight tensors are created on the correct device.
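For completeness, here is a sketch of the matching GPU run and the speedup calculation; it mirrors the CPU block above, and the GPU-side variable names (start_time_gpu, gpu_time, model_gpu, speedup) are assumptions chosen to parallel the CPU snippet:

# Repeat the benchmark on the GPU with a fresh, untrained model.
start_time_gpu = time.time()

with tf.device('/GPU:0'):
    model_gpu = build_deep_cnn()
    history_gpu = model_gpu.fit(train_images, train_labels,
                                epochs=3, batch_size=64, verbose=1)

gpu_time = time.time() - start_time_gpu

# Speedup multiplier: elapsed CPU time divided by elapsed GPU time.
speedup = cpu_time / gpu_time
print(f"Result: The GPU was {speedup:.2f}x faster than the CPU!")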
Snippet 2 — Modular Model Builder
from tensorflow.keras import layers, models

def build_deep_cnn():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # ... additional Conv/Dense layers ...
        layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
Defining the architecture inside a function is crucial for a fair A/B test. If we simply reused the same model variable, the GPU would start training from where the CPU left off (having already learned for 3 epochs), ruining the speed and accuracy comparison.
Expected Output
Device Detection: An initial printout confirming TensorFlow sees the GPU (e.g., GPU(s) detected: ['/physical_device:GPU:0']).
Training Logs: The CPU training logs will show significantly higher processing times per step (e.g., ~340 ms/step), taking roughly 250 seconds per epoch. The GPU logs will blaze through at roughly ~18 ms/step, taking under 10 seconds per epoch.
Performance Result: A printed statement detailing the multiplier, for example:
Result: The GPU was 22.02x faster than the CPU!
Visual Comparison Plot: A side-by-side bar chart clearly illustrating the massive disparity in total training time, with a tall red bar representing the CPU and a very short green bar representing the GPU.
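A minimal sketch of that comparison plot, assuming the cpu_time and gpu_time variables recorded during the two timed runs:

import matplotlib.pyplot as plt

# Total training time per device, in seconds.
plt.bar(['CPU', 'GPU'], [cpu_time, gpu_time], color=['red', 'green'])
plt.ylabel('Training time (seconds)')
plt.title('CPU vs GPU training time (3 epochs, CIFAR-10)')
plt.show()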
