Aim
To construct a deep Convolutional Neural Network (CNN) for multiclass image classification on the CIFAR-10 dataset, and to quantitatively demonstrate the training performance (speed) difference between a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
Prerequisites
Theory
Training deep Convolutional Neural Networks (CNNs) is a highly computationally intensive task. It involves performing millions of matrix multiplications, convolutions, and gradient updates across many training iterations. While modern Central Processing Units (CPUs) are fast, they typically have a small number of powerful cores (e.g., 8 to 24) optimized for sequential processing and complex branching logic.
In contrast, Graphics Processing Units (GPUs) feature thousands of smaller, specialized cores designed explicitly for parallel processing. Because neural network operations (like matrix dot products) are highly parallelizable, GPUs can perform thousands of these calculations simultaneously. This architectural difference makes GPUs orders of magnitude faster for Deep Learning workloads.
TensorFlow automatically detects available GPUs and defaults to using them. However, for benchmarking purposes, developers can explicitly assign operations to specific hardware using the tf.device() context manager. By running the exact same model, data, and training loop on both '/CPU:0' and '/GPU:0', we can calculate a precise speedup multiplier.
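As a quick sanity check before benchmarking, the following minimal sketch (assuming TensorFlow 2.x) lists the detected hardware and confirms that a tensor created under tf.device() really lives on the requested processor:

import tensorflow as tf

# Show every accelerator TensorFlow can see.
print("GPU(s) detected:", tf.config.list_physical_devices('GPU'))

# Explicit placement: this tensor is created on the CPU regardless of defaults.
with tf.device('/CPU:0'):
    x = tf.random.normal((2, 2))
print(x.device)   # e.g. .../device:CPU:0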
The CNN architecture used in this practical incorporates two components that accelerate convergence and help prevent overfitting: Batch Normalization, which normalizes the activations of the previous layer over each mini-batch, and Dropout, which randomly deactivates a fraction of neurons during training.
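To make the Dropout behaviour concrete, here is a tiny illustrative sketch (not part of the practical's code) showing that a Keras Dropout layer only deactivates activations when called with training=True:

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

# During training, roughly half the activations are zeroed and the survivors
# are scaled by 1 / (1 - rate) so the expected sum stays the same.
print(drop(x, training=True))

# At inference time the layer passes values through unchanged.
print(drop(x, training=False))   # [[1. 1. 1. 1.]]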
Algorithm / Step-by-Step
- Import tensorflow, time, and matplotlib.pyplot.
- Verify the presence of a GPU using tf.config.list_physical_devices('GPU').
- Load the CIFAR-10 dataset and normalize the pixel values of both training and testing images to the [0, 1] range (see the data-loading sketch after this list).
- Define a modular function build_deep_cnn() that returns a freshly compiled CNN model. This ensures both hardware tests start with an identical, untrained architecture.
- Initialize the model to contain blocks of Conv2D, BatchNormalization, MaxPooling2D, and Dropout layers, followed by a Flatten layer and a Dense classification head.
- CPU Training: Wrap the model initialization and the model.fit() call inside a with tf.device('/CPU:0'): context. Use the time module to record the start and end times.
- GPU Training: Wrap the model initialization and the model.fit() call inside a with tf.device('/GPU:0'): context. Record the start and end times.
- Calculate the hardware speedup by dividing the elapsed CPU time by the elapsed GPU time.
- Generate a visual bar chart comparing the training times in seconds.
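A minimal sketch of the data-loading and normalization step referenced above, assuming the standard tf.keras.datasets API:

import tensorflow as tf

# CIFAR-10: 50,000 training and 10,000 test images of shape 32x32x3.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.cifar10.load_data()

# Scale pixel values from [0, 255] to [0, 1].
train_images = train_images.astype('float32') / 255.0
test_images = test_images.astype('float32') / 255.0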
Key Code Concepts
Snippet 1 — Device Context Managers for Benchmarking
import time
import tensorflow as tf

# Record start time
start_time_cpu = time.time()

# Force execution AND model initialization on CPU
with tf.device('/CPU:0'):
    model_cpu = build_deep_cnn()
    history_cpu = model_cpu.fit(train_images, train_labels,
                                epochs=3, batch_size=64, verbose=1)

cpu_time = time.time() - start_time_cpu
The with tf.device() block acts as a strict directive to TensorFlow. It overrides any automatic hardware allocation and forces the underlying C++ backend to allocate memory and execute mathematical operations strictly on the specified processor. The build_deep_cnn() function must be called inside the block so the initial weight tensors are created on the correct device.
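For completeness, here is a sketch of the matching GPU run and the speedup calculation; it mirrors the CPU block above, and the GPU-side variable names (start_time_gpu, gpu_time, model_gpu, speedup) are assumptions chosen to parallel the CPU snippet:

# Repeat the benchmark on the GPU with a fresh, untrained model.
start_time_gpu = time.time()

with tf.device('/GPU:0'):
    model_gpu = build_deep_cnn()
    history_gpu = model_gpu.fit(train_images, train_labels,
                                epochs=3, batch_size=64, verbose=1)

gpu_time = time.time() - start_time_gpu

# Speedup multiplier: elapsed CPU time divided by elapsed GPU time.
speedup = cpu_time / gpu_time
print(f"Result: The GPU was {speedup:.2f}x faster than the CPU!")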
Snippet 2 — Modular Model Builder
from tensorflow.keras import layers, models

def build_deep_cnn():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        # ... additional Conv/Dense layers ...
        layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
Defining the architecture inside a function is crucial for a fair A/B test. If we simply reused the same model variable, the GPU would start training from where the CPU left off (having already learned for 3 epochs), ruining the speed and accuracy comparison.
Expected Output
Device Detection: An initial printout confirming TensorFlow sees the GPU (e.g., GPU(s) detected: ['/physical_device:GPU:0']).
Training Logs: The CPU training logs will show significantly higher processing times per step (e.g., ~340 ms/step), taking roughly 250 seconds per epoch. The GPU logs will blaze through at roughly ~18 ms/step, taking under 10 seconds per epoch.
Performance Result: A printed statement detailing the multiplier, for example:
Result: The GPU was 22.02x faster than the CPU!
Visual Comparison Plot: A side-by-side bar chart clearly illustrating the massive disparity in total training time, with a tall red bar representing the CPU and a very short green bar representing the GPU.
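A minimal sketch of that comparison plot, assuming the cpu_time and gpu_time variables recorded during the two timed runs:

import matplotlib.pyplot as plt

# Total training time per device, in seconds.
plt.bar(['CPU', 'GPU'], [cpu_time, gpu_time], color=['red', 'green'])
plt.ylabel('Training time (seconds)')
plt.title('CPU vs GPU training time (3 epochs, CIFAR-10)')
plt.show()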
