NIELIT Ropar

Deep Learning Techniques · DOAI250006

// Practical — 10

Image Data Comparison using Autoencoder
Neural Network on CIFAR-10

Aim

To design a Convolutional Autoencoder neural network to compress and reconstruct images from the CIFAR-10 dataset, and to visually compare the original input images against the model's reconstructed outputs.

Prerequisites

Python Programming
TensorFlow / Keras
Unsupervised Learning
Autoencoder Architecture
Convolutional Layers
CIFAR-10 Dataset

Theory

An Autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Unlike classification models that require human-labeled targets, autoencoders use the input data itself as the target.

The architecture consists of two distinct parts that work sequentially:

  • The Encoder: This part compresses the input data into a lower-dimensional latent-space representation (often called a bottleneck). In Convolutional Autoencoders, this is achieved using Conv2D and MaxPooling2D layers, which gradually decrease the spatial resolution of the image while capturing high-level feature maps.
  • The Decoder: This part aims to reconstruct the original input from the compressed latent-space representation. It mirrors the encoder, utilizing Conv2D and UpSampling2D (or Transposed Convolution) layers to progressively expand the spatial dimensions back to the original input shape.

In this practical, we use the CIFAR-10 dataset, which contains 60,000 32x32 color images. The objective is to evaluate how well the autoencoder can "squeeze" each 32x32x3 image (3,072 values) down to an 8x8 spatial bottleneck and reliably inflate it back into a recognizable color image without extreme loss of detail.
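The compression described above is easy to verify with quick arithmetic, assuming the 16-filter bottleneck used in Snippet 1 below (the ratio changes if you pick different filter counts):

```python
# One CIFAR-10 image: 32x32 pixels, 3 color channels
input_values = 32 * 32 * 3           # 3072 values

# Bottleneck: 8x8 feature maps with 16 filters
bottleneck_values = 8 * 8 * 16       # 1024 values

compression_ratio = input_values / bottleneck_values
print(input_values, bottleneck_values, compression_ratio)  # 3072 1024 3.0
```

So the network must represent each image with a third of its original numbers, which is why reconstructions come back slightly blurred.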

Algorithm / Step-by-Step

  1. Import tensorflow, keras.layers, keras.models, numpy, and matplotlib.pyplot.
  2. Load the CIFAR-10 dataset. Discard the y_train and y_test labels since this is an unsupervised learning task.
  3. Normalize the pixel values of the training and testing sets to be in the range [0.0, 1.0] by dividing by 255.0.
  4. Define a Sequential model.
  5. Construct the Encoder: Add Conv2D layers with ReLU activation followed by MaxPooling2D to downsample the images from 32x32 to 16x16, and then to a bottleneck of 8x8.
  6. Construct the Decoder: Add Conv2D layers with ReLU activation followed by UpSampling2D to upscale the bottleneck from 8x8 to 16x16, and finally back to 32x32.
  7. Add a final Conv2D layer with 3 filters and a sigmoid activation function to generate the final 3-channel (RGB) image bounded between 0 and 1.
  8. Compile the autoencoder using the adam optimizer and mse (Mean Squared Error) loss function.
  9. Train the model using model.fit(), passing x_train as both the input (x) and target (y) parameters. Use the test set for validation.
  10. Generate reconstructed images by calling model.predict() on a subset of the test data.
  11. Plot the original input images alongside their corresponding reconstructed outputs using Matplotlib to visually compare the data retention.
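Steps 2–3 (loading and normalization) can be sketched as follows. Here a small random uint8 batch stands in for the CIFAR-10 arrays so the snippet runs without downloading the dataset; in the actual practical you would load the real data with `tf.keras.datasets.cifar10.load_data()` as described in step 2.

```python
import numpy as np

# Stand-in for: (x_train, _), (x_test, _) = tf.keras.datasets.cifar10.load_data()
# CIFAR-10 images are uint8 arrays of shape (N, 32, 32, 3) with values 0-255.
x_train = np.random.randint(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

# Step 2: the labels are discarded -- only the images themselves are needed.
# Step 3: scale pixel values into the range [0.0, 1.0].
x_train = x_train.astype("float32") / 255.0

print(x_train.shape, x_train.dtype, x_train.min() >= 0.0, x_train.max() <= 1.0)
```

The same normalization is applied to x_test so that training and validation data share the same scale.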

Key Code Concepts

Snippet 1 — Autoencoder Architecture (Encoder + Decoder)

from tensorflow.keras import layers, models

autoencoder = models.Sequential([
    # ENCODER
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2), padding='same'), # Compresses to 16x16
    layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2), padding='same'), # Compresses to 8x8 (Bottleneck)

    # DECODER
    layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
    layers.UpSampling2D((2, 2)),                 # Decompresses to 16x16
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.UpSampling2D((2, 2)),                 # Decompresses to 32x32
    layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same') # 3 channels for RGB
])

This symmetric structure forces the network to learn a compressed representation. The MaxPooling2D layers aggressively halve the resolution, creating the bottleneck. The UpSampling2D layers reverse this by duplicating rows and columns. The final sigmoid activation ensures pixel outputs map cleanly to the [0, 1] range we scaled our data into.
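The row-and-column duplication performed by UpSampling2D (with its default nearest-neighbour interpolation) can be imitated in plain NumPy, which makes the behaviour easy to inspect without building a model:

```python
import numpy as np

# A tiny 2x2 single-channel "feature map".
fm = np.array([[1, 2],
               [3, 4]])

# UpSampling2D((2, 2)) in 'nearest' mode simply repeats
# each row twice and then each column twice.
up = np.repeat(np.repeat(fm, 2, axis=0), 2, axis=1)
print(up)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Note that no new information is created: upsampling only restores the spatial size, and the subsequent Conv2D layers must learn to fill in the detail.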

Snippet 2 — Compiling and Self-Supervised Training

autoencoder.compile(optimizer='adam', loss='mse')

# Notice x_train is passed for both input AND target
history = autoencoder.fit(x_train, x_train,
                          epochs=10,
                          batch_size=128,
                          validation_data=(x_test, x_test),
                          verbose=1)

Because the network is attempting to recreate the input, we do not use y_train labels. By passing x_train, x_train, we instruct the model to calculate its Mean Squared Error (MSE) loss based on the pixel-by-pixel difference between the original image and the reconstructed image.
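Snippet 3 — Visual Comparison of Originals and Reconstructions

Steps 10–11 can be sketched as below. Random arrays stand in for x_test and for the output of autoencoder.predict() so the snippet runs on its own; in the practical you would substitute the real test images and the model's reconstructions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt

n = 5                                        # number of images to compare
# Stand-ins: replace with x_test[:n] and autoencoder.predict(x_test[:n])
originals = np.random.rand(n, 32, 32, 3)
reconstructed = np.random.rand(n, 32, 32, 3)

# Two rows: originals on top, reconstructions underneath.
fig, axes = plt.subplots(2, n, figsize=(2 * n, 4))
for i in range(n):
    axes[0, i].imshow(originals[i])
    axes[0, i].axis("off")
    axes[1, i].imshow(reconstructed[i])
    axes[1, i].axis("off")
axes[0, 0].set_title("Original", loc="left")
axes[1, 0].set_title("Reconstructed", loc="left")
fig.savefig("comparison.png")
```

Laying the two rows out in the same figure makes the loss of fine detail at the bottleneck immediately visible.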

Expected Output

Model Summary: A table tracing the dimensions down from (32, 32, 3) to a bottleneck of (8, 8, 16), and back up to (32, 32, 3).

Training Logs: Epoch printouts showing decreasing Mean Squared Error (MSE) loss, indicating the model is getting better at reconstructing images.

Visual Comparison Plot: A generated figure with two rows of images. The top row will display crisp, original images from the CIFAR-10 test set. The bottom row will show the model's reconstructions of those exact images. The reconstructions will look similar in shape and color but slightly blurrier, reflecting the information lost during the compression bottleneck.

Viva Questions & Answers

Q1. What is an autoencoder and what is its primary purpose?
An autoencoder is an unsupervised neural network that learns to compress data into a lower-dimensional latent representation (the bottleneck) and then reconstruct it back to its original form. Its primary purposes include dimensionality reduction, feature extraction, and image denoising.
Q2. Why do we pass x_train as both the input and target variable during model training?
Because autoencoders are a form of self-supervised learning. The goal of the network is to minimize the reconstruction error between the original input image and the image outputted by the decoder. Therefore, the input image itself serves as the perfect "ground truth" label.
Q3. Explain the distinct roles of the Encoder and the Decoder.
The Encoder compresses the high-dimensional input into a low-dimensional latent space to extract the most important, compressed features. The Decoder takes this compressed representation and attempts to upscale and reconstruct the original input as accurately as possible.
Q4. Why use a 'sigmoid' activation in the final layer of the decoder?
Before training, the input image pixels were normalized to fall between 0.0 and 1.0. The sigmoid mathematical function squashes arbitrary real values strictly into the [0, 1] range, ensuring that the reconstructed output directly matches the expected scale of normalized image pixels.
Q5. How does MaxPooling2D differ from UpSampling2D in this architecture?
MaxPooling2D downsamples the spatial dimensions (e.g., halving the size from 32x32 to 16x16) by taking the maximum value over a spatial window, forcing compression. Conversely, UpSampling2D reverses this by repeating rows and columns of data, increasing the spatial dimensions back towards their original size.