NIELIT Ropar

Deep Learning Techniques · DOAI250006

// Practical — 07

ANN using Backpropagation
with Sigmoid, Tanh & ReLU Activation Functions

Aim

To implement and compare the effect of three fundamental activation functions — Sigmoid, Tanh, and ReLU — on the training dynamics and convergence behavior of an Artificial Neural Network trained via backpropagation on the MNIST dataset. By keeping all other hyperparameters (architecture, optimizer settings, loss function, initialization) strictly identical and varying only the hidden-layer activation function, this practical isolates and quantifies the impact of activation function choice on learning speed and final model accuracy.

Prerequisites

Python & NumPy
TensorFlow / Keras
MNIST Dataset
Backpropagation
Activation Functions
Vanishing Gradient Problem
SGD Optimizer
Matplotlib Visualization

Theory

An activation function is a non-linear mathematical transformation applied to the weighted sum of inputs in a neural network neuron: a = f(z) = f(W·x + b). Without activation functions, a neural network would reduce to a linear model regardless of depth, since the composition of linear functions is itself linear. The non-linearity introduced by activation functions enables neural networks to approximate arbitrary continuous functions (the universal approximation theorem) and to learn complex decision boundaries for tasks like image classification, speech recognition, and natural language processing.
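
To make the "composition of linear functions is linear" point concrete, the minimal NumPy sketch below (matrix names such as W1 and W2 are purely illustrative) shows that two stacked linear layers without an activation collapse into a single linear layer:

import numpy as np

# Two stacked *linear* layers are equivalent to one linear layer.
rng = np.random.default_rng(0)
x  = rng.normal(size=(4,))        # toy input vector
W1 = rng.normal(size=(3, 4))      # first linear layer
W2 = rng.normal(size=(2, 3))      # second linear layer

two_layer = W2 @ (W1 @ x)         # "deep" network without activations
one_layer = (W2 @ W1) @ x         # single equivalent linear map

print(np.allclose(two_layer, one_layer))   # True — no extra expressive power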

During backpropagation, the gradient of the loss with respect to each weight is computed using the chain rule. For a weight matrix in layer l, the gradient is ∂L/∂W^(l) = ∂L/∂a^(l) · f′(z^(l)) · (x^(l−1))^T. The term f′(z^(l)) — the derivative of the activation function — is critical. If this derivative is consistently small (close to zero), gradients diminish exponentially as they propagate backward through many layers, causing weights in early layers to update negligibly. This is the vanishing gradient problem, which prevents deep networks from learning effectively.
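
A quick back-of-the-envelope sketch of this effect (illustrative only — it assumes every neuron sits exactly at its maximum-derivative point): multiplying by Sigmoid's peak derivative of 0.25 at each of 10 layers shrinks a gradient by roughly a factor of a million, while ReLU's derivative of 1 leaves it untouched.

# Best-case per-layer attenuation factors (illustrative only)
sigmoid_peak_derivative = 0.25   # maximum of σ'(z), attained at z = 0
relu_derivative = 1.0            # derivative of ReLU for any z > 0

layers = 10
print(sigmoid_peak_derivative ** layers)  # ≈ 9.5e-07 — gradient nearly vanishes
print(relu_derivative ** layers)          # 1.0 — gradient preserved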

Sigmoid
σ(z) = 1 / (1 + e^(−z))
Output Range: (0, 1)
S-shaped curve. Maximum derivative is 0.25 at z=0, causing severe gradient vanishing in deep networks. Outputs are not zero-centered.
Tanh
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
Output Range: (−1, 1)
Zero-centered S-shaped curve. Max derivative is 1.0 at z=0, stronger gradients than Sigmoid. Still suffers from vanishing gradients for large |z|.
ReLU
ReLU(z) = max(0, z)
Output Range: [0, +∞)
Piecewise linear. Derivative is 1 for z>0, 0 for z<0. Eliminates vanishing gradient for positive activations. Computationally cheapest.

Sigmoid squashes any real-valued input into the range (0, 1), making it interpretable as a probability. Its derivative is σ'(z) = σ(z)(1 − σ(z)), which attains a maximum value of 0.25 at z = 0. For inputs far from zero (saturated neurons), the derivative approaches zero, causing gradients to vanish. Additionally, Sigmoid outputs are always positive (not zero-centered), which means gradients during backpropagation are either all positive or all negative, creating inefficient zigzag paths in the weight space.

Tanh (hyperbolic tangent) maps inputs to (−1, 1). Its derivative is tanh'(z) = 1 − tanh²(z), with a maximum of 1.0 at z = 0 — four times larger than Sigmoid's peak. This means Tanh propagates stronger gradients and typically converges faster. The zero-centered output also helps: both positive and negative gradients can flow, enabling more direct optimization paths. However, like Sigmoid, Tanh still saturates for large |z|, causing vanishing gradients in very deep networks.

ReLU (Rectified Linear Unit) is defined as f(z) = max(0, z). For positive inputs, it acts as a linear identity function with derivative 1 — the maximum possible. This completely eliminates the vanishing gradient problem for active (positive) neurons, allowing deep networks to train effectively. ReLU is also computationally trivial: no exponentials or divisions, just a threshold comparison. The trade-off is the dying ReLU problem: if a neuron's weights are initialized such that it always receives negative inputs, its gradient is permanently zero and the neuron never learns. Variants like Leaky ReLU address this with a small negative slope.
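
The short, self-contained NumPy sketch below (variable names are illustrative) evaluates the derivative of each activation at a few sample points, which makes the saturation behaviour described above easy to inspect directly:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

d_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))   # σ'(z), peaks at 0.25
d_tanh    = 1.0 - np.tanh(z) ** 2             # tanh'(z), peaks at 1.0
d_relu    = (z > 0).astype(float)             # ReLU'(z), exactly 1 for z > 0

print("sigmoid'(z):", np.round(d_sigmoid, 4))  # ≈ [0.0066 0.1966 0.25 0.1966 0.0066]
print("tanh'(z):   ", np.round(d_tanh, 4))     # ≈ [0.0002 0.42   1.0  0.42   0.0002]
print("relu'(z):   ", d_relu)                  # [0. 0. 0. 1. 1.]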

This practical uses Stochastic Gradient Descent (SGD) as the optimizer rather than Adam. SGD has a fixed learning rate and no momentum/RMSprop adaptations, making the optimizer behavior consistent across all three activation functions. This isolates the effect of activation choice: any differences in convergence speed and final accuracy are attributable solely to the activation function's gradient flow properties, not to optimizer-specific adaptive behaviors.
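
For reference, passing optimizer='sgd' to Keras uses plain gradient descent with its default learning rate of 0.01 and no momentum. A minimal sketch of the equivalent explicit optimizer object (the practical itself relies on the string form and the defaults):

import tensorflow as tf

# Equivalent to optimizer='sgd': plain gradient descent, fixed step size
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)
# model.compile(optimizer=sgd, loss='sparse_categorical_crossentropy',
#               metrics=['accuracy'])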

Algorithm / Step-by-Step

  1. Import tensorflow and matplotlib.pyplot.
  2. Load the MNIST dataset using tf.keras.datasets.mnist.load_data(), which returns training and test splits of 28×28 grayscale images with integer labels 0–9.
  3. Normalize pixel values from [0, 255] to [0.0, 1.0] by dividing both train and test sets by 255.0. This scaling is essential for stable gradient flow with Sigmoid and Tanh activations.
  4. Define a build_and_train_model(activation_function_name, epochs) function that:
    • Prints the activation function being tested for clear logging.
    • Builds a Sequential model with: Flatten(input_shape=(28,28)) → Dense(128, activation=activation_function_name) → Dense(10, activation='softmax').
    • Compiles with optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'].
    • Trains with model.fit() for the specified epochs, using validation_split=0.1 (10% of training data held out for validation) and verbose=0 for clean output.
    • Evaluates on the test set and prints final test accuracy.
    • Returns the history object containing per-epoch metrics.
  5. Create a list of activation functions to test: ['sigmoid', 'tanh', 'relu'].
  6. Iterate over each activation function, call build_and_train_model() with epochs=15, and store the returned history in a dictionary keyed by function name.
  7. Create a Matplotlib figure (10×6 inches) and plot the validation accuracy curve for each activation function on the same axes, with a legend, grid, and descriptive title.
  8. Analyze the plot: ReLU should show the steepest initial rise and highest final accuracy, Tanh should converge faster than Sigmoid but slower than ReLU, and Sigmoid should exhibit the slowest convergence due to its weak gradient signal.

Key Code Concepts

Snippet 1 — Data Loading and Normalization

# Load MNIST — 60,000 train + 10,000 test grayscale images
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize to [0, 1] — critical for Sigmoid/Tanh saturation
x_train, x_test = x_train / 255.0, x_test / 255.0

print(f'Training data shape: {x_train.shape}')
# Output: (60000, 28, 28)

Normalization to [0, 1] is especially important when using Sigmoid and Tanh activations. Sigmoid outputs range (0, 1), and if raw pixel values [0, 255] were fed directly, the weighted sum z = W·x + b would be extremely large, pushing the Sigmoid deep into saturation where its derivative is nearly zero. Normalization keeps z in a reasonable range, ensuring the activation operates in its high-gradient region near z = 0. ReLU is less sensitive to input scale but still benefits from normalization for stable training.
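
A small NumPy check (weight and pixel values are illustrative) of why this matters: with a raw pixel value the pre-activation z is large and the Sigmoid derivative is essentially zero, while the normalized input keeps it near its useful peak:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 0.05                          # a typical small initial weight
for x in (255.0, 1.0):            # raw pixel vs normalized pixel
    z = w * x
    grad = sigmoid(z) * (1.0 - sigmoid(z))   # σ'(z)
    print(f"x={x:6.1f}  z={z:6.2f}  sigmoid'(z)={grad:.6f}")
# Raw input:    z = 12.75, derivative ≈ 0.000003 (saturated)
# Scaled input: z =  0.05, derivative ≈ 0.25     (healthy gradient)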

Snippet 2 — Controlled Model Builder Function

def build_and_train_model(activation_name, epochs=15):
    print(f"\n--- Training with {activation_name.upper()} ---")

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        # Hidden layer: 128 neurons, injectable activation
        tf.keras.layers.Dense(128, activation=activation_name),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='sgd',                    # Fixed learning rate, no adaptivity
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    history = model.fit(
        x_train, y_train,
        epochs=epochs,
        validation_split=0.1,            # 6,000 images for validation
        verbose=0
    )

    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test Accuracy ({activation_name}): {test_acc:.4f}")

    return history

The build_and_train_model function enforces a controlled experiment: every hyperparameter except the activation function is held constant. The architecture is a simple one-hidden-layer network (Flatten → Dense 128 → Dense 10) with approximately 101,770 trainable parameters. Using SGD instead of Adam is deliberate — Adam's adaptive learning rates could mask the differences between activation functions by compensating for weak gradients. SGD's fixed learning rate makes the gradient flow properties of each activation directly observable in the learning curves.
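
The parameter count quoted above can be verified with model.summary(); a minimal sketch (building the same architecture standalone, with 'relu' as a placeholder activation) is shown below:

import tensorflow as tf

# Same architecture as build_and_train_model, built only to inspect sizes
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.summary()
# Dense(128): 784*128 + 128 = 100,480 parameters
# Dense(10):  128*10  + 10  =   1,290 parameters
# Total trainable parameters: 101,770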

Snippet 3 — Training All Three Models and Comparing

# Test all three activation functions
activation_functions = ['sigmoid', 'tanh', 'relu']
histories = {}

for func in activation_functions:
    histories[func] = build_and_train_model(func, epochs=15)

# ── Results (typical) ──
# Test Accuracy (sigmoid): ~0.9200
# Test Accuracy (tanh):    ~0.9650
# Test Accuracy (relu):    ~0.9750

The function is called three times, each with a different activation string. Keras resolves the string to the corresponding activation function internally. The histories dictionary maps each function name to its training history object, which contains per-epoch arrays for loss, accuracy, val_loss, and val_accuracy. Storing all three histories enables direct comparison on the same plot. Typical results show ReLU achieving the highest test accuracy (~97–98%), Tanh performing slightly lower (~96–97%), and Sigmoid lagging behind (~92–94%) after 15 epochs.
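
As a quick sanity check before plotting, the final-epoch validation accuracy can be read straight from each history object; a small sketch using the histories dictionary defined above:

# Final-epoch validation accuracy for each activation function
for func, hist in histories.items():
    final_val_acc = hist.history['val_accuracy'][-1]
    print(f"{func:>8}: final val_accuracy = {final_val_acc:.4f}")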

Snippet 4 — Visualization of Validation Accuracy

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

for func in activation_functions:
    plt.plot(
        histories[func].history['val_accuracy'],
        label=func.capitalize()
    )

plt.title('Validation Accuracy Comparison by Activation Function')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

The plot visualizes the convergence dynamics of each activation function. The ReLU curve typically shows the steepest initial ascent in the first 2–3 epochs, rapidly reaching high validation accuracy. The Tanh curve rises more gradually but consistently outperforms Sigmoid. The Sigmoid curve has the shallowest slope, reflecting the weak gradient signals (maximum derivative 0.25) that slow weight updates. The loc="lower right" legend placement keeps it visible, since all curves climb toward the top of the plot as training progresses.
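
An optional complementary view (same structure as the accuracy plot, reusing the histories dictionary built above) is the validation-loss curve, where Sigmoid's slow convergence shows up as a noticeably higher loss throughout training:

plt.figure(figsize=(10, 6))
for func in activation_functions:
    plt.plot(histories[func].history['val_loss'], label=func.capitalize())
plt.title('Validation Loss Comparison by Activation Function')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(loc="upper right")
plt.grid(True)
plt.show()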

Expected Output

Console Output: Three training logs print sequentially, each showing the activation function name and final test accuracy. Expected approximate values after 15 epochs of SGD:

Activation | Test Accuracy | Convergence Speed | Key Characteristic
Sigmoid    | ~0.92 – 0.94  | Slowest           | Weak gradients (max σ′ = 0.25), non-zero-centered output
Tanh       | ~0.96 – 0.97  | Moderate          | Stronger gradients (max tanh′ = 1.0), zero-centered output
ReLU       | ~0.97 – 0.98  | Fastest           | No vanishing gradient for z > 0, computationally efficient

Validation Accuracy Plot: A 10×6 inch line plot with three curves:

  • ReLU (green in Matplotlib's default color cycle, being plotted third): Steepest initial rise, reaching ~95% validation accuracy by epoch 5 and plateauing near ~97–98% by epoch 15.
  • Tanh (orange, plotted second): Moderate slope, reaching ~94% by epoch 5 and plateauing near ~96–97%. The curve is smooth and consistently above Sigmoid.
  • Sigmoid (blue, plotted first): Shallowest slope, slowly climbing to ~92–94% by epoch 15. The curve may show more fluctuation due to weaker gradient signals.

All curves should be monotonically increasing (or nearly so), with the gap between ReLU and Sigmoid widening in the early epochs and stabilizing in later epochs as all models approach their respective convergence plateaus.

Viva Questions & Answers

Q1. Why does ReLU converge faster than Sigmoid and Tanh?
ReLU converges faster because its derivative is exactly 1 for all positive inputs, whereas Sigmoid's derivative never exceeds 0.25 and Tanh's derivative, although it reaches 1.0 at z = 0, falls off quickly away from the origin. During backpropagation, the gradient flowing through a ReLU neuron is preserved at full strength (multiplied by 1) as long as the neuron is active (z > 0). In contrast, Sigmoid's gradient is always attenuated by at least a factor of 4, and for saturated neurons (large |z|), it becomes vanishingly small. This stronger gradient signal means ReLU weights update more aggressively, leading to faster learning. Additionally, ReLU involves no exponential computations, making each forward and backward pass computationally cheaper.
Q2. What is the vanishing gradient problem and which activation functions suffer from it?
The vanishing gradient problem occurs during backpropagation when gradients become progressively smaller as they flow backward through many layers, causing early-layer weights to update negligibly. Both Sigmoid and Tanh suffer from this because they saturate: for large positive or negative inputs, their derivatives approach zero. Sigmoid is worse because its maximum derivative is only 0.25, so gradients are attenuated by at least 75% at every layer. Tanh is better (max derivative = 1.0) but still saturates. ReLU avoids this for positive inputs since its derivative is exactly 1, but it has its own issue: the "dying ReLU" problem, where neurons with consistently negative inputs have zero gradient and never learn.
Q3. Why is the output of Sigmoid not zero-centered and why does this matter?
Sigmoid outputs are always in the range (0, 1), meaning they are always positive. During backpropagation, the gradient with respect to the weights of a Sigmoid layer has the same sign as the gradient from the next layer (since ∂L/∂W = ∂L/∂a · f'(z) · x, and f'(z) · x is always positive). This means all weight updates for a given neuron are either all positive or all negative, forcing the optimizer to take zigzag paths through the weight space to reach optimal values. Tanh outputs are zero-centered (−1 to 1), allowing both positive and negative gradients, which enables more direct convergence paths. This is one reason Tanh typically outperforms Sigmoid.
Q4. Why was SGD chosen as the optimizer instead of Adam for this comparison?
SGD with a fixed learning rate was chosen to isolate the effect of activation function choice. Adam is an adaptive optimizer that independently adjusts the learning rate for each parameter based on first and second moment estimates of the gradients. Adam's adaptivity could compensate for weak gradients (as with Sigmoid) by increasing effective step sizes, potentially masking the true differences between activation functions. SGD's fixed learning rate makes the gradient flow properties of each activation directly observable in the learning curves — any differences in convergence speed and final accuracy are purely attributable to the activation function's derivative behavior during backpropagation.
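
If one wants to check this claim empirically, the sketch below (an optional follow-up, not part of the prescribed procedure) reruns the Sigmoid model with Adam so the effect of adaptive learning rates on a weak-gradient activation can be observed; it assumes x_train, y_train, x_test, and y_test from Snippet 1 are in scope.

# Optional follow-up: same architecture, Adam instead of SGD, Sigmoid hidden layer
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='sigmoid'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=15, validation_split=0.1, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))   # [test_loss, test_accuracy]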
Q5. What is the dying ReLU problem and how is it addressed?
The dying ReLU problem occurs when a ReLU neuron's weights are updated such that the weighted sum z = W·x + b is consistently negative for all training inputs. Since ReLU outputs 0 for z < 0 and its derivative is also 0 in this region, the gradient flowing back to this neuron is permanently zero. The neuron becomes "dead" — it never fires and never learns. This is particularly problematic with high learning rates or poor weight initialization. Solutions include: (1) Leaky ReLU, which uses a small negative slope α (e.g., 0.01) for z < 0: f(z) = max(αz, z); (2) Parametric ReLU (PReLU), where α is learned; (3) Better weight initialization schemes like He initialization, which accounts for ReLU's sparsity; and (4) using a lower learning rate to prevent large negative weight updates.
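
A minimal Keras sketch of two of the mitigations mentioned in the answer — Leaky ReLU and He initialization. Layer sizes mirror the practical's architecture, the slope value 0.01 is illustrative, and note that very recent Keras versions name the LeakyReLU argument negative_slope rather than alpha.

import tensorflow as tf

# Hidden layer with He initialization and a Leaky ReLU activation
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, kernel_initializer='he_normal'),
    tf.keras.layers.LeakyReLU(alpha=0.01),   # f(z) = z if z > 0 else 0.01*z
    tf.keras.layers.Dense(10, activation='softmax')
])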