Aim
To implement and compare the effect of three fundamental activation functions — Sigmoid, Tanh, and ReLU — on the training dynamics and convergence behavior of an Artificial Neural Network trained via backpropagation on the MNIST dataset. By keeping all other hyperparameters (architecture, optimizer settings, loss function, initialization) strictly identical and varying only the hidden-layer activation function, this practical isolates and quantifies the impact of activation function choice on learning speed and final model accuracy.
Prerequisites
Theory
An activation function is a non-linear mathematical transformation applied to the weighted sum of inputs in a neural network neuron: a = f(z) = f(W·x + b). Without activation functions, a neural network would reduce to a linear model regardless of depth, since the composition of linear functions is itself linear. The non-linearity introduced by activation functions enables neural networks to approximate arbitrary continuous functions (the universal approximation theorem) and to learn complex decision boundaries for tasks like image classification, speech recognition, and natural language processing.
During backpropagation, the gradient of the loss with respect to each weight is computed using the chain rule. For the weights of layer l, whose input is the previous layer's activation a(l−1), the gradient is: ∂L/∂W(l) = (∂L/∂a(l) ⊙ f′(z(l))) · (a(l−1))ᵀ. The term f′(z(l)) — the derivative of the activation function — is critical. If this derivative is consistently small (close to zero), gradients diminish exponentially as they propagate backward through many layers, causing weights in early layers to update negligibly. This is the vanishing gradient problem, which prevents deep networks from learning effectively.
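The exponential shrinkage can be made concrete with a back-of-the-envelope calculation: if every layer contributes a derivative at Sigmoid's best-case value of 0.25, the gradient reaching the first layer of a 10-layer network is scaled by 0.25¹⁰ ≈ 10⁻⁶. A minimal sketch in plain Python (the per-layer derivative values are illustrative assumptions, not measured from a network):

```python
def gradient_scale(per_layer_derivative: float, n_layers: int) -> float:
    """Factor by which a gradient shrinks after passing backward through
    n layers, assuming each layer multiplies it by its activation derivative."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_derivative
    return scale

# Sigmoid contributes at most 0.25 per layer; an active ReLU path contributes 1.0.
print(gradient_scale(0.25, 10))  # ~9.5e-07: the gradient effectively vanishes
print(gradient_scale(1.0, 10))   # 1.0: the gradient is preserved
```

The same arithmetic explains why the problem worsens with depth: each additional saturating layer multiplies in another factor below 1.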
Sigmoid squashes any real-valued input into the range (0, 1), making it interpretable as a probability. Its derivative is σ'(z) = σ(z)(1 − σ(z)), which attains a maximum value of 0.25 at z = 0. For inputs far from zero (saturated neurons), the derivative approaches zero, causing gradients to vanish. Additionally, Sigmoid outputs are always positive (not zero-centered), which means gradients during backpropagation are either all positive or all negative, creating inefficient zigzag paths in the weight space.
Tanh (hyperbolic tangent) maps inputs to (−1, 1). Its derivative is tanh'(z) = 1 − tanh²(z), with a maximum of 1.0 at z = 0 — four times larger than Sigmoid's peak. This means Tanh propagates stronger gradients and typically converges faster. The zero-centered output also helps: both positive and negative gradients can flow, enabling more direct optimization paths. However, like Sigmoid, Tanh still saturates for large |z|, causing vanishing gradients in very deep networks.
ReLU (Rectified Linear Unit) is defined as f(z) = max(0, z). For positive inputs, it acts as a linear identity function with derivative 1 — the maximum possible. This completely eliminates the vanishing gradient problem for active (positive) neurons, allowing deep networks to train effectively. ReLU is also computationally trivial: no exponentials or divisions, just a threshold comparison. The trade-off is the dying ReLU problem: if a neuron's weights are initialized such that it always receives negative inputs, its gradient is permanently zero and the neuron never learns. Variants like Leaky ReLU address this with a small negative slope.
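The derivative values quoted above are easy to verify numerically. A short NumPy sketch (the function names are our own, not from the practical's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    # tanh'(z) = 1 - tanh^2(z); peaks at 1.0 when z = 0
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # 1 for z > 0, 0 otherwise (zero gradient is the "dying ReLU" region)
    return np.where(z > 0, 1.0, 0.0)

print(float(d_sigmoid(0.0)))  # 0.25
print(float(d_tanh(0.0)))     # 1.0
print(float(d_relu(5.0)))     # 1.0
print(float(d_relu(-5.0)))    # 0.0
```

Evaluating the derivatives far from zero (e.g. at z = 10) shows Sigmoid and Tanh collapsing toward zero while ReLU stays at exactly 1 on its active side.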
This practical uses Stochastic Gradient Descent (SGD) as the optimizer rather than Adam. SGD has a fixed learning rate and no momentum/RMSprop adaptations, making the optimizer behavior consistent across all three activation functions. This isolates the effect of activation choice: any differences in convergence speed and final accuracy are attributable solely to the activation function's gradient flow properties, not to optimizer-specific adaptive behaviors.
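Under `optimizer='sgd'` with no arguments, Keras applies plain gradient descent, w ← w − η·∇L (the default learning rate of 0.01 is an assumption worth checking against your Keras version). A minimal sketch showing why weak activation derivatives translate directly into small weight updates:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: no momentum, no per-parameter adaptation.
    # The step taken is directly proportional to the incoming gradient.
    return w - lr * grad

w = np.array([0.5, -0.3])
strong_grad = np.array([1.0, 1.0])  # e.g. through an active ReLU path, f'(z) = 1
weak_grad = 0.25 * strong_grad      # same signal after one Sigmoid layer at its best

print(sgd_step(w, strong_grad))  # each weight moves by 0.01
print(sgd_step(w, weak_grad))    # each weight moves by only 0.0025: 4x slower
```

An adaptive optimizer such as Adam would rescale these two updates toward each other, which is exactly the compensation this experiment avoids.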
Algorithm / Step-by-Step
- Import `tensorflow` and `matplotlib.pyplot`.
- Load the MNIST dataset using `tf.keras.datasets.mnist.load_data()`, which returns training and test splits of 28×28 grayscale images with integer labels 0–9.
- Normalize pixel values from [0, 255] to [0.0, 1.0] by dividing both train and test sets by 255.0. This scaling is essential for stable gradient flow with Sigmoid and Tanh activations.
- Define a `build_and_train_model(activation_function_name, epochs)` function that:
  - Prints the activation function being tested for clear logging.
  - Builds a `Sequential` model with: `Flatten(input_shape=(28, 28))` → `Dense(128, activation=activation_function_name)` → `Dense(10, activation='softmax')`.
  - Compiles with `optimizer='sgd'`, `loss='sparse_categorical_crossentropy'`, `metrics=['accuracy']`.
  - Trains with `model.fit()` for the specified epochs, using `validation_split=0.1` (10% of training data held out for validation) and `verbose=0` for clean output.
  - Evaluates on the test set and prints final test accuracy.
  - Returns the `history` object containing per-epoch metrics.
- Create a list of activation functions to test: `['sigmoid', 'tanh', 'relu']`.
- Iterate over each activation function, call `build_and_train_model()` with `epochs=15`, and store the returned history in a dictionary keyed by function name.
- Create a Matplotlib figure (10×6 inches) and plot the validation accuracy curve for each activation function on the same axes, with a legend, grid, and descriptive title.
- Analyze the plot: ReLU should show the steepest initial rise and highest final accuracy, Tanh should converge faster than Sigmoid but slower than ReLU, and Sigmoid should exhibit the slowest convergence due to its weak gradient signal.
Key Code Concepts
Snippet 1 — Data Loading and Normalization
```python
# Load MNIST — 60,000 train + 10,000 test grayscale images
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize to [0, 1] — critical for Sigmoid/Tanh saturation
x_train, x_test = x_train / 255.0, x_test / 255.0

print(f'Training data shape: {x_train.shape}')  # Output: (60000, 28, 28)
```
Normalization to [0, 1] is especially important when using Sigmoid and Tanh activations. Sigmoid outputs range (0, 1), and if raw pixel values [0, 255] were fed directly, the weighted sum z = W·x + b would be extremely large, pushing the Sigmoid deep into saturation where its derivative is nearly zero. Normalization keeps z in a reasonable range, ensuring the activation operates in its high-gradient region near z = 0. ReLU is less sensitive to input scale but still benefits from normalization for stable training.
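This saturation effect is easy to check numerically. A short sketch with illustrative pre-activation magnitudes (assumed for demonstration, not taken from the actual network):

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# With raw pixels in [0, 255], a weighted sum z = W.x + b can easily reach the hundreds.
print(sigmoid_derivative(100.0))  # ~4e-44: the neuron is saturated, gradient vanishes

# After normalizing to [0, 1], z stays near the high-gradient region around 0.
print(sigmoid_derivative(0.5))    # ~0.235: a healthy gradient
```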
Snippet 2 — Controlled Model Builder Function
```python
def build_and_train_model(activation_name, epochs=15):
    print(f"\n--- Training with {activation_name.upper()} ---")
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        # Hidden layer: 128 neurons, injectable activation
        tf.keras.layers.Dense(128, activation=activation_name),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(
        optimizer='sgd',  # Fixed learning rate, no adaptivity
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    history = model.fit(
        x_train, y_train,
        epochs=epochs,
        validation_split=0.1,  # 6,000 images for validation
        verbose=0
    )
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test Accuracy ({activation_name}): {test_acc:.4f}")
    return history
```
The build_and_train_model function enforces a controlled experiment:
every hyperparameter except the activation function is held constant. The architecture is a simple
one-hidden-layer network (Flatten → Dense 128 → Dense 10) with approximately
101,770 trainable parameters. Using SGD instead of Adam is deliberate — Adam's adaptive
learning rates could mask the differences between activation functions by compensating for weak
gradients. SGD's fixed learning rate makes the gradient flow properties of each activation directly
observable in the learning curves.
Snippet 3 — Training All Three Models and Comparing
```python
# Test all three activation functions
activation_functions = ['sigmoid', 'tanh', 'relu']
histories = {}

for func in activation_functions:
    histories[func] = build_and_train_model(func, epochs=15)

# ── Results (typical) ──
# Test Accuracy (sigmoid): ~0.9200
# Test Accuracy (tanh):    ~0.9650
# Test Accuracy (relu):    ~0.9750
```
The function is called three times, each with a different activation string. Keras resolves the string to the corresponding activation function internally. The histories dictionary maps each function name to its training history object, which contains per-epoch arrays for loss, accuracy, val_loss, and val_accuracy. Storing all three histories enables direct comparison on the same plot. Typical results show ReLU achieving the highest test accuracy (~97–98%), Tanh performing slightly lower (~96–97%), and Sigmoid lagging behind (~92–94%) after 15 epochs.
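Since each `history.history` is a plain dict of per-epoch lists, the comparison data can be pulled out directly. A sketch using a mocked stand-in for the Keras `History` object (the accuracy numbers are invented for illustration, not real training output):

```python
# Minimal stand-in for a Keras History object: only the attribute the plot uses.
class MockHistory:
    def __init__(self, val_accuracy):
        self.history = {'val_accuracy': val_accuracy}

histories = {
    'sigmoid': MockHistory([0.60, 0.75, 0.85]),
    'relu':    MockHistory([0.90, 0.95, 0.97]),
}

# Final validation accuracy per activation, i.e. the last point of each curve
final = {name: h.history['val_accuracy'][-1] for name, h in histories.items()}
print(final)  # {'sigmoid': 0.85, 'relu': 0.97}
```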
Snippet 4 — Visualization of Validation Accuracy
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
for func in activation_functions:
    plt.plot(
        histories[func].history['val_accuracy'],
        label=func.capitalize()
    )
plt.title('Validation Accuracy Comparison by Activation Function')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
```
The plot visualizes the convergence dynamics of each activation function. The ReLU curve typically
shows the steepest initial ascent in the first 2–3 epochs, rapidly reaching high validation
accuracy. The Tanh curve rises more gradually but consistently outperforms Sigmoid. The Sigmoid
curve has the shallowest slope, reflecting the weak gradient signals (max derivative 0.25) that
slow weight updates. The loc="lower right" legend placement keeps it visible since
all curves monotonically increase toward the top of the plot.
Expected Output
Console Output: Three training logs print sequentially, each showing the activation function name and final test accuracy. Expected approximate values after 15 epochs of SGD:
| Activation | Test Accuracy | Convergence Speed | Key Characteristic |
|---|---|---|---|
| Sigmoid | ~0.92 – 0.94 | Slowest | Weak gradients (max σ' = 0.25), non-zero-centered output |
| Tanh | ~0.96 – 0.97 | Moderate | Stronger gradients (max tanh' = 1.0), zero-centered output |
| ReLU | ~0.97 – 0.98 | Fastest | No vanishing gradient for z > 0, computationally efficient |
Validation Accuracy Plot: A 10×6 inch line plot with three curves:
- ReLU (purple/blue): Steepest initial rise, reaching ~95% validation accuracy by epoch 5 and plateauing near ~97–98% by epoch 15.
- Tanh (cyan): Moderate slope, reaching ~94% by epoch 5 and plateauing near ~96–97%. The curve is smooth and consistently above Sigmoid.
- Sigmoid (orange): Shallowest slope, slowly climbing to ~92–94% by epoch 15. The curve may show more fluctuation due to weaker gradient signals.
All curves should be monotonically increasing (or nearly so), with the gap between ReLU and Sigmoid widening in the early epochs and stabilizing in later epochs as all models approach their respective convergence plateaus.
