
NIELIT Ropar

Deep Learning Techniques · DOAI250006


Practical — 05

Handling & Evaluating
Imbalanced Data using Resampling, Class Weights & Ensemble Methods


Aim

To understand, detect, and address the problem of class imbalance in binary classification datasets using three complementary deep learning and machine learning techniques: (1) SMOTE (Synthetic Minority Over-sampling Technique) for data-level resampling, (2) class-weighted loss functions for algorithm-level cost-sensitive learning, and (3) Balanced Random Forest ensemble method. The practical demonstrates evaluation using precision, recall, and F1-score metrics that properly reflect model performance on minority classes.

Prerequisites

Python & NumPy
TensorFlow / Keras
Scikit-learn
imblearn Library
Binary Classification
Class Imbalance Concepts
Confusion Matrix
Precision, Recall, F1-Score
Sigmoid Activation
Binary Cross-Entropy

Theory

Class imbalance is a pervasive problem in real-world machine learning where the number of examples in one class (the majority class) far exceeds the number of examples in another class (the minority class). A dataset with a 95:5 split, for instance, means 95% of samples belong to Class 0 and only 5% to Class 1. When trained naively on such data, a model can achieve 95% accuracy by simply predicting the majority class for every sample — a phenomenon called the accuracy paradox. Such a model is useless for detecting the minority class, which is often the class of greatest interest (e.g., fraud, disease, equipment failure).

The root cause of this failure lies in the loss landscape. Standard loss functions like binary cross-entropy L = −(y · log(p) + (1−y) · log(1−p)) are averaged over all samples. With 95% majority-class samples, the gradient is dominated by majority-class errors, and the model learns to prioritize overall accuracy at the expense of minority-class recall. Evaluation metrics must therefore shift away from accuracy to Precision (fraction of predicted positives that are correct), Recall (fraction of actual positives correctly identified), and the F1-Score (harmonic mean: 2PR/(P+R)), which together paint a complete picture of per-class performance.
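As a concrete illustration of these metrics, the following minimal sketch uses hypothetical confusion-matrix counts for a 2000-sample test set with a 95:5 split and computes precision, recall, and F1 by hand alongside accuracy:

# Hypothetical confusion-matrix counts (illustrative only)
tp, fn = 30, 70      # minority (Class 1): 30 caught, 70 missed
tn, fp = 1880, 20    # majority (Class 0): mostly correct

accuracy  = (tp + tn) / (tp + tn + fp + fn)        # dominated by the majority class
precision = tp / (tp + fp)                         # fraction of predicted positives that are correct
recall    = tp / (tp + fn)                         # fraction of actual positives identified
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")    # 0.955, looks excellent
print(f"Precision: {precision:.3f}")   # 0.600
print(f"Recall:    {recall:.3f}")      # 0.300, the model misses 70% of positives
print(f"F1-score:  {f1:.3f}")          # 0.400

Despite a 95.5% accuracy, the minority-class recall of 0.30 reveals how poorly this hypothetical model actually serves the class of interest.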

Method 1 (SMOTE): Data-level approach that generates synthetic minority samples by interpolating between existing minority instances and their k-nearest neighbors.
Method 2 (Class Weights): Algorithm-level approach that penalizes minority-class misclassifications more heavily by assigning a higher loss weight to the under-represented class.
Method 3 (Balanced Random Forest): Ensemble method that combines bagging with random under-sampling of the majority class in each bootstrap sample to build balanced decision trees.

SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla et al. in 2002, operates at the data level. For each minority-class sample, SMOTE finds its k nearest minority-class neighbors (typically k=5), randomly selects one, and generates a synthetic point along the line segment connecting the two: x_new = x_i + λ · (x_j − x_i), where λ is a random number in [0, 1]. Unlike simple random oversampling (which duplicates existing points and encourages overfitting), SMOTE creates genuinely new samples that expand the minority class's decision region. The resampling is applied only to the training set to prevent data leakage — the test set must reflect the original, real-world distribution.
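A minimal NumPy sketch of this interpolation step (illustrative only; the actual SMOTE implementation in imblearn also handles neighbor search and sampling ratios):

import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([2.0, 5.0])        # an existing minority sample
x_j = np.array([3.0, 4.0])        # one of its k nearest minority neighbors

lam = rng.random()                 # λ drawn uniformly from [0, 1)
x_new = x_i + lam * (x_j - x_i)    # synthetic point on the segment between x_i and x_j
print(x_new)                       # lies somewhere between [2, 5] and [3, 4]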

Class-weighted loss operates at the algorithm level without modifying the data. The weight for each class is typically computed as w_c = (1 / n_c) · (N / 2), where n_c is the number of samples in class c and N is the total number of samples. For a 95:5 split, Class 0 receives a weight near 0.53 and Class 1 a weight near 10.0. During backpropagation, the gradient contribution from minority-class errors is scaled up by this weight factor, forcing the model to pay more attention to the under-represented class. In Keras, this is passed via the class_weight argument in model.fit().
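The same weights can also be obtained with scikit-learn's compute_class_weight helper, whose 'balanced' mode computes N / (n_classes · n_c), identical to (1/n_c) · (N/2) for two classes. A brief sketch, assuming y_train from the data-preparation step:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' mode implements N / (n_classes * n_c) for each class
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_train)
class_weight = {0: weights[0], 1: weights[1]}
print(class_weight)    # approximately {0: 0.53, 1: 10.05} for a 95:5 split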

Balanced Random Forest (BRF), from the imbalanced-learn library, combines ensemble learning with built-in resampling. In a standard Random Forest, each tree is trained on a bootstrap sample drawn with replacement from the full training set. In BRF, each bootstrap sample is randomly under-sampled to achieve class balance before training each tree. This means every tree sees an equal number of majority and minority samples, and the ensemble vote aggregates decisions from hundreds of such balanced trees. This typically yields strong recall on the minority class while maintaining the robustness and low variance characteristic of random forests.
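To make the per-tree resampling concrete, the following sketch uses imblearn's RandomUnderSampler to show the kind of balanced subsample each tree in a Balanced Random Forest is effectively trained on (illustrative only; BRF performs this internally for every bootstrap, assuming X_train and y_train from the data-preparation step):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Down-sample the majority class to match the minority count, as BRF does per tree
rus = RandomUnderSampler(random_state=42)
X_tree, y_tree = rus.fit_resample(X_train, y_train)

print(f"Per-tree sample - Class 0: {np.sum(y_tree == 0)}, Class 1: {np.sum(y_tree == 1)}")
# Expected: both counts equal to the minority-class count (~398)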

Algorithm / Step-by-Step

  1. Import required libraries: tensorflow, numpy, sklearn.datasets.make_classification, train_test_split, classification_report, SMOTE from imblearn, and BalancedRandomForestClassifier from imblearn.
  2. Generate a synthetic imbalanced dataset using make_classification() with n_samples=10000, n_features=20, n_classes=2, and weights=[0.95, 0.05] to create a 95:5 class distribution.
  3. Split the data into training and test sets using train_test_split() with test_size=0.2 and random_state=42 for reproducibility.
  4. Print the class distribution in the training set to confirm imbalance (expect ~7600:400 split).
  5. Method 1 — SMOTE:
    • Instantiate SMOTE(random_state=42) and call fit_resample(X_train, y_train) to generate balanced training data.
    • Verify the new class distribution is equal (50:50).
    • Build a simple ANN: Dense(16, ReLU) → Dense(1, Sigmoid).
    • Compile with binary_crossentropy and adam optimizer.
    • Train for 5 epochs on the SMOTE-resampled data.
    • Evaluate on the original test set and print classification report.
  6. Method 2 — Class Weights:
    • Compute class weights using the inverse frequency formula: weight_for_c = (1 / count_c) * (total / 2.0).
    • Build an identical ANN architecture as Method 1.
    • Train on the original (imbalanced) training data, but pass class_weight={0: w0, 1: w1} to model.fit().
    • Evaluate on the test set and print classification report.
  7. Method 3 — Balanced Random Forest:
    • Instantiate BalancedRandomForestClassifier(n_estimators=100, random_state=42).
    • Fit directly on the original imbalanced training data — no resampling needed.
    • Predict on the test set and print classification report.
  8. Compare the three classification reports side-by-side, focusing on minority-class (Class 1) precision, recall, and F1-score (a minimal comparison sketch follows this list).
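A minimal sketch of the side-by-side comparison in step 8, assuming the three sets of test-set predictions are available (the names y_pred_smote and y_pred_weighted are used here for illustration; y_pred_brf comes from Snippet 4):

from sklearn.metrics import precision_recall_fscore_support

# Collect minority-class (Class 1) metrics for each method in one small table
predictions = {
    "SMOTE + ANN":            y_pred_smote,
    "Class-Weighted ANN":     y_pred_weighted,
    "Balanced Random Forest": y_pred_brf,
}

print(f"{'Method':<24} {'Precision':>9} {'Recall':>7} {'F1':>6}")
for name, y_pred in predictions.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, labels=[1], average=None
    )
    print(f"{name:<24} {p[0]:>9.2f} {r[0]:>7.2f} {f1[0]:>6.2f}")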

Key Code Concepts

Snippet 1 — Generating the Imbalanced Dataset

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# 10000 samples, 20 features, 2 classes, 95% vs 5% split
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_classes=2,
    weights=[0.95, 0.05],
    random_state=42
)

# 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
# Output: Class 0: 7602, Class 1: 398

The make_classification utility from scikit-learn creates synthetic datasets with controllable class balance via the weights parameter. This is invaluable for benchmarking imbalance-handling techniques because the ground-truth distribution is known and reproducible. The random_state=42 ensures every run produces the identical dataset, enabling fair comparison across the three methods.

Snippet 2 — SMOTE Resampling

import tensorflow as tf
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report

# Apply SMOTE only to training data — never to test data
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(f"After SMOTE — Class 0: {np.sum(y_train_res == 0)}, Class 1: {np.sum(y_train_res == 1)}")
# Output: Class 0: 7602, Class 1: 7602 (balanced!)

# Train ANN on SMOTE-balanced data
model_smote = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_smote.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_smote.fit(X_train_res, y_train_res, epochs=5, verbose=0)

y_pred = (model_smote.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

SMOTE's fit_resample() generates synthetic minority samples until both classes have equal counts. The synthetic generation uses k-nearest neighbors in feature space, creating plausible new points rather than duplicating existing ones. Critical: SMOTE is applied only to the training set. Applying it to the test set would inflate performance unrealistically since the test distribution would no longer match the real-world scenario.

Snippet 3 — Class-Weighted Loss

# Compute inverse-frequency class weights
total = len(y_train)
pos = np.sum(y_train == 1)   # minority count
neg = total - pos              # majority count

weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
# Example: {0: 0.526, 1: 10.055}

# Same architecture, but train on original data with class_weight
model_weighted = tf.keras.Sequential([...])  # identical architecture to Snippet 2
model_weighted.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_weighted.fit(
    X_train, y_train,
    epochs=5,
    class_weight=class_weight,  # ← key parameter
    verbose=0
)

The weight formula w_c = (1/n_c) · (N/2) normalizes so that each class's total weight (its per-sample weight times its sample count) equals N/2, meaning each class contributes equally to the total loss in expectation. The class_weight dictionary maps integer class labels to their corresponding scalar weights. Keras multiplies each sample's loss by its class weight before computing the gradient, effectively upweighting minority-class gradients by a factor of ~10× in this 95:5 scenario.

Snippet 4 — Balanced Random Forest Ensemble

from imblearn.ensemble import BalancedRandomForestClassifier

# Ensemble with built-in balanced bootstrapping
brf = BalancedRandomForestClassifier(
    n_estimators=100,   # 100 balanced trees
    random_state=42
)
brf.fit(X_train, y_train)

y_pred_brf = brf.predict(X_test)
print(classification_report(y_test, y_pred_brf))

Unlike the neural network approaches, Balanced Random Forest requires no data modification and no manual weight calculation. It uses an ensemble of 100 decision trees where each tree is trained on a randomly under-sampled bootstrap that balances the classes. The final prediction is a majority vote across all trees. This method often achieves strong minority-class recall with minimal hyperparameter tuning and is particularly effective when the feature space has informative structure that tree-based models can exploit.

Expected Output

Dataset Generation: Console confirms the imbalanced split with approximately Class 0: 7602 samples and Class 1: 398 samples in the training set. A naive "always predict 0" classifier would achieve ~95% accuracy, highlighting why accuracy alone is misleading.
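This naive baseline can be verified directly with scikit-learn's DummyClassifier (a quick sketch, assuming the train/test split from Snippet 1):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# A "classifier" that always predicts the most frequent class (Class 0)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)

print(dummy.score(X_test, y_test))    # ~0.95 accuracy
print(classification_report(y_test, dummy.predict(X_test), zero_division=0))
# Class 1 precision, recall, and F1 are all 0.00: the accuracy paradox in action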

SMOTE Resampling: After applying SMOTE, the class distribution becomes perfectly balanced at approximately 7602:7602. The classification report should show significantly improved recall for Class 1 (typically 0.75–0.85) compared to a naive model, with an F1-score for the minority class in the range of 0.50–0.65. Precision for Class 1 may be moderate (~0.40–0.55) due to some synthetic samples being generated in ambiguous regions.

Class-Weighted Model: The calculated weights print as approximately {0: 0.53, 1: 10.06}. The model achieves comparable minority-class recall to SMOTE (often 0.70–0.80) but may show slightly different precision characteristics since it operates on the original data distribution. The F1-score for Class 1 typically falls in the 0.45–0.60 range.

Balanced Random Forest: The ensemble often achieves the strongest overall minority-class performance with Class 1 recall typically reaching 0.80–0.90 and F1-scores in the 0.55–0.70 range, outperforming both neural network approaches on this tabular synthetic dataset. The model also outputs feature importance scores accessible via brf.feature_importances_.
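A brief sketch of inspecting these importances, assuming the brf model fitted in Snippet 4:

import numpy as np

# Rank features by their importance in the balanced ensemble
importances = brf.feature_importances_
top = np.argsort(importances)[::-1][:5]
for idx in top:
    print(f"Feature {idx}: {importances[idx]:.3f}")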

Method                     | Class 1 Precision | Class 1 Recall | Class 1 F1 | Overall Accuracy
Naive (Predict Majority)   | 0.00              | 0.00           | 0.00       | ~95%
SMOTE + ANN                | 0.40–0.55         | 0.75–0.85      | 0.50–0.65  | ~85–92%
Class-Weighted ANN         | 0.35–0.50         | 0.70–0.80      | 0.45–0.60  | ~85–92%
Balanced Random Forest     | 0.45–0.60         | 0.80–0.90      | 0.55–0.70  | ~88–94%

Note: Exact values vary across runs due to neural network initialization randomness. The table shows approximate ranges observed over multiple executions.

Viva Questions & Answers

Q1. What is the accuracy paradox in the context of imbalanced datasets?
The accuracy paradox refers to the phenomenon where a classifier achieves high accuracy (e.g., 95%) by simply predicting the majority class for all samples, while completely failing to detect the minority class. In a 95:5 imbalanced dataset, a model that always predicts the majority class will have 95% accuracy but 0% recall for the minority class. This makes accuracy a misleading metric for imbalanced problems, and necessitates the use of precision, recall, F1-score, and the confusion matrix for proper evaluation.
Q2. How does SMOTE generate synthetic samples differently from random oversampling?
Random oversampling simply duplicates existing minority-class samples, which can lead to overfitting since the model sees the exact same points repeatedly. SMOTE (Synthetic Minority Over-sampling Technique), on the other hand, creates entirely new synthetic samples by interpolating between a minority-class sample and one of its k nearest minority-class neighbors. The formula is x_new = x_i + λ(x_j − x_i) where λ ∈ [0,1] is random. This expands the minority class's decision boundary in feature space rather than replicating existing points, reducing overfitting and improving generalization.
Q3. Why should SMOTE be applied only to the training set and not the test set?
Applying SMOTE to the test set constitutes data leakage and produces an unrealistic evaluation. The test set must represent the true, original distribution that the model will encounter in production. If the test set is resampled to be balanced, the reported metrics (especially precision and recall) will be artificially inflated and will not reflect real-world performance. SMOTE is a training-time data augmentation technique — it modifies how the model learns, not what distribution it is evaluated against.
Q4. Explain the mathematical intuition behind class weights in binary cross-entropy loss.
The standard binary cross-entropy loss averages equally over all samples: L = −(1/N)Σ[y_i log(p_i) + (1−y_i)log(1−p_i)]. With class weights w_0 and w_1, the weighted loss becomes L = −(1/N)Σ[w_1 · y_i log(p_i) + w_0 · (1−y_i)log(1−p_i)]. When w_1 >> w_0 (as in our 95:5 case where w_1 ≈ 10), minority-class misclassifications contribute ~10× more to the gradient during backpropagation. This forces the optimizer to prioritize correcting minority-class errors, shifting the decision boundary toward better minority-class recall.
Q5. How does Balanced Random Forest differ from a standard Random Forest?
In a standard Random Forest, each decision tree is trained on a bootstrap sample drawn with replacement from the full training set, preserving the original class distribution. In a Balanced Random Forest, each bootstrap sample is randomly under-sampled to achieve class balance before training each tree. Specifically, for each tree, BRF randomly selects a subset of majority-class samples equal in size to the minority-class count, creating a balanced training set. The final prediction aggregates votes from hundreds of such trees, each trained on a different balanced subsample. This built-in resampling at the ensemble level often yields superior minority-class recall without requiring any external data preprocessing.