Aim
To understand, detect, and address the problem of class imbalance in binary classification datasets using three complementary deep learning and machine learning techniques: (1) SMOTE (Synthetic Minority Over-sampling Technique) for data-level resampling, (2) class-weighted loss functions for algorithm-level cost-sensitive learning, and (3) Balanced Random Forest ensemble method. The practical demonstrates evaluation using precision, recall, and F1-score metrics that properly reflect model performance on minority classes.
Prerequisites
Theory
Class imbalance is a pervasive problem in real-world machine learning where the number of examples in one class (the majority class) far exceeds the number of examples in another class (the minority class). A dataset with a 95:5 split, for instance, means 95% of samples belong to Class 0 and only 5% to Class 1. When trained naively on such data, a model can achieve 95% accuracy by simply predicting the majority class for every sample — a phenomenon called the accuracy paradox. Such a model is useless for detecting the minority class, which is often the class of greatest interest (e.g., fraud, disease, equipment failure).
The root cause of this failure lies in the loss landscape. Standard loss functions like binary cross-entropy L = −(y · log(p) + (1−y) · log(1−p)) are averaged over all samples. With 95% majority-class samples, the gradient is dominated by majority-class errors, and the model learns to prioritize overall accuracy at the expense of minority-class recall. Evaluation metrics must therefore shift away from accuracy to Precision (fraction of predicted positives that are correct), Recall (fraction of actual positives correctly identified), and the F1-Score (harmonic mean: 2PR/(P+R)), which together paint a complete picture of per-class performance.
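To make the accuracy paradox concrete, the short sketch below (with made-up counts for a 95:5 split) scores an "always predict 0" classifier using the four metrics discussed above; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical 95:5 ground truth: 950 majority (0) and 50 minority (1) samples
y_true = np.array([0] * 950 + [1] * 50)
y_naive = np.zeros_like(y_true)  # "always predict 0"

print(accuracy_score(y_true, y_naive))                    # 0.95 (misleadingly high)
print(recall_score(y_true, y_naive, zero_division=0))     # 0.0, no minority sample found
print(precision_score(y_true, y_naive, zero_division=0))  # 0.0
print(f1_score(y_true, y_naive, zero_division=0))         # 0.0
```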
SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla et al. in 2002, operates at the data level. For each minority-class sample, SMOTE finds its k nearest minority-class neighbors (typically k=5), randomly selects one, and generates a synthetic point along the line segment connecting the two: x_new = x_i + λ · (x_j − x_i), where λ is a random number in [0, 1]. Unlike simple random oversampling (which duplicates existing points and encourages overfitting), SMOTE creates genuinely new samples that expand the minority class's decision region. The resampling is applied only to the training set to prevent data leakage — the test set must reflect the original, real-world distribution.
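The sketch below is a minimal illustration of the interpolation rule (not imblearn's actual implementation), assuming a small synthetic minority-class matrix X_min.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # stand-in minority-class samples

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 because each point is its own nearest neighbor
_, idx = nn.kneighbors(X_min)

i = 0                              # pick one minority sample
j = rng.choice(idx[i][1:])         # random one of its k minority-class neighbors
lam = rng.random()                 # λ in [0, 1)
x_new = X_min[i] + lam * (X_min[j] - X_min[i])  # x_new = x_i + λ · (x_j − x_i)
print(x_new)
```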
Class-weighted loss operates at the algorithm level without modifying the data. The weight for each class is typically computed as w_c = (1 / n_c) · (N / 2), where n_c is the number of samples in class c and N is the total number of samples. For a 95:5 split, Class 0 receives a weight near 0.53 and Class 1 a weight near 10.0. During backpropagation, the gradient contribution from minority-class errors is scaled up by this weight factor, forcing the model to pay more attention to the under-represented class. In Keras, this is passed via the class_weight argument in model.fit().
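As a sketch of the computation (the label vector below is illustrative), the hand-computed weights can be cross-checked against scikit-learn's 'balanced' heuristic, which uses the same N / (n_classes · n_c) formula.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 7602 + [1] * 398)  # illustrative 95:5 training labels

total = len(y_train)
pos = np.sum(y_train == 1)
neg = total - pos
manual = {0: (1 / neg) * (total / 2.0), 1: (1 / pos) * (total / 2.0)}

auto = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
print(manual)                    # ≈ {0: 0.526, 1: 10.05}
print(dict(zip([0, 1], auto)))   # same values from scikit-learn's heuristic
```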
Balanced Random Forest (BRF), from the imbalanced-learn library, combines ensemble learning with built-in resampling. In a standard Random Forest, each tree is trained on a bootstrap sample drawn with replacement from the full training set. In BRF, each bootstrap sample is randomly under-sampled to achieve class balance before training each tree. This means every tree sees an equal number of majority and minority samples, and the ensemble vote aggregates decisions from hundreds of such balanced trees. This typically yields strong recall on the minority class while maintaining the robustness and low variance characteristic of random forests.
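A rough sketch of the per-tree idea follows, assuming X_train and y_train as NumPy arrays; the real BalancedRandomForestClassifier performs this balanced bootstrapping internally for every tree, so this is for intuition only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

def fit_one_balanced_tree(X_train, y_train, seed):
    """Draw a bootstrap sample, under-sample it to 50:50, fit one tree."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    boot = rng.integers(0, n, size=n)        # bootstrap with replacement
    X_b, y_b = X_train[boot], y_train[boot]
    X_bal, y_bal = RandomUnderSampler(random_state=seed).fit_resample(X_b, y_b)
    return DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal)

# e.g. tree = fit_one_balanced_tree(X_train, y_train, seed=0)
```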
Algorithm / Step-by-Step
- Import required libraries: tensorflow, numpy, scikit-learn's make_classification, train_test_split, and classification_report, and SMOTE and BalancedRandomForestClassifier from imblearn.
- Generate a synthetic imbalanced dataset using make_classification() with n_samples=10000, n_features=20, n_classes=2, and weights=[0.95, 0.05] to create a 95:5 class distribution.
- Split the data into training and test sets using train_test_split() with test_size=0.2 and random_state=42 for reproducibility.
- Print the class distribution in the training set to confirm the imbalance (expect a ~7600:400 split).
- Method 1 — SMOTE:
  - Instantiate SMOTE(random_state=42) and call fit_resample(X_train, y_train) to generate balanced training data.
  - Verify the new class distribution is equal (50:50).
  - Build a simple ANN: Dense(16, ReLU) → Dense(1, Sigmoid).
  - Compile with binary_crossentropy and the adam optimizer.
  - Train for 5 epochs on the SMOTE-resampled data.
  - Evaluate on the original test set and print the classification report.
- Method 2 — Class Weights:
  - Compute class weights using the inverse frequency formula: weight_for_c = (1 / count_c) * (total / 2.0).
  - Build an ANN architecture identical to Method 1.
  - Train on the original (imbalanced) training data, but pass class_weight={0: w0, 1: w1} to model.fit().
  - Evaluate on the test set and print the classification report.
- Method 3 — Balanced Random Forest:
  - Instantiate BalancedRandomForestClassifier(n_estimators=100, random_state=42).
  - Fit directly on the original imbalanced training data — no resampling needed.
  - Predict on the test set and print the classification report.
- Compare the three classification reports side-by-side, focusing on minority-class (Class 1) precision, recall, and F1-score; a sketch of this comparison step follows the list.
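The final comparison step is not covered by the snippets below, so here is a minimal sketch; the names y_pred_smote and y_pred_weighted are placeholders for the test-set predictions of the two neural models, and y_pred_brf for the forest's predictions.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

results = {
    'SMOTE + ANN': y_pred_smote,
    'Class-Weighted ANN': y_pred_weighted,
    'Balanced Random Forest': y_pred_brf,
}

# One row per method: minority-class precision, recall, F1, plus overall accuracy
print(f"{'Method':<24} {'P(1)':>6} {'R(1)':>6} {'F1(1)':>6} {'Acc':>6}")
for name, y_pred in results.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, labels=[1], zero_division=0
    )
    acc = accuracy_score(y_test, y_pred)
    print(f"{name:<24} {p[0]:>6.2f} {r[0]:>6.2f} {f1[0]:>6.2f} {acc:>6.2f}")
```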
Key Code Concepts
Snippet 1 — Generating the Imbalanced Dataset
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# 10000 samples, 20 features, 2 classes, 95% vs 5% split
X, y = make_classification(
    n_samples=10000, n_features=20, n_classes=2,
    weights=[0.95, 0.05], random_state=42
)

# 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Class 0: {np.sum(y_train == 0)}, Class 1: {np.sum(y_train == 1)}")
# Output: Class 0: 7602, Class 1: 398
```
The make_classification utility from scikit-learn creates synthetic datasets with
controllable class balance via the weights parameter. This is invaluable for
benchmarking imbalance-handling techniques because the ground-truth distribution is known and
reproducible. The random_state=42 ensures every run produces the identical dataset,
enabling fair comparison across the three methods.
Snippet 2 — SMOTE Resampling
```python
import tensorflow as tf
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Apply SMOTE only to training data — never to test data
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(f"After SMOTE — Class 0: {np.sum(y_train_res == 0)}, Class 1: {np.sum(y_train_res == 1)}")
# Output: Class 0: 7602, Class 1: 7602 (balanced!)

# Train ANN on SMOTE-balanced data
model_smote = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_smote.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_smote.fit(X_train_res, y_train_res, epochs=5, verbose=0)

y_pred = (model_smote.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))
```
SMOTE's fit_resample() generates synthetic minority samples until both classes have
equal counts. The synthetic generation uses k-nearest neighbors in feature space, creating
plausible new points rather than duplicating existing ones. Critical: SMOTE is
applied only to the training set. Applying it to the test set would inflate performance
unrealistically since the test distribution would no longer match the real-world scenario.
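If cross-validation is used instead of a single train-test split, one way to keep the resampling strictly inside each training fold is imblearn's Pipeline, which (unlike scikit-learn's) accepts samplers. The sketch below uses logistic regression as a stand-in estimator rather than the Keras ANN from this practical.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),       # resampling applied per fold, on training data only
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
print(scores.mean())
```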
Snippet 3 — Class-Weighted Loss
```python
# Compute inverse-frequency class weights
total = len(y_train)
pos = np.sum(y_train == 1)   # minority count
neg = total - pos            # majority count

weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}
# Example: {0: 0.526, 1: 10.055}

# Same architecture as Method 1, but trained on the original data with class_weight
model_weighted = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_weighted.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_weighted.fit(
    X_train, y_train,
    epochs=5,
    class_weight=class_weight,   # ← key parameter
    verbose=0
)
```
The weight formula w_c = (1/n_c) · (N/2) normalizes the weights so that each class's total weighted sample count, w_c · n_c, equals N/2; each class therefore contributes equally to the total loss in expectation. The class_weight dictionary maps integer class labels to their corresponding scalar weights. Keras multiplies each sample's loss by its class weight before computing the gradient, effectively upweighting minority-class gradients by a factor of ~10× in this 95:5 scenario.
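As an illustration of what Keras does with class_weight, a per-sample weight vector passed via sample_weight has the same effect on the training loss. This is a sketch reusing the weights and model from the snippet above; pass either class_weight or sample_weight, not both.

```python
import numpy as np

# One weight per training sample: w1 for minority samples, w0 for majority samples
sample_weight = np.where(y_train == 1, weight_for_1, weight_for_0)

model_weighted.fit(
    X_train, y_train,
    epochs=5,
    sample_weight=sample_weight,   # same effect as class_weight={0: w0, 1: w1}
    verbose=0
)
```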
Snippet 4 — Balanced Random Forest Ensemble
```python
from imblearn.ensemble import BalancedRandomForestClassifier

# Ensemble with built-in balanced bootstrapping
brf = BalancedRandomForestClassifier(
    n_estimators=100,   # 100 balanced trees
    random_state=42
)
brf.fit(X_train, y_train)

y_pred_brf = brf.predict(X_test)
print(classification_report(y_test, y_pred_brf))
```
Unlike the neural network approaches, Balanced Random Forest requires no data modification and no manual weight calculation. It uses an ensemble of 100 decision trees where each tree is trained on a randomly under-sampled bootstrap that balances the classes. The final prediction is a majority vote across all trees. This method often achieves strong minority-class recall with minimal hyperparameter tuning and is particularly effective when the feature space has informative structure that tree-based models can exploit.
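A small sketch of inspecting the impurity-based importances mentioned under Expected Output, assuming the fitted brf from the snippet above:

```python
import numpy as np

importances = brf.feature_importances_
top = np.argsort(importances)[::-1][:5]   # indices of the 5 most important features
for i in top:
    print(f"feature {i}: {importances[i]:.3f}")
```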
Expected Output
Dataset Generation: Console confirms the imbalanced split with approximately Class 0: 7602 samples and Class 1: 398 samples in the training set. A naive "always predict 0" classifier would achieve ~95% accuracy, highlighting why accuracy alone is misleading.
SMOTE Resampling: After applying SMOTE, the class distribution becomes perfectly balanced at approximately 7602:7602. The classification report should show significantly improved recall for Class 1 (typically 0.75–0.85) compared to a naive model, with an F1-score for the minority class in the range of 0.50–0.65. Precision for Class 1 may be moderate (~0.40–0.55) due to some synthetic samples being generated in ambiguous regions.
Class-Weighted Model: The calculated weights print as approximately
{0: 0.53, 1: 10.06}. The model achieves comparable minority-class recall to SMOTE
(often 0.70–0.80) but may show slightly different precision characteristics since it operates
on the original data distribution. The F1-score for Class 1 typically falls in the
0.45–0.60 range.
Balanced Random Forest: The ensemble often achieves the strongest overall
minority-class performance with Class 1 recall typically reaching 0.80–0.90
and F1-scores in the 0.55–0.70 range, outperforming both neural network
approaches on this tabular synthetic dataset. The model also outputs feature importance scores
accessible via brf.feature_importances_.
| Method | Class 1 Precision | Class 1 Recall | Class 1 F1 | Overall Accuracy |
|---|---|---|---|---|
| Naive (Predict Majority) | 0.00 | 0.00 | 0.00 | ~95% |
| SMOTE + ANN | 0.40–0.55 | 0.75–0.85 | 0.50–0.65 | ~85–92% |
| Class-Weighted ANN | 0.35–0.50 | 0.70–0.80 | 0.45–0.60 | ~85–92% |
| Balanced Random Forest | 0.45–0.60 | 0.80–0.90 | 0.55–0.70 | ~88–94% |
Note: Exact values vary across runs due to neural network initialization randomness. The table shows approximate ranges observed over multiple executions.
