Assume we are building a model from scratch, based on a chosen architecture.
1. Define a performance metric P.
2. Define the cost function C.
3. Pick a parameter-initialization strategy W.
4. Pick a cost-function optimization algorithm A.
5. Choose a hyperparameter tuning strategy T.
6. Pick a combination H of hyperparameter values using the tuning strategy T.
7. Initialize the parameters of model M using strategy W.
8. Train M with algorithm A, parametrized by hyperparameters H, to optimize cost function C.
9. If there are still untested hyperparameter values, pick another combination H using strategy T, and repeat steps 7 and 8.
10. Return the model for which the metric P was optimized.
We choose P, C, W, A, and T depending on data characteristics, model complexity, and training dynamics.
| # | Step | Examples |
|---|---|---|
| 1 | Performance metric (P) | Accuracy (classification); AUC-ROC (binary classification); F1-score (imbalanced data); MSE / RMSE (regression) |
| 2 | Cost/loss function (C) | Cross-entropy loss (classification); Mean Squared Error (MSE) (regression); Categorical cross-entropy (multiclass) |
| 3 | Parameter initialization strategy (W) | Xavier (Glorot) Initialization (tanh activations); He Initialization (ReLU activations); Random Uniform / Normal (can be unstable) |
| 4 | Optimization algorithm (A) | Stochastic Gradient Descent (SGD); Adam; RMSProp; Adagrad |
| 5 | Hyperparameter tuning strategy (T) | Grid Search; Random Search; Bayesian Optimization; Hyperband; Manual Tuning |
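Putting the steps and the example choices above together, the following is a minimal, runnable sketch of the loop, assuming a toy regression dataset, MSE as both C and (negated) P, He initialization as W, Adam as A, and a tiny grid search as T; all sizes, learning rates, and epoch counts are illustrative assumptions, not prescriptions.

```python
import torch
import torch.nn as nn
from itertools import product

# Toy regression data, purely illustrative.
torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

cost_fn = nn.MSELoss()                                              # C: minimized during training
metric = lambda pred, t: -nn.functional.mse_loss(pred, t).item()    # P: negative MSE, maximized on holdout

def build_and_init(hidden):                                         # W: He init for ReLU layers
    model = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    for m in model:
        if isinstance(m, nn.Linear):
            nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)
    return model

best_model, best_score = None, float("-inf")
for lr, hidden in product([1e-2, 1e-3], [16, 64]):                  # T: grid search over combinations H
    model = build_and_init(hidden)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)         # A: optimization algorithm
    for _ in range(200):                                            # train M to minimize C
        optimizer.zero_grad()
        cost_fn(model(X_tr), y_tr).backward()
        optimizer.step()
    with torch.no_grad():
        score = metric(model(X_va), y_va)                           # evaluate P on holdout data
    if score > best_score:
        best_model, best_score = model, score                       # keep the model that optimizes P
```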
#Performance Metric (P)
- Measures how good the model is, i.e. the quantity we want to maximize.
- We define a metric that lets us compare the performance of two models on the holdout data and select the better of the two.
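For example, a small sketch of evaluating one model's holdout predictions with accuracy and F1; the label and prediction tensors are toy placeholders.

```python
import torch

y_true = torch.tensor([0, 1, 1, 0, 1, 0, 1, 1])     # holdout labels (toy placeholder)
y_pred = torch.tensor([0, 1, 1, 0, 0, 0, 1, 1])     # one model's predictions (toy placeholder)

accuracy = (y_pred == y_true).float().mean()         # fine for balanced classes

tp = ((y_pred == 1) & (y_true == 1)).sum().float()
fp = ((y_pred == 1) & (y_true == 0)).sum().float()
fn = ((y_pred == 0) & (y_true == 1)).sum().float()
precision, recall = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # preferable when classes are imbalanced
```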
#Cost Function (C)
- Measures how bad the model is during training, i.e. the quantity we want to minimize.
- Defines what the learning algorithm optimizes when training a model.
- Regression: MSE
- Classification:
- Categorical Cross-Entropy for multiclass classification
- Binary Cross-Entropy for binary and multi-label classification.
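A brief sketch of these cost functions in PyTorch; the tensors are toy placeholders, and `nn.CrossEntropyLoss` / `nn.BCEWithLogitsLoss` expect raw logits rather than probabilities.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)                       # toy batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])             # class indices

multiclass = nn.CrossEntropyLoss()(logits, targets)                                # categorical cross-entropy
binary = nn.BCEWithLogitsLoss()(torch.randn(4), torch.tensor([1., 0., 1., 1.]))    # binary / multi-label cross-entropy
regression = nn.MSELoss()(torch.randn(4, 1), torch.randn(4, 1))                    # MSE for regression
```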
#Parameter Initialization Strategy (W)
- Determines the starting point from which the model learns; influences learning speed and stability.
#Optimization Algorithm (A)
- Determines how weights get updated after computing loss gradients.
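For instance, the corresponding constructors in `torch.optim`; the model and learning rates below are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # any model's parameters can be handed to an optimizer

sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)

# One update step: gradients of the loss are computed, then the optimizer adjusts the weights.
loss = nn.MSELoss()(model(torch.randn(8, 10)), torch.randn(8, 1))
loss.backward()
adam.step()
adam.zero_grad()
```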
#Hyperparameter tuning strategy (T)
- Finds the best hyperparameters, e.g. learning rate, batch size, architecture depth.
- Neural networks start with unknown weights/biases, which are updated during training.
- Initialization gives a starting point for optimization (gradient descent).
- Good initialization: faster convergence, stable gradients (avoiding vanishing/exploding) and better model performance.
- Bad initialization:
- All weights set to 0 or to the same value (e.g. 1): no learning, because gradients collapse or all units receive identical updates.
- Too large weights: exploding gradients; too small weights: vanishing gradients.
- Common initialization strategies: random normal, random uniform, Xavier, He.
- PyTorch:
- `nn.init.xavier_uniform_(layer.weight)`: Xavier/Glorot initialization
- `nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')`: He initialization
- `nn.init.zeros_(layer.bias)`: bias often initialized to zero
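A runnable sketch of how these calls are typically applied to a whole network via `model.apply` (a standard PyTorch method that runs a function on every submodule); the layer sizes are illustrative assumptions.

```python
import torch.nn as nn

def init_weights(m):
    # He initialization for ReLU networks; biases start at zero.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
model.apply(init_weights)   # applies init_weights to every submodule, including nested ones
```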
- Training and validation cycle:
- If performance does not improve, we can pick a different combination of hyperparameters and build a different model.
- We will continue to test different values of hyperparameters until there are no more values to test.
- Then we keep the best model among those we trained in the process.
- If the performance of the best model is still not satisfactory, we try a different network architecture, add more labeled data, or try transfer learning.
- Hyperparameter space:
- Hyperparameter values strongly affect the properties of a trained NN.
- Hyperparameter tuning is time and resource intensive.
- We must decide which hyperparameters are important enough to spend the time on.
- There is no definitive answer, but while training:
- Start with default values and then change them (e.g., see the ‘cards’ in Lecture 16 + PyTorch)
- Observe which hyperparameters the model is more sensitive to
- Tune hyperparameters rather than simply keeping their default values
- Prioritize tuning the hyperparameters to which the model is most sensitive
- The table below shows the approximate sensitivity of a model to some hyperparameters.
| Hyperparameter | Sensitivity |
|---|---|
| Learning rate | High |
| Learning rate schedule | High |
| Loss function | High |
| Units per layer | High |
| Parameter initialization strategy | Medium |
| Number of layers | Medium |
| Layer properties | Medium |
| Degree of regularization | Medium |
| Choice of optimizer | Low |
| Optimizer properties | Low |
| Size of minibatch | Low |
| Choice of non-linearity | Low |
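Given this, a simple strategy is to spend most of the search budget on the high-sensitivity hyperparameters. Below is a minimal, runnable random-search sketch over learning rate and units per layer; the toy data, model size, and trial budget are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(300, 20)
y = (X[:, 0] > 0).long()                          # toy binary labels (placeholder data)
X_tr, y_tr, X_va, y_va = X[:240], y[:240], X[240:], y[240:]

def train_and_evaluate(cfg):
    # Train a small MLP under cfg and return validation accuracy (the metric P).
    model = nn.Sequential(nn.Linear(20, cfg["units"]), nn.ReLU(), nn.Linear(cfg["units"], 2))
    opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(100):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_va).argmax(dim=1) == y_va).float().mean().item()

# Random search focused on the most sensitive hyperparameters (learning rate, units per layer).
best_cfg, best_acc = None, 0.0
for _ in range(10):                               # fixed budget of trials
    cfg = {"lr": 10 ** random.uniform(-4, -1),    # sample the learning rate on a log scale
           "units": random.choice([32, 64, 128])}
    acc = train_and_evaluate(cfg)
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc
```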
#Common Configurations
#Classification
- Performance Metric (P)
- Accuracy for balanced dataset
- F1-score for imbalanced dataset
- Cost/loss function (C)
- Cross-entropy for multiclass or multilabel classification
- Parameter Initialization Strategy (W)
- Xavier when using tanh as the activation function.
- He when using ReLU as the activation function
- Optimization Algorithm (A)
- Adam: adaptive learning rate, handles sparse gradients well
- Stochastic gradient descent (SGD): strong generalization, stable on large datasets
- Hyperparameter tuning strategy (T)
- Random Search over learning rate, dropout rate, embedding dimension, layers
- Bayesian Optimization
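A minimal sketch wiring these classification choices together; the layer sizes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))   # 10-class classifier

for m in model:
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")           # W: He init for ReLU
        nn.init.zeros_(m.bias)

criterion = nn.CrossEntropyLoss()                                          # C: cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                  # A: Adam
# T: random search / Bayesian optimization would then vary lr, dropout, layer sizes, etc.
```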
#Regression
- Performance Metric (P)
- Root Mean Squared Error (RMSE) for data without large errors or outliers
- Mean Absolute Error (MAE) for data with outliers and large errors, since it is more robust to them
- Cost/loss function (C)
- Mean Squared Error (MSE): penalizes large errors heavily, i.e. very sensitive to outliers
- Huber Loss: combines MSE and MAE to lower sensitivity to outliers
- Parameter Initialization Strategy (W)
- He initialization when using ReLU as the activation function
- Optimization Algorithm (A)
- RMSProp: works well with time series, RNNs, and sequential data, but needs tuning
- Adam: general-purpose default choice for tabular data, images, and text; robust and stable
- Hyperparameter tuning strategy (T)
- Grid Search
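As an illustration of the loss choice, Huber loss reacts far less to a single outlier than MSE (toy numbers, purely illustrative); `nn.HuberLoss` exists in recent PyTorch versions, with `nn.SmoothL1Loss` as the older near-equivalent.

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.0, 3.0, 4.0, 5.0])
target = torch.tensor([2.1, 2.9, 4.2, 50.0])      # last point is an outlier

mse = nn.MSELoss()(pred, target)                  # dominated by the outlier (error is squared)
huber = nn.HuberLoss(delta=1.0)(pred, target)     # behaves like MAE beyond delta, so less sensitive
mae = nn.L1Loss()(pred, target)                   # MAE, often used as the metric P when outliers exist
```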
#Imbalanced Classification
- Performance Metric (P)
- F1
- AUC
- Cost/loss function (C)
- Weighted cross-entropy
- Parameter Initialization Strategy (W)
- Xavier
- Optimization Algorithm (A)
- AdamW
- Hyperparameter tuning strategy (T)
- Bayesian Optimization
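A sketch of the weighted cross-entropy and AdamW pieces; the class weights, model, and learning rate are illustrative assumptions (weights are typically set roughly inversely proportional to class frequency).

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                                         # assume class 1 is the rare class

class_weights = torch.tensor([0.2, 0.8])                         # up-weight the rare class
criterion = nn.CrossEntropyLoss(weight=class_weights)            # C: weighted cross-entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # A: AdamW

loss = criterion(model(torch.randn(16, 20)), torch.randint(0, 2, (16,)))        # toy batch
```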
#Time Series Forecasting
- Performance Metric (P)
- RMSE
- MAE
- Cost/loss function (C)
- MSE
- Parameter Initialization Strategy (W)
- He
- Optimization Algorithm (A)
- RMSProp
- Hyperparameter tuning strategy (T)
- Grid Search
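A minimal sketch of such a setup with an LSTM forecaster; the architecture sizes, learning rate, and toy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 1)              # predict the next value from the last hidden state

    def forward(self, x):                         # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

model = Forecaster()
criterion = nn.MSELoss()                                          # C: MSE
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)      # A: RMSProp

x = torch.randn(8, 20, 1)                                         # toy batch: 8 sequences, 20 steps
loss = criterion(model(x), torch.randn(8, 1))
```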
#Image Tasks (CNNs)
- Performance Metric (P)
- Accuracy
- Cost/loss function (C)
- Cross-Entropy
- Parameter Initialization Strategy (W)
- He
- Optimization Algorithm (A)
- SGD + Momentum
- Hyperparameter tuning strategy (T)
- Hyperband
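A minimal sketch of a small CNN classifier with these choices; the architecture and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                                            # tiny CNN for 32x32 RGB images
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                                    # 10 classes
)

for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")   # W: He initialization
        nn.init.zeros_(m.bias)

criterion = nn.CrossEntropyLoss()                                 # C: cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)   # A: SGD + momentum

loss = criterion(model(torch.randn(4, 3, 32, 32)), torch.randint(0, 10, (4,)))  # toy batch
# T: Hyperband would allocate training budgets across many such configurations.
```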