Stacking - Yousef's Notes

Stacking

  • Ensemble learning trains an ensemble model: a combination of several base models, each of which individually performs worse than the ensemble.
  • Train an ensemble of weak models; random forest and gradient boosting are classic examples of this approach.
  • Combining multiple weak models can create a strong ensemble because when multiple uncorrelated models agree, they are likely to agree on the correct outcome.
  • Uncorrelated: use different features and/or different types of weak models. The same model type with only minor differences in hyperparameter tuning is unlikely to produce a strong ensemble.
  • Three ways to combine weakly correlated models into an ensemble model:
    • Averaging: for regression or score-based classification. Average all models’ predictions for input $x$ (see the sketch after this list).
    • Majority vote: for classification. Return the class predicted by the majority of the models for input $x$.
    • Model stacking: train a stronger meta-model that takes the outputs of the other models as its inputs.
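A minimal sketch of averaging and majority vote for a single input $x$; the three score vectors are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical class-score predictions of three weakly correlated classifiers
# for one input x (scores over classes 0, 1, 2).
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.5, 0.2, 0.3])
p3 = np.array([0.2, 0.7, 0.1])

# Averaging: take the mean of the score vectors, then pick the top class.
avg = (p1 + p2 + p3) / 3
print("averaged scores:", avg, "-> class", int(np.argmax(avg)))

# Majority vote: each model casts a hard label; the most frequent label wins.
votes = [int(np.argmax(p)) for p in (p1, p2, p3)]
print("votes:", votes, "-> class", int(np.bincount(votes).argmax()))
```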

  • Combine classifiers $f_1$, $f_2$, and $f_3$ to train a meta-model.
  • All $f_i$ predict the same set of classes.
  • Create a synthetic dataset from the predictions of $f_1$, $f_2$, and $f_3$.
  • Train the meta-model on the synthetic dataset (see the sketch below).
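A hedged sketch of this setup with scikit-learn's StackingClassifier, which builds the synthetic dataset from the base models' predictions internally; the dataset, base models, and meta-model below are illustrative assumptions, not choices from these notes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder multi-class dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three base classifiers f1, f2, f3 predicting the same set of classes.
base_models = [
    ("f1", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("f2", GradientBoostingClassifier(random_state=0)),
    ("f3", SVC(probability=True, random_state=0)),
]

# The meta-model (logistic regression here) is trained on the base models'
# out-of-fold predictions (cv=5), which avoids data leakage.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)
print("stacked test accuracy:", stack.score(X_test, y_test))
```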

#Training

  • Use different machine learning algorithms and model types.
  • Keep base models weakly correlated by training each on a randomly sampled subset of the training dataset’s features.
  • The same learning algorithm can also produce sufficiently uncorrelated models when trained with very different hyperparameter values (a sketch follows).
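A small sketch of both ideas: the same algorithm trained with very different hyperparameter values, each model on its own random feature subset. The feature counts and tree depths are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Same algorithm, very different hyperparameters, each trained on its own
# random subset of 15 of the 30 features.
base_models = []
for depth in (2, 8, None):
    cols = rng.choice(X.shape[1], size=15, replace=False)
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X[:, cols], y)
    base_models.append((cols, model))

# Each model predicts using only its own feature subset; the correlation
# matrix of their predictions shows how (un)correlated they are.
preds = np.column_stack([m.predict(X[:, cols]) for cols, m in base_models])
print(np.corrcoef(preds, rowvar=False).round(2))
```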

#Evaluation & Tuning

  • Ensure the meta-model combines base model outputs effectively, without overfitting.
  • Use k-fold cross-validation to estimate how well a model generalizes to an independent dataset.
  • Recap: split the dataset into k roughly equal parts (folds); in each round, use one fold as the validation set and the remaining folds as the training set, so that every fold serves as the validation set exactly once. Average the performance across all k rounds.
  • Prevent data leakage by using only out-of-fold predictions to train the meta-model.
  • Evaluate the full stack: it should outperform individual base models on the validation set.
  • Tune base model hyperparameters independently, then train the meta-model on out-of-fold predictions.
  • Alternatively, tune the whole stack jointly via nested cross-validation (computationally expensive); see the comparison sketch after this list.
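A hedged sketch of checking that the full stack outperforms its base models under k-fold cross-validation; the models, dataset, and k=5 are assumptions for illustration. Wrapping the stack (which uses its own internal folds) in an outer cross-validation loop plays the role of the nested evaluation mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]
# cv=5 inside the stack means the meta-model sees only out-of-fold predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)

# Compare each base model and the full stack with the same outer 5-fold CV.
for name, model in base_models + [("stack", stack)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```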

#Data Leakage

  • Create the synthetic training set for the stacked model, following a process similar to the one for cross-validation.
  • First, split all training data into $>10$ blocks. The more blocks, the better, but training the models will be slower.
  • Temporarily exclude one block from the training data, and train the base models on the remaining blocks.
  • Then use the base models to make predictions for the excluded block.
  • Use those predictions to build the synthetic training examples for the excluded block; repeat for every block so that all synthetic examples come from out-of-fold predictions (see the sketch below).
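A minimal sketch of the block-wise procedure above: for each block, the base models are trained on the remaining blocks and then predict on the held-out block, so every synthetic example comes from out-of-fold predictions. The 10-block split, the two base models, and the logistic-regression meta-model are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
meta_X = np.zeros((len(X), 2))          # one synthetic feature per base model

for train_idx, held_idx in kf.split(X):
    # Train the base models on all blocks except the excluded one.
    f1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[train_idx], y[train_idx])
    f2 = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # Their predictions on the excluded block become its synthetic features.
    meta_X[held_idx, 0] = f1.predict_proba(X[held_idx])[:, 1]
    meta_X[held_idx, 1] = f2.predict_proba(X[held_idx])[:, 1]

# The meta-model never sees a prediction made on data its base model trained on.
meta_model = LogisticRegression().fit(meta_X, y)
print("meta-model accuracy on the synthetic set:", meta_model.score(meta_X, y))
```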