Transfer Learning

  • Using a pre-trained model to build a new model.

  • Pretrained models are created using large amounts of costly, curated data.

  • The parameters learned by the pretrained models can be useful for our tasks.

  • A pre-trained model can be used in two ways (a short loading sketch follows these bullets):

  • its learned parameters can be used to initialize your own model, or

  • it can be used as a feature extractor for your model.
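
A minimal sketch of what "using a pre-trained model" looks like in code, assuming PyTorch and torchvision are available; the specific model (resnet18) and its ImageNet weights are illustrative assumptions, not something fixed by these notes:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Load a model whose parameters were learned on a large, curated dataset (ImageNet).
pretrained = resnet18(weights=ResNet18_Weights.DEFAULT)

# These learned parameters are what we reuse, either to initialize our own
# model or to extract features for it (both uses are sketched further below).
n_params = sum(p.numel() for p in pretrained.parameters())
print(f"Pretrained parameters available for reuse: {n_params:,}")
```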

  • Using a pretrained model as an initializer (see the sketch after these bullets):

  • Assumes the current problem is similar to the one solved by the pretrained model.

  • The optimal parameters for our problem will be similar to the pretrained parameters

  • Especially in the initial neural network layers (i.e., those closest to the input)

  • Learning is faster because gradient descent searches for the optimal parameter values within a smaller region of potentially good values.

  • If the model is pretrained on a dataset larger than ours, searching in a region of potentially good values might also lead to better generalization.

  • It increases the chances of capturing patterns that are not present in our limited training set.

  • Retraining pretrained models is potentially very expensive and resource intensive.

  • Pretrained models are very deep: hundreds or thousands of layers, millions or billions of parameters.

  • They present challenges of vanishing/exploding gradients.
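
A minimal sketch of the initializer approach, again assuming PyTorch/torchvision; the number of classes, the dummy batch, and the hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

num_classes = 5  # hypothetical, problem-specific number of classes

# Start from the pretrained parameters instead of a random initialization,
# and replace the output layer so it matches our problem.
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# All parameters stay trainable: gradient descent starts its search from the
# pretrained values, i.e., from a smaller region of potentially good values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (stand-in for real data).
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Because every layer is updated, this is the variant that can become expensive for very deep pretrained models.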

  • Using a pretrained model as a feature extractor (see the sketch at the end of these notes):

  • Use some of the layers of the pre-trained model as feature extractors for your model.

  • Keep the N initial layers of the pretrained model, closest to and including the input layer.

  • Keep their parameters “frozen,” that is, unchanged and unchangeable (i.e., no retraining).

  • Add new layers on top of the frozen layers, including the appropriate output layer

  • Use a problem-specific dataset for the training of the added layers.

  • Only the parameters of the new layers are updated by gradient descent during training

  • Alternatively, several of the right-most pretrained layers (those closest to the added layers) could also be set as trainable.

  • How many layers of the pretrained model to use or freeze in the new model?

  • This is up to the analyst: it’s part of the decisions we will make about the architecture that will work best for our problem.
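
A minimal sketch of the feature-extractor approach, under the same PyTorch/torchvision assumptions; which layers to freeze and the architecture of the new head are illustrative choices, not prescriptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

num_classes = 5  # hypothetical, problem-specific number of classes

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze every pretrained parameter: unchanged and unchangeable during training.
for param in model.parameters():
    param.requires_grad = False

# Add new layers on top of the frozen ones, ending in the appropriate output
# layer for our problem. (Layers created here are trainable by default.)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 128),
    nn.ReLU(),
    nn.Linear(128, num_classes),
)

# Only the parameters of the new layers are given to the optimizer,
# so only they are updated by gradient descent.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)

# To also fine-tune the right-most pretrained layers, they could be unfrozen:
# for p in model.layer4.parameters():
#     p.requires_grad = True
```

How many pretrained layers to keep frozen (here, all of them except the new head) is exactly the architectural decision mentioned in the last bullet.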