Teacher Forcing

  • Technique where the target token (the ground truth) is passed as the next input to the decoder instead of the decoder's own last prediction.
  • common technique to train Basic RNN Architectures or Transformers
    • used in Image Captioning and Machine Translation
    • but also in Time Series Forecasting
  • intuition
    • math exam with dependent questions, e.g. b) depends on a), c) on b), and so on
    • if a) is wrong, all subsequent answers are also wrong
    • teacher forcing: after we answer question a), the teacher compares our answer to the correct solution, grades it, and then gives us the correct answer to a) so we can continue with b)
  • the situation in sequence generation with an RNN is similar (see the sketch below)
    • each prediction depends on the previous one, so once one prediction is wrong, all subsequent ones tend to be wrong as well
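
A minimal PyTorch sketch of such a teacher-forced decoding loop. All names and sizes here (`embed`, `gru`, `out_proj`, `vocab_size`, ...) are illustrative assumptions, not something prescribed by these notes:

```python
import torch
import torch.nn as nn

# Illustrative sizes and modules; none of these names come from the notes.
vocab_size, emb_size, hidden_size = 1000, 128, 256
embed = nn.Embedding(vocab_size, emb_size)
gru = nn.GRU(emb_size, hidden_size, batch_first=True)
out_proj = nn.Linear(hidden_size, vocab_size)

def decode_with_teacher_forcing(targets, hidden):
    """targets: (batch, seq_len) ground-truth token ids.
    At every step the ground-truth token is fed as the next input,
    not the model's own previous prediction."""
    logits_per_step = []
    inputs = targets[:, 0]                        # start token
    for t in range(1, targets.size(1)):
        emb = embed(inputs).unsqueeze(1)          # (batch, 1, emb_size)
        out, hidden = gru(emb, hidden)            # one decoding step
        logits_per_step.append(out_proj(out.squeeze(1)))
        inputs = targets[:, t]                    # teacher forcing: feed the ground truth
    return torch.stack(logits_per_step, dim=1)    # (batch, seq_len-1, vocab_size)

# Toy usage
tgt = torch.randint(0, vocab_size, (4, 10))       # fake target sequences
hidden0 = torch.zeros(1, 4, hidden_size)          # (num_layers, batch, hidden)
logits = decode_with_teacher_forcing(tgt, hidden0)
```

The key line is `inputs = targets[:, t]`; without teacher forcing it would instead be the argmax of the model's own logits.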
  • no memorization can happen
    • the network cannot look into the future
    • the ground truth is only fed in place of the last prediction, never as the current target
  • the loss does not need to be updated at each timestep; it is enough to collect the model's predictions in a list and compute the loss from that list afterwards (see the snippet below)
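
Continuing the sketch above (same assumed names), the collected per-step logits can be turned into a single loss after the loop:

```python
import torch.nn.functional as F

# Compute the loss once over all timesteps, outside the decoding loop.
logits = decode_with_teacher_forcing(tgt, hidden0)    # (batch, T-1, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),   # flatten the timesteps
    tgt[:, 1:].reshape(-1),           # ground truth, shifted by one step
)
loss.backward()
```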
  • pros
    • training converges faster, because the model's early predictions are very bad and would otherwise derail all following steps
  • cons
    • no ground truth labels are available during inference, so no teacher forcing can be applied
    • discrepancy between the model's behavior during training and during inference (see the inference sketch after this list)
      • can lead to poor model performance and instability
      • known as Exposure Bias
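
To make the train/inference mismatch concrete, the same (assumed) decoder run at inference time has to feed back its own predictions, since no ground truth is available; this free-running loop is where exposure bias shows up:

```python
@torch.no_grad()
def decode_greedy(start_tokens, hidden, max_len=10):
    """Inference: no ground truth exists, so each step consumes the
    model's own previous prediction; early mistakes propagate."""
    inputs = start_tokens                         # (batch,) start token ids
    generated = []
    for _ in range(max_len):
        emb = embed(inputs).unsqueeze(1)
        out, hidden = gru(emb, hidden)
        inputs = out_proj(out.squeeze(1)).argmax(dim=-1)  # own prediction fed back
        generated.append(inputs)
    return torch.stack(generated, dim=1)          # (batch, max_len)

# Toy usage: start from the same start tokens as in the training sketch
samples = decode_greedy(tgt[:, 0], torch.zeros(1, 4, hidden_size))
```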