Ensemble Distillation

  • Distilling the Knowledge in a Neural Network
  • training many different models on the same data and then to average their predictions
  • making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets
  • compress the knowledge in an ensemble into a single model
  • MNIST
  • ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse
  • specialist models can be trained rapidly and in parallel
  • distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes
  • performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.