Chinchilla

  • Training Compute-Optimal Large Language Models
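  • prior work (Kaplan et al., 2020) suggested that, given a 10x increase in computational budget, model size should increase 5.5x while the number of training tokens should increase only 1.8x
  • Chinchilla's finding instead: model size and the number of training tokens should be scaled in equal proportions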
  • collecting high-quality datasets will play a key role in further scaling of LLMs
  • investigates the optimal model size and number of tokens for training a Transformer language model under a given compute budget
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled
  • Chinchilla (70B parameters, trained with the same compute budget as Gopher but on 4x more data) significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks
  • Chinchilla also uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage
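  • MMLU: Chinchilla reaches a state-of-the-art average accuracy of 67.5%, more than a 7% improvement over Gopher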
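
To make the two scaling rules in these notes concrete, here is a minimal Python sketch. It assumes training compute is proportional to (parameters x tokens), so if model size grows as compute^a, tokens grow as compute^(1 - a). The function name `scale_factors` and the variable names are hypothetical and chosen for illustration; the exponent for the model-heavy rule is simply back-computed from the 5.5x figure quoted in the notes, not taken from any fitted value.

```python
import math


def scale_factors(compute_multiplier, model_exponent):
    """Return (model_multiplier, token_multiplier) for a compute increase.

    Assumption (illustrative): compute ~ parameters * tokens, so the
    token exponent is 1 - model_exponent.
    """
    model_mult = compute_multiplier ** model_exponent
    token_mult = compute_multiplier ** (1.0 - model_exponent)
    return model_mult, token_mult


# Equal scaling: model size and tokens each get exponent 0.5.
equal = scale_factors(10.0, 0.5)

# Model-heavy rule: exponent implied by "10x compute -> 5.5x model size".
model_heavy_exponent = math.log10(5.5)
model_heavy = scale_factors(10.0, model_heavy_exponent)

# Doubling both model size and tokens corresponds to 4x compute.
doubling = scale_factors(4.0, 0.5)

print(f"equal scaling, 10x compute: model x{equal[0]:.2f}, tokens x{equal[1]:.2f}")
print(f"model-heavy,   10x compute: model x{model_heavy[0]:.2f}, tokens x{model_heavy[1]:.2f}")
print(f"equal scaling,  4x compute: model x{doubling[0]:.2f}, tokens x{doubling[1]:.2f}")
```

Under this assumption, equal scaling spends a 10x compute increase as roughly 3.2x more parameters and 3.2x more tokens, whereas the model-heavy rule puts most of the increase into parameters.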