When primed with only a handful of training samples, very large pretrained language models such as GPT-3 have shown results competitive with those of fully supervised, fine-tuned, large pretrained language models.
For the same prompt, the order in which samples are provided can make the difference between state-of-the-art and random performance: essentially, some permutations are “fantastic” and some are not.
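To make the notion of a sample permutation concrete, the following minimal sketch enumerates every ordering of a fixed set of training samples and builds one prompt per ordering. The template string, the example reviews, and the function name `build_prompts` are illustrative assumptions, not taken from the original work.

```python
from itertools import permutations

# Hypothetical prompt template; the actual templates are not specified here.
TEMPLATE = "Review: {text}\nSentiment: {label}"

def build_prompts(samples, query, template=TEMPLATE):
    """Return one prompt per ordering of the same training samples."""
    prompts = []
    for ordering in permutations(samples):
        demos = "\n\n".join(template.format(text=t, label=l) for t, l in ordering)
        # The query is appended with an empty label for the model to complete.
        query_part = template.format(text=query, label="").rstrip()
        prompts.append(demos + "\n\n" + query_part)
    return prompts

# Illustrative samples only (assumed, not from any dataset in the paper).
samples = [
    ("A joy to watch.", "positive"),
    ("Dull and overlong.", "negative"),
    ("Sharp, funny script.", "positive"),
    ("I walked out.", "negative"),
]
prompts = build_prompts(samples, "Surprisingly moving.")
```

With four samples there are 4! = 24 candidate prompts that differ only in the order of their demonstrations, which is the space over which performance is claimed to vary.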
This problem is prevalent across tasks, model sizes (even for the largest current models), and prompt templates; it is not related to a specific subset of samples or to the number of training samples; and a given good permutation for one model is not transferable to another.
A novel probing method exploits the generative nature of language models to construct an artificial development set.
Performant permutations for prompts are identified using entropy-based statistics over this set, which yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
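One way such an entropy-based selection could look is sketched below. The text does not specify the exact statistic, so this sketch makes two assumptions: the statistic is the Shannon entropy of the labels a permutation predicts over the artificial development set, and higher entropy (a less skewed label distribution) is preferred. The names `label_entropy` and `rank_permutations` and the toy predictions are hypothetical.

```python
import math
from collections import Counter

def label_entropy(predicted_labels):
    """Shannon entropy (in nats) of the predicted-label distribution."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def rank_permutations(predictions_by_permutation):
    """Order permutation ids from highest to lowest label entropy.

    Assumption: a permutation whose predictions collapse onto one label
    (low entropy) is treated as a poor candidate.
    """
    return sorted(
        predictions_by_permutation,
        key=lambda name: label_entropy(predictions_by_permutation[name]),
        reverse=True,
    )

# Toy predictions over a four-example artificial development set (assumed).
predictions = {
    "order_a": ["pos", "pos", "pos", "pos"],  # degenerate: always one label
    "order_b": ["pos", "neg", "pos", "neg"],  # balanced predictions
    "order_c": ["pos", "pos", "pos", "neg"],  # skewed predictions
}
ranking = rank_permutations(predictions)
```

In this toy case the balanced permutation ranks first and the degenerate one last. No labeled data is needed for the ranking itself, since only the model's own predictions on the generated set are used.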