A proxy objective is easier to change or measure than the actual objective
Suppose we have some sample space $S$ (such as the set of possible question-answer pairs), some probability distribution $P$ over $S$, a true objective (or “reward”) $R_{\text{true}}: S \to \mathbb{R}$, and a proxy objective $R_{\text{proxy}}: S \to \mathbb{R}$, and that we optimize $R_{\text{proxy}}$ to obtain a new distribution $P'$
$\mathbb{E}_{x' \sim P'}[R_{\text{true}}(x')]$ measures how well the true objective is optimized
Given $N \ge n$ samples from $P$, simultaneously consider every possible subset of these samples of size $n$, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then take the weighted average of the true objective. Each sample's weight is the binomial coefficient $\binom{k-1}{n-1}$, where $k$ is the rank of the sample under the proxy objective, from 1 (worst) up to $N$ (best)
The same $N$ samples can be reused for different values of $n$
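The weighted estimator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_of_n_true_estimate` and its arguments are hypothetical, the scores are plain lists of floats, and proxy scores are assumed to have no ties (ties are broken arbitrarily by sort order).

```python
from math import comb


def best_of_n_true_estimate(proxy_scores, true_scores, n):
    """Estimate the expected true objective of the sample that is best
    under the proxy among n draws, using all N >= n samples at once.

    Hypothetical helper for illustration; assumes no ties in proxy_scores.
    """
    N = len(proxy_scores)
    if len(true_scores) != N:
        raise ValueError("proxy_scores and true_scores must have equal length")
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")

    # Order sample indices by proxy score, worst first, so that the
    # rank k runs from 1 (worst) up to N (best).
    order = sorted(range(N), key=lambda i: proxy_scores[i])

    # A sample of rank k is the best member of comb(k - 1, n - 1) subsets
    # of size n: the other n - 1 members must come from the k - 1 samples
    # ranked below it.
    weighted_sum = 0.0
    for k, i in enumerate(order, start=1):
        weighted_sum += comb(k - 1, n - 1) * true_scores[i]

    # The weights sum to comb(N, n), the total number of size-n subsets.
    return weighted_sum / comb(N, n)


if __name__ == "__main__":
    proxy = [0.1, 0.5, 0.9, 0.3]
    true = [1.0, 2.0, 3.0, 4.0]
    # The same four samples are reused for every value of n.
    for n in range(1, 5):
        print(n, best_of_n_true_estimate(proxy, true, n))
```

With $n = 1$ every weight is 1 and the estimate reduces to the plain average of the true objective; with $n = N$ only the top-ranked sample contributes.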
The KL divergence $D_{\mathrm{KL}}(P' \,\|\, P)$ measures how much optimization is done
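For intuition, the KL divergence can be computed directly when both distributions are discrete and known explicitly. The sketch below assumes exactly that (two probability lists over the same finite set of outcomes); the name `kl_divergence` is hypothetical, and the result is in nats.

```python
from math import log


def kl_divergence(p_new, p_old):
    """Return D_KL(P' || P) in nats for two discrete distributions given
    as probability lists over the same outcomes.

    Hypothetical helper for illustration; inputs are assumed to be valid
    probability vectors of equal length.
    """
    if len(p_new) != len(p_old):
        raise ValueError("distributions must cover the same outcomes")

    total = 0.0
    for q, p in zip(p_new, p_old):
        if q == 0:
            continue  # terms with zero mass under P' contribute nothing
        if p == 0:
            return float("inf")  # P' puts mass where P has none
        total += q * log(q / p)
    return total


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    sharpened = [0.7, 0.1, 0.1, 0.1]
    print(kl_divergence(uniform, uniform))    # no optimization: 0.0
    print(kl_divergence(sharpened, uniform))  # positive: P' has moved away from P
```

A value of zero means $P'$ is identical to $P$; larger values mean the optimized distribution has moved further from the original one.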