Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)?
