What's going on with the Open LLM Leaderboard?
Recently, an interesting discussion arose on Twitter following the release of Falcon 🦅 and its addition to the Open LLM Leaderboard, a public leaderboard comparing open-access large language models.

The discussion centered around one of the four evaluations displayed on the leaderboard: a benchmark for measuring Massive Multitask Language Understanding (shortname: MMLU).

The community was surprised that the MMLU evaluation numbers of the current top model on the leaderboard, the LLaMA model 🦙, were significantly lower than the numbers reported in the published LLaMA paper.

So we decided to dive down the rabbit hole to understand what was going on and how to fix it 🕳🐇

In our quest, we talked with both the great @javier-m, who collaborated on the evaluations of LLaMA, and the amazing @slippylolo from the Falcon team. That being said, any errors below should of course be attributed to us rather than to them!

Along this journey, you'll learn a lot about the different ways a model can be evaluated on a single benchmark, and whether or not to believe the numbers you see online and in papers.

Ready? Then buckle up, we're taking off 🚀.