What's going on with the Open LLM Leaderboard?
Recently an interesting discussion arose on Twitter following the release of Falcon 🦅 and its addition to the Open LLM Leaderboard, a public leaderboard comparing open access large language models.
The discussion centered around one of the four evaluations displayed on the leaderboard: a benchmark for measuring Massive Multitask Language Understanding (shortname: MMLU).
The community was surprised that the MMLU evaluation numbers of the current top model on the leaderboard, the LLaMA model 🦙, were significantly lower than the numbers reported in the published LLaMA paper.
So we decided to dive into the rabbit hole to understand what was going on and how to fix it 🕳🐇
In our quest, we talked with both the great @javier-m, who collaborated on the evaluations of LLaMA, and the amazing @slippylolo from the Falcon team. That said, any errors below should be attributed to us rather than to them, of course!
Along this journey with us you'll learn a lot about the different ways a model can be scored on a single evaluation, and whether or not to believe the numbers you see online and in papers.
Ready? Then buckle up, we're taking off 🚀.