https://cdn-uploads.huggingface.co/production/uploads/64a... | https://cdn-uploads.huggingface.co/production/uploads/64a...
https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/QVl0iPtT5aUhlvViyEpgs.png Single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench:

* We evaluate this benchmark using chain-of-thought prompting.

+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.

Click to view multi-image results on Mantis Eval, BLINK Val, Mathverse mv, Sciverse mv, MIRB.
Click to view video results on Video-MME and Video-ChatGPT.
Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.
Examples