LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
While developing Docmatix, we noticed that fine-tuning Florence-2 on it produced qualitatively strong answers on DocVQA, yet those answers received low scores under the benchmark's metric. To raise the scores, we had to fine-tune the model further on DocVQA itself so it would learn the answer syntax the benchmark expects. Interestingly, human evaluators judged this additionally fine-tuned model to be worse, which is why we used it only for ablation studies and released the model trained solely on Docmatix for broader use.

Although the generated answers semantically align with the reference answers, as illustrated in Figure 1, they still receive low scores. This raises two questions: should we fine-tune models simply to improve these metrics, or should we develop new metrics that better align with human perception?
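To make the second option concrete, below is a minimal sketch of what an LLM-assisted rating in the spirit of LAVE could look like: an instruction-tuned LLM judges whether a candidate answer matches the reference, instead of relying on exact string overlap. This is not the exact implementation used in this post; the judge model name, the prompt wording, and the 1-to-3 rating scale here are illustrative assumptions.

```python
# Sketch of LAVE-style scoring: an LLM rates semantic agreement between a
# candidate answer and the reference answer, rather than string overlap.
from transformers import pipeline

# Hypothetical choice of judge model; any instruction-tuned LLM could be used.
judge = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def lave_style_score(question: str, reference: str, candidate: str) -> str:
    prompt = (
        "You are grading answers to a question about a document.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate 1 (incorrect), 2 (partially correct), or 3 (correct). "
        "Reply with the number only.\nRating:"
    )
    out = judge(prompt, max_new_tokens=5, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the rating.
    return out[0]["generated_text"][len(prompt):].strip()

# Example: a semantically correct answer that exact-match metrics would score as 0.
print(lave_style_score("What is the due date?", "2021-06-15", "June 15, 2021"))
```

A judge of this kind would give the answer in the example full credit, whereas format-sensitive metrics penalize it, which is exactly the mismatch with human perception described above.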