Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
https://arxiv.org/abs/2408.04594

Abstract
High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components.