🔥 What's New
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, when we dig into current evaluation works, we identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by more than 20% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating that these samples were memorized from large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal performance gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 challenging samples meticulously selected by humans. MMStar is designed to benchmark 6 core capabilities and 18 detailed axes, aiming to evaluate the multi-modal capabilities of LVLMs with a carefully balanced and purified selection of samples. The samples are first coarsely selected from current benchmarks with an automated pipeline; strict human review is then applied to ensure that each selected sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities to solve. In addition to the traditional accuracy metric, we also develop two metrics to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on our MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the newly proposed metrics to investigate their data leakage and actual multi-modal gain. We hope these efforts can serve as a valuable addition to the research community in evaluating LVLMs.
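As a rough illustration of how the two proposed metrics can be derived from plain accuracy scores, here is a minimal Python sketch. The score names and the exact formulation (a simple difference for the gain, a clipped difference for the leakage) are our reading of the description above and should be treated as assumptions rather than the paper's reference implementation.

```python
# Minimal sketch of the two metrics described above (assumed formulation):
#   s_v  -- LVLM accuracy with visual input
#   s_wv -- LVLM accuracy with the image removed (text-only prompt)
#   s_t  -- accuracy of the LVLM's LLM backbone, text-only

def multimodal_gain(s_v: float, s_wv: float) -> float:
    """Performance attributable to actually seeing the image."""
    return s_v - s_wv

def multimodal_leakage(s_wv: float, s_t: float) -> float:
    """Extra score the LVLM obtains without images beyond its LLM backbone,
    clipped at zero; a rough proxy for samples leaked into multi-modal training."""
    return max(0.0, s_wv - s_t)

if __name__ == "__main__":
    # Illustrative (made-up) accuracies in percent
    s_v, s_wv, s_t = 57.0, 44.0, 40.0
    print(f"multi-modal gain:    {multimodal_gain(s_v, s_wv):.1f}")
    print(f"multi-modal leakage: {multimodal_leakage(s_wv, s_t):.1f}")
```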
We highlight cases in existing multi-modal benchmarks where evaluation samples either lack visual dependency or have unintentionally leaked into the training data of LLMs and LVLMs. (a) Some samples can be answered by LLMs using only text-based world knowledge; (b) for some instances, the question itself contains the answer, making images superfluous; (c) some samples that have leaked into LLMs' training corpora can be "recalled" directly from the textual questions and answers; (d) some samples that are indiscernible to LLMs but can be solved by LVLMs without accessing images suggest leakage into LVLMs' multi-modal training data.
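A straightforward way to surface cases (a)-(c) is to query a model with the question and options but without the image and check whether it still answers correctly. The sketch below is only illustrative: `query_model` is a placeholder for any text-only inference call, and the sample fields (`question`, `options`, `answer`) are assumed names, not a documented schema.

```python
# Hedged sketch: flag evaluation samples a model answers correctly
# without ever seeing the image. `query_model` is a placeholder for
# any text-only inference call (an LLM, or an LVLM prompted image-free).

def flag_vision_free_samples(samples, query_model, num_trials=3):
    """Return samples that are consistently answered correctly from text alone."""
    flagged = []
    for sample in samples:
        prompt = (
            f"Question: {sample['question']}\n"
            f"Options: {', '.join(sample['options'])}\n"
            "Answer with the option letter only."
        )
        hits = sum(
            query_model(prompt).strip().upper().startswith(sample["answer"])
            for _ in range(num_trials)
        )
        if hits == num_trials:  # correct in every trial without the image
            flagged.append(sample)
    return flagged
```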
After applying the coarse filtering process and manual review, we narrow down from a total of 22,401 samples to 11,607 candidate samples, and finally select 1,500 high-quality samples to construct our MMStar benchmark. In MMStar, we display the 6 core capabilities in the inner ring, with the 18 detailed axes presented in the outer ring. The middle ring shows the number of samples for each detailed dimension. Each core capability contains a balanced set of 250 samples, and we further ensure a relatively even distribution across the 18 detailed axes.
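For readers loading the released samples, a simple sanity check on the intended distribution might look like the sketch below; the field name `core_capability` and the exact metadata schema are assumptions for illustration, not the dataset's documented format.

```python
from collections import Counter

def check_mmstar_balance(samples):
    """Verify the 6 core capabilities x 250 samples split described above."""
    counts = Counter(s["core_capability"] for s in samples)  # field name assumed
    assert len(samples) == 1500, "expected 1,500 samples in total"
    assert len(counts) == 6, "expected 6 core capabilities"
    assert all(n == 250 for n in counts.values()), "each capability should hold 250 samples"
    return counts
```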
@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}