Are We on the Right Way for Evaluating
Large Vision-Language Models?

NeurIPS 2024
1 University of Science and Technology of China 2 The Chinese University of Hong Kong
3 Shanghai Artificial Intelligence Laboratory

* Equal contribution. Corresponding authors.
§ Work done during an internship at Shanghai Artificial Intelligence Laboratory.

🚀 Two Key Issues that Lead to Misjudgment of LVLM Capability
🚀 An Elite Vision-Indispensable Multi-modal Benchmark, MMStar
🚀 Two Metrics: Multi-modal Gain (MG) and Multi-modal Leakage (ML)

🔥What's New

  • [2024.09.26] 🎉🎉🎉 MMStar is accepted by NeurIPS 2024!
  • [2024.04.16] 🚀 MMStar has been supported in the VLMEvalKit repository and OpenCompass leaderboard!
  • [2024.04.02] 🚀 Huggingface Dataset and evaluation code are available!
  • [2024.04.01] 🚀 We released the ArXiv paper.

Abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, digging into current evaluation works, we identify two primary issues:
1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 20% on average.
2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some vision-necessary questions without visual content, indicating that these samples have been memorized from large-scale training data. For example, Sphinx-X-MoE scores 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%.
Both problems lead to misjudgments of actual multi-modal performance gains and may misguide the study of LVLMs.

To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 challenging samples meticulously selected by humans. MMStar is designed to benchmark 6 core capabilities and 18 detailed axes, aiming to evaluate the multi-modal capacities of LVLMs with a carefully balanced and purified selection of samples. The samples are first roughly selected from current benchmarks with an automated pipeline; strict human review then ensures that each selected sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities to solve. In addition to the traditional accuracy metric, we develop two metrics to measure data leakage and the actual performance gain from multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the newly proposed metrics to investigate their data leakage and actual multi-modal gain. We hope these efforts can serve as a valuable addition to the research community's evaluation of LVLMs.
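For reference, the two proposed metrics can be derived from three accuracies per benchmark: the LVLM evaluated with images, the LVLM evaluated with images withheld ("LVLM-text"), and its LLM base evaluated text-only. The snippet below is a minimal sketch of that bookkeeping, assuming MG is the accuracy gained by actually seeing the images and ML is the image-free accuracy in excess of the LLM base (floored at zero); it is our illustration, not the official evaluation code.

```python
# Minimal sketch of the two metrics (illustrative; not the official implementation).
# s_v : accuracy of the LVLM evaluated with images
# s_wv: accuracy of the LVLM evaluated without images ("LVLM-text")
# s_t : accuracy of the LVLM's LLM base, text-only

def multi_modal_gain(s_v: float, s_wv: float) -> float:
    """Accuracy actually gained by seeing the images."""
    return s_v - s_wv

def multi_modal_leakage(s_wv: float, s_t: float) -> float:
    """Image-free accuracy beyond the LLM base, attributed to leaked samples."""
    return max(0.0, s_wv - s_t)

if __name__ == "__main__":
    s_v, s_wv, s_t = 55.4, 27.3, 19.8  # hypothetical scores, in percent
    print(f"MG = {multi_modal_gain(s_v, s_wv):.1f}")
    print(f"ML = {multi_modal_leakage(s_wv, s_t):.1f}")
```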

The Overlooked Issues for Evaluating LVLMs

We highlight cases in existing multi-modal benchmarks where evaluation samples either lack visual dependency or have unintentionally leaked into the training data of LLMs and LVLMs.


(a) Some samples can be answered by LLMs using only text-based world knowledge;
(b) For some instances, the question itself contains the answer, making images superfluous;
(c) Some samples that have leaked into LLMs' training corpora can be "recalled" directly from the textual questions and answers;
(d) Some samples that are indiscernible to LLMs but solved by LVLMs without accessing images suggest leakage into LVLMs' multi-modal training data.
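These four cases can be checked with a simple text-only probe: query the LLM base and the LVLM with the question and options but no image, then compare against the image-grounded prediction. The sketch below is our illustration of that logic; `ask_llm_text_only` and `ask_lvlm` are hypothetical helpers standing in for real model calls.

```python
# Illustrative text-only probe for the four cases above.
# ask_llm_text_only / ask_lvlm are hypothetical callables, not part of any released code.

def classify_sample(question, options, answer, image, ask_llm_text_only, ask_lvlm):
    llm_text = ask_llm_text_only(question, options)        # LLM base, no image
    lvlm_text = ask_lvlm(question, options, image=None)    # LVLM, image withheld
    lvlm_full = ask_lvlm(question, options, image=image)   # LVLM with the image

    if llm_text == answer:
        # cases (a)/(b)/(c): solvable from text alone (world knowledge,
        # answer embedded in the question, or leakage into the LLM corpus)
        return "not vision-indispensable or leaked into LLM training"
    if lvlm_text == answer:
        # case (d): the LLM fails but the LVLM answers without the image,
        # suggesting leakage into multi-modal training data
        return "suspected leakage into LVLM training"
    if lvlm_full == answer:
        return "vision-indispensable, solved with the image"
    return "vision-indispensable, unsolved"
```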

Evaluation of various LVLMs on 6 popular multi-modal benchmarks

For the "strategy" column, "LLM" refers to evaluating using the corresponding LLM base of the LVLM, while "LVLM-text" denotes evaluating LVLMs without accessing images.
We employ the 0-shot inference strategy for LLMs to align the evaluation protocols of LVLMs.
The highest results of the LVLM-text setting across the models are highlighted in bold and underlined.
For the entire LVLMs' results, please refer to the appendix.

MMStar Benchmark

After applying the coarse filtering process and manual review, we narrow down from a total of 22,401 samples to 11,607 candidate samples, and finally select 1,500 high-quality samples to construct our MMStar benchmark.


In MMStar, we display the 6 core capabilities in the inner ring, with the 18 detailed axes presented in the outer ring. The middle ring shows the number of samples for each detailed dimension. Each core capability contains a meticulously balanced set of 250 samples, and we further ensure a relatively even distribution across the 18 detailed axes.
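As noted in the news above, MMStar is released as a Hugging Face dataset. The snippet below sketches how the per-capability balance could be verified after loading it; the dataset id `Lin-Chen/MMStar`, the split name, and the column names `category`/`l2_category` are assumptions about the public release and may need adjusting.

```python
# Sketch: load MMStar and count samples per core capability / detailed axis.
# Dataset id, split, and column names are assumptions; adjust to the actual release.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("Lin-Chen/MMStar", split="val")  # 1,500 samples expected

core_capabilities = Counter(ds["category"])   # 6 core capabilities, ~250 each
detailed_axes = Counter(ds["l2_category"])    # 18 detailed axes

print(core_capabilities)
print(detailed_axes)
```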

MMStar Leaderboard

CP (coarse perception), FP (fine-grained perception), IR (instance reasoning), LR (logical reasoning), ST (science & technology), MA (mathematics).
The best results are highlighted in bold and underlined. The worst results on the multi-modal gain (MG) and multi-modal leakage (ML) metrics are in bold and red.
| # | Model | LLM | Param. | CP | FP | IR | LR | ST | MA | Avg. ✅ | MG ⬆️ | ML ⬇️ |
|---|-------|-----|--------|----|----|----|----|----|----|---------|-------|-------|
| 1 | GPT4V (high) 🥇 | GPT4-Turbo | - | 76.6 | 51.4 | 66.6 | 55.8 | 42.6 | 49.8 | 57.1 | 43.6 | 1.3 |
| 2 | InternLM-XComposer2 🥈 | InternLM2-7B | 7B | 70.8 | 48.8 | 65.2 | 56.4 | 42.0 | 49.2 | 55.4 | 28.1 | 7.5 |
| 3 | LLaVA-Next 🥉 | NH2-Yi-34B | 34B | 66.4 | 52.0 | 62.4 | 46.0 | 32.4 | 53.6 | 52.1 | 29.4 | 2.4 |
| 4 | GPT4V (low) | GPT4-Turbo | - | 62.0 | 32.8 | 55.2 | 48.0 | 33.6 | 44.8 | 46.1 | 32.6 | 1.3 |
| 5 | InternVL-Chat-v1.2 | NH2-Yi-34B | 40B | 67.6 | 43.2 | 61.2 | 47.2 | 24.0 | 19.2 | 43.7 | 32.6 | 0.0 |
| 6 | GeminiPro-Vision | GeminiPro | - | 51.6 | 28.8 | 50.8 | 46.0 | 28.4 | 50.0 | 42.6 | 27.4 | 0.0 |
| 7 | MiniCPM-V-2 | MiniCPM-2B-sft | 2B | 60.4 | 31.6 | 45.2 | 45.2 | 29.2 | 32.4 | 40.7 | 18.7 | 3.5 |
| 8 | Sphinx-X-MoE | Mixtral-8x7b | 57B | 58.4 | 40.8 | 47.6 | 35.2 | 19.2 | 32.0 | 38.9 | 14.8 | 1.0 |
| 9 | Monkey-Chat | Qwen-7B | 10B | 57.6 | 36.4 | 51.6 | 33.2 | 26.4 | 24.4 | 38.3 | 13.5 | 17.6 |
| 10 | Yi-VL | Yi-6B | 6B | 58.0 | 33.6 | 46.4 | 34.8 | 20.4 | 34.0 | 37.9 | 15.6 | 0.0 |
| 11 | Qwen-VL-Chat | Qwen-7B | 8B | 59.6 | 32.0 | 50.8 | 29.2 | 22.0 | 31.6 | 37.5 | 23.9 | 0.0 |
| 12 | Deepseek-VL | Deepseek-7B | 8B | 64.0 | 30.8 | 49.2 | 36.4 | 21.6 | 20.4 | 37.1 | 15.7 | 0.0 |
| 13 | CogVLM-Chat | Vicuna-v1.5-7B | 17B | 66.8 | 36.8 | 49.2 | 31.2 | 23.6 | 11.6 | 36.5 | 14.9 | 0.0 |
| 14 | Yi-VL | Yi-34B | 34B | 53.2 | 31.2 | 52.0 | 32.4 | 12.4 | 35.2 | 36.1 | 18.8 | 0.0 |
| 15 | TinyLLaVA | Phi2-2.7B | 3B | 60.4 | 31.6 | 50.8 | 30.4 | 18.0 | 24.8 | 36.0 | 16.4 | 7.6 |
| 16 | ShareGPT4V | Vicuna-v1.5-7B | 7B | 58.8 | 28.0 | 45.6 | 24.4 | 17.2 | 24.0 | 33.0 | 11.9 | 0.0 |
| 17 | LLaVA-1.5 | Vicuna-v1.5-13B | 13B | 58.8 | 28.0 | 41.6 | 24.4 | 18.4 | 25.6 | 32.8 | 13.9 | 0.0 |
| 18 | LLaVA-1.5 | Vicuna-v1.5-7B | 7B | 58.8 | 24.0 | 38.8 | 24.0 | 13.6 | 22.8 | 30.3 | 10.7 | 0.0 |
| 19 | Random Choice | - | - | 23.7 | 24.5 | 25.3 | 24.3 | 24.8 | 25.1 | 24.6 | - | - |

📃 BibTeX


      @article{chen2024we,
        title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
        author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
        journal={arXiv preprint arXiv:2403.20330},
        year={2024}
      }