Are We on the Right Way for Evaluating
Large Vision-Language Models?


1 University of Science and Technology of China 2 The Chinese University of Hong Kong
3 Shanghai Artificial Intelligence Laboratory

* Equal contribution. Corresponding authors.
§ Work done during an internship at Shanghai Artificial Intelligence Laboratory.

🚀 Two Key Issues that Lead to Misjudgment of LVLM Capability
🚀 An Elite Vision-Indispensable Multi-modal Benchmark, MMStar
🚀 Two Metrics: Multi-modal Gain (MG) and Multi-modal Leakage (ML)


NEWS

🎉 MMStar has been supported in the VLMEvalKit repository and OpenCompass leaderboard!

Abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues:
1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 20% on average.
2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some vision-necessary questions without the visual content, indicating that these samples were memorized during large-scale training. For example, Sphinx-X-MoE scores 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%.
Both problems lead to misjudgments of actual multi-modal performance gains and potentially misguide the study of LVLMs.

To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 challenging samples meticulously selected by humans. MMStar is designed to benchmark 6 core capabilities and 18 detailed axes, aiming to evaluate the multi-modal capabilities of LVLMs with a carefully balanced and purified selection of samples. The samples are first roughly selected from current benchmarks with an automated pipeline; strict human review is then applied to ensure that each selected sample exhibits visual dependency, shows minimal data leakage, and requires advanced multi-modal capabilities to solve. In addition to the traditional accuracy metric, we also develop two metrics to measure data leakage and actual performance gain from multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the newly proposed metrics to investigate their data leakage and actual multi-modal gain. We hope these efforts can serve as a valuable addition to the research community in evaluating LVLMs.
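For intuition, the two metrics can be read as follows: multi-modal gain (MG) is the score an LVLM gains from actually seeing the images, and multi-modal leakage (ML) is the part of its image-free score that its LLM base cannot account for. The snippet below is a minimal sketch of this reading, assuming MG is the gap between the LVLM's scores with and without images and ML is the non-negative gap between the LVLM-text score and the LLM base's score; the score values are placeholders, and the exact formulation is given in the paper.

```python
def multi_modal_gain(score_lvlm: float, score_lvlm_text: float) -> float:
    """MG (assumed reading): improvement an LVLM obtains from seeing the images,
    i.e., its score with visual input minus its score with text-only input."""
    return score_lvlm - score_lvlm_text


def multi_modal_leakage(score_lvlm_text: float, score_llm_base: float) -> float:
    """ML (assumed reading): the portion of the image-free LVLM score that its
    LLM base cannot reach, clipped at zero; a proxy for leaked evaluation samples."""
    return max(0.0, score_lvlm_text - score_llm_base)


# Illustrative placeholder accuracies (%), not taken from the paper's tables.
mg = multi_modal_gain(score_lvlm=55.0, score_lvlm_text=22.0)
ml = multi_modal_leakage(score_lvlm_text=22.0, score_llm_base=18.0)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")
```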

The Overlooked Issues for Evaluating LVLMs

We highlight cases in existing multi-modal benchmarks where evaluation samples either lack visual dependency or have unintentionally leaked into the training data of LLMs and LVLMs.


(a) Some samples can be answered by LLMs using only text-based world knowledge;
(b) For some instances, the question itself contains the answer, making images superfluous;
(c) Some samples that have leaked into LLMs' training corpora can be "recalled" directly from the textual questions and answers;
(d) Some samples that LLMs cannot answer but LVLMs can solve without accessing images suggest leakage into LVLMs' multi-modal training data.

Evaluation of various LVLMs on 6 popular multi-modal benchmarks

For the "strategy" column, "LLM" refers to evaluating using the corresponding LLM base of the LVLM, while "LVLM-text" denotes evaluating LVLMs without accessing images.
We employ the 0-shot inference strategy for LLMs to align the evaluation protocols of LVLMs.
The highest results of the LVLM-text setting across the models are highlighted in bold and underlined.
For the complete results of all LVLMs, please refer to the appendix.
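As a rough illustration of these two text-only settings, the sketch below builds a 0-shot multiple-choice prompt and shows where the image is (or is not) supplied. The prompt template and the `model.generate` interface are hypothetical placeholders, not the exact protocol used in the paper.

```python
from typing import List, Optional

def build_prompt(question: str, options: List[str]) -> str:
    """Hypothetical 0-shot multiple-choice prompt; the paper's exact template may differ."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Question: {question}\n{lettered}\n"
        "Answer with the option's letter from the given choices directly."
    )

def evaluate_sample(model, question: str, options: List[str],
                    image_path: Optional[str] = None) -> str:
    """'LLM' / 'LVLM-text' settings: call with image_path=None (no visual input).
    Full LVLM setting: also pass the image path."""
    prompt = build_prompt(question, options)
    # `model.generate` stands in for whichever LLM/LVLM API is actually used;
    # the only point illustrated is the presence or absence of the image.
    return model.generate(prompt, image=image_path)
```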

MMStar Benchmark

After applying the automated coarse filter and a manual review, we narrow a total of 22,401 samples down to 11,607 candidates and finally select 1,500 high-quality samples to construct our MMStar benchmark.
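The coarse filter can be thought of as a first pass that keeps only samples text-only models cannot already solve, before human review. The sketch below is an assumption-laden illustration of that idea: the threshold, the set of text-only judge models, and the sample fields are hypothetical, not the paper's exact criteria.

```python
from typing import Callable, Dict, List

# A sample is assumed to carry at least "question", "options", and "answer" fields.
Sample = Dict[str, object]
# A text-only predictor maps a sample (question + options, no image) to an option letter.
TextOnlyPredictor = Callable[[Sample], str]

def coarse_filter(samples: List[Sample],
                  predictors: List[TextOnlyPredictor],
                  max_text_solvers: int = 0) -> List[Sample]:
    """Keep candidates that text-only models mostly fail on, so surviving samples
    are more likely to require the image and less likely to be leaked.
    Both the threshold and the predictor set are illustrative assumptions."""
    kept = []
    for sample in samples:
        # Count how many text-only models answer correctly without the image.
        solved_by = sum(predict(sample) == sample["answer"] for predict in predictors)
        if solved_by <= max_text_solvers:
            kept.append(sample)
    return kept
```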


In MMStar, we display the 6 core capabilities in the inner ring, with the 18 detailed axes presented in the outer ring. The middle ring showcases the number of samples for each detailed dimension. Each core capability contains 250 meticulously balanced samples, and we further ensure a relatively even distribution across the 18 detailed axes.
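The sketch below illustrates the balanced composition as a stratified draw of an equal number of samples per core capability. It is a simplification under assumed field names (e.g., a "capability" tag per sample): the actual 1,500 samples were hand-curated and also balanced across the 18 detailed axes.

```python
import random
from collections import defaultdict
from typing import Dict, List

CORE_CAPABILITIES = ["coarse perception", "fine-grained perception",
                     "instance reasoning", "logical reasoning",
                     "science & technology", "mathematics"]

def balanced_select(candidates: List[Dict[str, object]],
                    per_capability: int = 250, seed: int = 0) -> List[Dict[str, object]]:
    """Illustrative stratified selection: draw an equal number of samples per
    core capability from the reviewed candidate pool. The random draw stands in
    for the manual curation described above."""
    random.seed(seed)
    by_capability = defaultdict(list)
    for sample in candidates:          # each candidate is assumed to carry a "capability" tag
        by_capability[sample["capability"]].append(sample)
    selected = []
    for capability in CORE_CAPABILITIES:
        selected += random.sample(by_capability[capability], per_capability)
    return selected
```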

MMStar Leaderboard

CP (coarse perception), FP (fine-grained perception), IR (instance reasoning), LR (logical reasoning), ST (science & technology), MA (mathematics).
The best results are highlighted in bold and underlined. The worst results of the multi-modal gain (MG) and multi-modal leakage (ML) metrics are in bold and red.
| # | Model | LLM | Param. | CP | FP | IR | LR | ST | MA | Avg.✅ | MG⬆️ | ML⬇️ |
|---|-------|-----|--------|----|----|----|----|----|----|--------|------|------|
| 1 | GPT4V (high) 🥇 | GPT4-Turbo | - | 76.6 | 51.4 | 66.6 | 55.8 | 42.6 | 49.8 | 57.1 | 43.6 | 1.3 |
| 2 | InternLM-XComposer2 🥈 | InternLM2-7B | 7B | 70.8 | 48.8 | 65.2 | 56.4 | 42.0 | 49.2 | 55.4 | 28.1 | 7.5 |
| 3 | LLaVA-Next 🥉 | NH2-Yi-34B | 34B | 66.4 | 52.0 | 62.4 | 46.0 | 32.4 | 53.6 | 52.1 | 29.4 | 2.4 |
| 4 | GPT4V (low) | GPT4-Turbo | - | 62.0 | 32.8 | 55.2 | 48.0 | 33.6 | 44.8 | 46.1 | 32.6 | 1.3 |
| 5 | InternVL-Chat-v1.2 | NH2-Yi-34B | 40B | 67.6 | 43.2 | 61.2 | 47.2 | 24.0 | 19.2 | 43.7 | 32.6 | 0.0 |
| 6 | GeminiPro-Vision | GeminiPro | - | 51.6 | 28.8 | 50.8 | 46.0 | 28.4 | 50.0 | 42.6 | 27.4 | 0.0 |
| 7 | MiniCPM-V-2 | MiniCPM-2B-sft | 2B | 60.4 | 31.6 | 45.2 | 45.2 | 29.2 | 32.4 | 40.7 | 18.7 | 3.5 |
| 8 | Sphinx-X-MoE | Mixtral-8x7b | 57B | 58.4 | 40.8 | 47.6 | 35.2 | 19.2 | 32.0 | 38.9 | 14.8 | 1.0 |
| 9 | Monkey-Chat | Qwen-7B | 10B | 57.6 | 36.4 | 51.6 | 33.2 | 26.4 | 24.4 | 38.3 | 13.5 | 17.6 |
| 10 | Yi-VL | Yi-6B | 6B | 58.0 | 33.6 | 46.4 | 34.8 | 20.4 | 34.0 | 37.9 | 15.6 | 0.0 |
| 11 | Qwen-VL-Chat | Qwen-7B | 8B | 59.6 | 32.0 | 50.8 | 29.2 | 22.0 | 31.6 | 37.5 | 23.9 | 0.0 |
| 12 | Deepseek-VL | Deepseek-7B | 8B | 64.0 | 30.8 | 49.2 | 36.4 | 21.6 | 20.4 | 37.1 | 15.7 | 0.0 |
| 13 | CogVLM-Chat | Vicuna-v1.5-7B | 17B | 66.8 | 36.8 | 49.2 | 31.2 | 23.6 | 11.6 | 36.5 | 14.9 | 0.0 |
| 14 | Yi-VL | Yi-34B | 34B | 53.2 | 31.2 | 52.0 | 32.4 | 12.4 | 35.2 | 36.1 | 18.8 | 0.0 |
| 15 | TinyLLaVA | Phi2-2.7B | 3B | 60.4 | 31.6 | 50.8 | 30.4 | 18.0 | 24.8 | 36.0 | 16.4 | 7.6 |
| 16 | ShareGPT4V | Vicuna-v1.5-7B | 7B | 58.8 | 28.0 | 45.6 | 24.4 | 17.2 | 24.0 | 33.0 | 11.9 | 0.0 |
| 17 | LLaVA-1.5 | Vicuna-v1.5-13B | 13B | 58.8 | 28.0 | 41.6 | 24.4 | 18.4 | 25.6 | 32.8 | 13.9 | 0.0 |
| 18 | LLaVA-1.5 | Vicuna-v1.5-7B | 7B | 58.8 | 24.0 | 38.8 | 24.0 | 13.6 | 22.8 | 30.3 | 10.7 | 0.0 |
| 19 | Random Choice | - | - | 23.7 | 24.5 | 25.3 | 24.3 | 24.8 | 25.1 | 24.6 | - | - |

📃 BibTeX


      @article{chen2024we,
        title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
        author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
        journal={arXiv preprint arXiv:2403.20330},
        year={2024}
      }