MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct



Authors: Run Luo and 15 other authors

Abstract: The development of Multimodal Large Language Models (MLLMs) has seen significant advancements, driven by increasing demands in various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches attempt to enhance the capabilities of MLLMs through diverse architectures, the gains have become increasingly marginal. Conversely, data-driven methods, which scale up image-text instruction data, are more effective but face challenges of limited data diversity and complexity. The absence of high-quality data constitutes a significant development barrier for MLLMs. To address this data quality bottleneck, we propose MMEvol, a novel multimodal instruction data evolution framework. The framework iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that empowers MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, extend visual reasoning steps to improve cognitive reasoning abilities, and thoroughly explore fine-grained information within images to enhance visual understanding and robustness. To comprehensively evaluate the effectiveness of our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained with the initial seed data, our method achieves an average accuracy improvement of 3.1 percentage points. Furthermore, our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than competing state-of-the-art models.
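To make the iterative evolution idea concrete, the following is a minimal sketch of such a data-evolution loop. It is not the paper's released code: the function names (evolve_instruction, mmevol_round, call_generator) and the sampling strategy are assumptions for illustration; only the three evolution directions (fine-grained perception, cognitive reasoning, interaction evolution) and the seed-data starting point come from the abstract.

```python
import random

# Hypothetical names; the paper does not publish this exact API.
EVOLUTION_OPS = [
    "fine_grained_perception",   # probe finer visual details in the image
    "cognitive_reasoning",       # extend the visual reasoning chain
    "interaction_evolution",     # diversify the instruction/interaction type
]

def call_generator(prompt: str) -> str:
    """Placeholder for a call to a strong (M)LLM that rewrites instructions."""
    return prompt  # stand-in: a real implementation would query a model here

def evolve_instruction(sample: dict, op: str) -> dict:
    """Produce a harder or more diverse variant of one image-text instruction."""
    prompt = (
        f"Rewrite the instruction below using the '{op}' evolution strategy, "
        f"keeping it answerable from the image.\n\n{sample['instruction']}"
    )
    return {**sample, "instruction": call_generator(prompt), "evolution": op}

def mmevol_round(seed_data: list[dict]) -> list[dict]:
    """One evolution round: evolve each seed instruction with a randomly chosen op."""
    return [evolve_instruction(s, random.choice(EVOLUTION_OPS)) for s in seed_data]

# Iterate a few rounds starting from a seed set (e.g., SEED-163K in the paper);
# a full implementation would also filter degenerate rewrites between rounds.
data = [{"image": "img_0001.jpg", "instruction": "What is the person holding?"}]
for _ in range(3):
    data = mmevol_round(data)
```

In this sketch each round rewrites every instruction once; the actual framework combines the three directions more carefully and evaluates the evolved data before training.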

Submission history

From: Run Luo [view email]
[v1]
Mon, 9 Sep 2024 17:44:00 UTC (9,845 KB)
[v2]
Sun, 15 Sep 2024 13:32:08 UTC (9,847 KB)
[v3]
Thu, 19 Sep 2024 16:17:38 UTC (9,847 KB)


