Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data, by Shuhao Gu and 25 other authors
Abstract: Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: this https URL.
Submission history
From: Shuhao Gu
[v1] Thu, 24 Oct 2024 09:03:48 UTC (1,629 KB)
[v2] Mon, 6 Jan 2025 12:48:47 UTC (7,757 KB)