Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) process data from different modalities like text, audio, image, and video.

Compared to text-only models, MLLMs achieve richer contextual understanding and can integrate information across modalities, unlocking new areas of application. Prime use cases of MLLMs include content creation, personalized recommendations, and human-machine interaction.

Examples of MLLMs that process image and text data include Microsoft’s Kosmos-1, DeepMind’s Flamingo, and the open-source LLaVA. Google’s PaLM-E additionally handles information about a robot’s state and surroundings.

Combining different modalities and dealing with different types of data comes with some challenges and limitations, such as alignment of heterogeneous data, inherited biases from pre-trained models, and lack of robustness.

How would you translate the sentence “The glasses are broken.” into French: “Les verres sont cassés.” or “Les lunettes sont cassées.”? What if you had an image? Would you be able to choose the correct translation then? As humans, we use different modalities daily to enhance communication. Machines can do the same.

Access to visual context can resolve ambiguity when translating between languages. In this example, the image of drinking glasses resolves the ambiguity in the meaning of “glasses” when translating the sentence from English to French. | Modified based on: source

While Large Language Models (LLMs) have shown impressive capabilities in understanding complex text, they are limited to a single data modality. However, many tasks span several modalities.

This article explores Multimodal Large Language Models: their core functionalities, challenges, and potential for various machine-learning domains.

What is a multimodal large language model?

Let’s break down the concept of Multimodal Large Language Models (MLLMs) by first understanding the terms “modal” and “multimodal:”

“Modal” refers to a particular way of communicating or perceiving information. It’s like a channel through which we receive and express ourselves. Some of the common modalities are: 

  • Visual: Sight, including images, videos, and spatial information.
  • Auditory: Hearing, including sounds, music, and speech.
  • Textual: Written language, including words, sentences, and documents.
  • Haptic: Touch, including sensations of texture, temperature, and pressure.
  • Olfactory: Smell

“Multimodal” refers to incorporating various modalities to create a richer understanding of the task, e.g., as on a website or in a blog post that integrates text with visuals.

MLLMs can process not just text but other modalities as well. They are trained on samples containing different modalities, which allows them to develop joint representations and utilize multimodal information to solve tasks.

Why do we need multimodal LLMs?

Many industries heavily rely on multimodality, particularly those that handle a blend of data modalities. For example, MLLMs can be used in a healthcare setting to process patient reports comprising doctor notes (text), treatment plans (structured data), and X-rays or MRI scans (images).

Example of a multi-modal model. The model is trained on X-rays, medical reports, actions, and texts describing the diagnosis and outcome. This way, the model learns to use visual and textual information to predict potential diagnoses. | Modified based on: source

MLLMs process and integrate information from different modalities (i.e., text, image, video, and audio), essential to solving many tasks. Some prominent applications are:

  1. Content creation: MLLMs can generate image captions, transform text into visually descriptive narratives, or create multimedia presentations, making them valuable tools for creative and professional industries.
  2. Enhanced human-machine interaction: By understanding and responding to inputs from diverse modalities such as text, speech, and images, MLLMs enable more natural communication. This can enrich the user experience in applications like virtual assistants, chatbots, and smart devices.
  3. Personalized recommendations: MLLMs contribute to refining recommendation systems by analyzing user preferences across diverse modalities. Whether suggesting movies based on textual reviews, recommending products through image recognition, or personalizing content recommendations across varied formats, these models elevate the precision and relevance of recommendations.
  4. Domain-specific problem solving: MLLMs are adaptable and invaluable in addressing challenges across various domains. In healthcare, their capability to interpret medical images aids in diagnostics, while in education, they enhance learning experiences by providing enriched materials that seamlessly combine text and visuals.

How do multimodal LLMs work?

A typical multimodal LLM has three primary modules:

  • The input module comprises specialized neural networks for each specific data type that output intermediate embeddings.
  • The fusion module converts the intermediate embeddings into a joint representation.
  • The output module generates outputs based on the task and the processed information. An output could be, e.g., a text, a classification (like “dog” for an image), or an image. Some MLLMs, like Google’s Gemini family, can produce outputs in more than one modality.

Basic structure of a multimodal LLM. Different modalities are processed by separate input modules. Then, the extracted information is joined in the fusion module. The output module (in this case, a classifier) generates the output in the desired modality.
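
To make this structure concrete, here is a minimal PyTorch sketch of the three modules. It is a toy illustration, not any production model: the text encoder is a tiny transformer, the image features are assumed to come pre-extracted from some vision backbone, and all names and dimensions are made up.

```python
import torch
import torch.nn as nn

class MiniMultimodalClassifier(nn.Module):
    """Toy input -> fusion -> output structure (illustrative dimensions only)."""

    def __init__(self, vocab_size=32000, text_dim=512, image_dim=768,
                 joint_dim=512, num_classes=10):
        super().__init__()
        # Input module: one specialized network per modality.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, text_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
                num_layers=2,
            ),
        )
        self.image_encoder = nn.Linear(image_dim, joint_dim)  # stand-in for a vision backbone
        self.text_proj = nn.Linear(text_dim, joint_dim)

        # Fusion module: combine the intermediate embeddings into a joint representation.
        self.fusion = nn.Sequential(nn.Linear(2 * joint_dim, joint_dim), nn.ReLU())

        # Output module: here, a simple classification head.
        self.classifier = nn.Linear(joint_dim, num_classes)

    def forward(self, token_ids, image_features):
        text_emb = self.text_encoder(token_ids).mean(dim=1)  # pooled text embedding
        image_emb = self.image_encoder(image_features)       # pooled image embedding
        joint = self.fusion(torch.cat([self.text_proj(text_emb), image_emb], dim=-1))
        return self.classifier(joint)

# Dummy forward pass: a batch of 2 token sequences and 2 pre-extracted image features.
model = MiniMultimodalClassifier()
logits = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 768))  # shape: (2, 10)
```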

Examples of multimodal LLMs

Microsoft: Kosmos-1

Kosmos-1 (GitHub) is a multimodal LLM created by Microsoft for natural language and perception-intensive tasks. It can perform visual dialogue, visual explanation, visual question answering, image captioning, math equations, OCR, and zero-shot image classification with and without descriptions.

Architecture and training

Kosmos-1 processes inputs consisting of text and encoded image embeddings. Image embeddings are obtained through the pre-trained CLIP ViT-L/14 (GitHub) model. An embedding module processes this input before feeding it into a transformer-based decoder based on Magneto.

Kosmos-1 used the same initialization as the Magneto transformer for better optimization stability. To capture position information more precisely and better generalize to different sequence lengths (short sequences for training, long ones during testing), Kosmos-1 used xPOS as a relative position encoder.

Kosmos-1 has about 1.6 billion parameters in total, which is smaller than rival models like Flamingo, LLaVA, or GPT-4o. It was trained from scratch on web-scale multimodal corpora (text corpora, image-caption pairs, and interleaved image-text data).
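
As a rough sketch of this pattern (not Microsoft's implementation), the snippet below encodes an image with the pre-trained CLIP ViT-L/14 vision tower via Hugging Face transformers and projects the patch embeddings into a decoder's embedding space. The connector layer and its output dimension are assumptions for illustration; Kosmos-1's actual embedding module and Magneto decoder are not reproduced here.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Pre-trained CLIP ViT-L/14 vision tower (the backbone family Kosmos-1 builds on).
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.new("RGB", (224, 224))  # stand-in for a real input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    patch_embeddings = vision_encoder(pixel_values).last_hidden_state  # (1, 257, 1024)

# A learned projection maps the 1024-d CLIP features into the decoder's embedding
# space; the 2048-d target below is illustrative, not Kosmos-1's actual size.
connector = nn.Linear(1024, 2048)
image_tokens = connector(patch_embeddings)  # (1, 257, 2048)

# These image token embeddings would then be interleaved with text token embeddings
# (delimited by special image markers) and fed to the transformer decoder, which
# generates text auto-regressively.
```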

A main limitation of Kosmos-1 is the limited number of input tokens (2,048) across text and image modalities.

Performance

The creators of Kosmos-1 proposed the Raven IQ test dataset to evaluate the nonverbal reasoning capabilities of MLLMs, the first time a model of this kind was evaluated on nonverbal reasoning. The experimental results from the Kosmos-1 paper show that while Kosmos-1 performs slightly better than random chance (randomly choosing one of the options), it is still far from the average adult result on the same test. Nevertheless, this shows that MLLMs have the capability of nonverbal reasoning by aligning perception with language models.

Experimental results published in the Kosmos-1 paper show that MLLMs benefit from cross-modal transfer: learning from one modality and transferring that knowledge to other modalities is more beneficial than using only one modality.

Microsoft published promising results for Kosmos-1 on the OCR-free language understanding task. In this task, the model reads and comprehends the meaning of words and sentences directly from the images. Microsoft also demonstrated that providing descriptions in the context improves the accuracy of zero-shot image classification tasks.

Examples of different Kosmos-1 tasks. The model can explain an image (1, 2) or answer questions based on an image (3, 4). Kosmos-1 can also extract information from a text in an image (5) or answer math questions (6). The model is able to combine these capabilities to answer questions that require locating specific information in an image (7, 8). | Source
Chain-of-thought prompting with Kosmos-1. In the first stage, given an image, a prompt is used to guide the model in generating a rationale. The model is then fed the rationale and a task-aware prompt to produce the final results. | Source

DeepMind: Flamingo

Flamingo architecture overview. Visual data is processed through a pretrained, frozen image encoder to extract image embeddings. These embeddings are passed through a Perceiver Resampler, trained from scratch, which outputs a fixed number of embeddings. The fixed image embeddings and text tokens are fed into gated cross-attention dense blocks, inserted between the frozen LLM blocks and trained from scratch. The model produces free-form text as output. | Source

Flamingo, a vision language model (VLM) developed by DeepMind, can perform various multimodal tasks, including image captioning, visual dialogue, and visual question answering (VQA). Flamingo models take interleaved image data and text as input and generate free-form text.

Flamingo consists of pre-trained vision and language models connected by a “Perceiver Resampler.” The Perceiver Resampler takes as input a variable number of image or video features from the pre-trained vision encoder and returns a fixed number of visual outputs. A pre-trained and frozen Normalizer-Free ResNet (NFNet) is used as the vision encoder, and a frozen Chinchilla is used as the language model. Gated cross-attention dense blocks (GATED XATTN-DENSE) are inserted between the frozen LLM blocks and trained from scratch. The largest Flamingo model has 80B parameters and is trained on three datasets scraped from the web: interleaved image and text, image-text pairs, and video-text pairs.
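
A simplified PyTorch sketch of the gated cross-attention idea is shown below (layer norms, masking, and the real dimensions are omitted, so this illustrates the mechanism rather than reproducing DeepMind's implementation). Text tokens attend to the visual tokens produced by the Perceiver Resampler, and tanh gates initialized at zero mean the frozen language model's behavior is preserved at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Simplified sketch of Flamingo's GATED XATTN-DENSE block."""

    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at zero: tanh(0) = 0, so the block is initially an identity
        # mapping and does not perturb the frozen LM it is inserted into.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # Text queries attend to the fixed set of visual tokens from the Perceiver Resampler.
        attn_out, _ = self.cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # passed on to the next frozen LM block

# Dummy example: 16 text tokens attend to 64 visual tokens.
block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 1024), torch.randn(2, 64, 1024))  # shape: (2, 16, 1024)
```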

Experimental results on 16 multimodal image/video and language tasks show that the Flamingo 80B model is more effective than models fine-tuned for specific tasks. However, as Flamingo focuses more on open-ended tasks, its performance on classification tasks is not as good as that of contrastive models like BASIC, CLIP, and ALIGN.

Some limitations that Flamingo inherits from the pre-trained LLM include hallucinations, poor sample efficiency during training, and poor generalization to sequences longer than those used during training. Other limitations that many VLMs struggle with are outputting offensive language, toxicity, propagating social biases and stereotypes, and leaking private information. One way to mitigate these issues is to filter such content out of the training data and exclude it during evaluation.

LLaVA

The Large Language and Vision Assistant (LLaVA) is an end-to-end trained multimodal LLM that integrates the CLIP ViT-L/14 vision encoder and Vicuna (a chat model created by fine-tuning LLaMA) for general-purpose visual and language understanding.

Given an input image, the pre-trained CLIP ViT-L/14 vision encoder extracts the vision features, which are transformed into the word embedding space using a simple linear layer. Vicuna was chosen as the LLM because, at the time, it was the best open-source instruction-following model for language tasks.

Overview of LLaVA architecture. The pretrained CLIP ViT-L/14 vision encoder extracts visual features from input images Xv, which are then mapped into the word embedding space using a linear projection W. | Source

LLaVA is trained using a two-stage instruction-tuning process. In the first pre-training stage for feature alignment, both the vision encoder and LLM weights are frozen, and the projection matrix is updated to align image features with the pre-trained LLM word embedding. In the second stage, end-to-end fine-tuning is performed to optimize the model for multimodal chatbot interactions and reasoning within the science domain.
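
The sketch below illustrates the freezing pattern behind this two-stage recipe using placeholder modules and illustrative learning rates; it is not the actual LLaVA training code.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the pre-trained components (the real model uses the
# CLIP ViT-L/14 vision tower and Vicuna); sizes are illustrative.
vision_encoder = nn.Linear(1024, 1024)
llm = nn.Linear(4096, 4096)
projection = nn.Linear(1024, 4096)  # the projection matrix W

# Stage 1 (feature alignment): freeze the vision encoder and the LLM, train only W.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False
stage1_optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-3)

# Stage 2 (end-to-end fine-tuning): unfreeze the LLM and train it together with W;
# the vision encoder stays frozen.
for p in llm.parameters():
    p.requires_grad = True
stage2_optimizer = torch.optim.AdamW(
    list(projection.parameters()) + list(llm.parameters()), lr=2e-5
)
```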

Experimental results show that LLaVA 7B has better instruction-following capabilities than GPT-4 and Flamingo 80B despite having fewer parameters. LLaVA can follow user instructions and give a more comprehensive answer than GPT-4. LLaVA also outperforms GPT-4 on the ScienceQA dataset, which contains multimodal multiple-choice questions from the natural, social, and language sciences.

LLaVA has some limitations, including its perception of images as a “bag of patches,” failing to grasp the complex semantics within them. Similar to Flamingo, it inherits biases from both the vision and language encoders and is prone to hallucinations and misinformation. Unlike Flamingo, LLaVA cannot handle multiple images, as it was not trained on such instructions.

This example shows LLaVA’s capabilities of visual reasoning and chat. LLaVA accurately follows the user’s instructions instead of simply describing the scene and offers a comprehensive response. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image. | Source

Google: PaLM-E

Google developed an embodied language model, PaLM-E, to incorporate continuous sensor modalities into language models and establish the link between words and perceptions.

PaLM-E is a general-purpose MLLM for embodied reasoning, visual language, and language tasks. PaLM-E uses multimodal sentences, where inputs from different modalities (i.e., images in blue, state estimate of a robot in green) are inserted alongside text tokens (in orange) as input to an LLM and are trained end-to-end. PaLM-E can perform different tasks like robotic planning, visual question answering (VQA), and image captioning. | Source

Architecture and training

PaLM-E is a decoder-only LLM that auto-regressively generates text using a multimodal prompt consisting of text, tokenized image embeddings, and state estimates representing quantities like a robot’s position, orientation, and velocity.

PaLM-E combines PaLM, a decoder-only LLM with 540 billion parameters, and the ViT vision transformer by projecting the latter’s image representations into the former’s input token space. The same approach, which relies on a learned transformation function, is used for projecting state estimates.
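
Conceptually, a multimodal sentence boils down to projecting image (or state) features into the LLM's token embedding space and splicing them into the text embedding sequence. The sketch below illustrates this with placeholder dimensions and a hypothetical helper function; it is not PaLM-E's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes: the ViT feature dimension and LLM embedding dimension are placeholders.
vit_dim, llm_dim = 1024, 4096
project = nn.Linear(vit_dim, llm_dim)  # learned transformation into the LLM token space

def build_multimodal_prompt(text_embeddings, image_features, insert_position):
    """Splice projected image features into the text embedding sequence, forming a
    'multimodal sentence' that the decoder processes like ordinary tokens."""
    image_tokens = project(image_features)  # (num_patches, llm_dim)
    return torch.cat(
        [text_embeddings[:insert_position], image_tokens, text_embeddings[insert_position:]],
        dim=0,
    )

# Dummy example: a 12-token text prompt with 16 image patch embeddings spliced in.
prompt = build_multimodal_prompt(
    torch.randn(12, llm_dim), torch.randn(16, vit_dim), insert_position=4
)  # shape: (28, 4096)
```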

Performance

Experimental results show that PaLM-E outperforms other baselines like SayCan and PaLI in different robotic domains and tasks. This shows that combining the pre-trained PaLM and ViT with the full mixture of robotics and general visual-language data increases performance compared to training individual models on individual tasks. Moreover, PaLM-E outperforms Flamingo in VQA tasks and PaLM in language tasks.

PaLM-E-562B has many capabilities, including zero-shot multimodal chain-of-thought (CoT) reasoning, multi-image reasoning, OCR-free math reasoning, image captioning, VQA, and few-shot prompting.

Challenges, limitations, and future directions of MLLMs

Expanding LLMs to other modalities comes with challenges regarding data quality, interpretation, safety, and generalization. In a survey paper, Paul Liang et al. proposed a new taxonomy to characterize the challenges and limitations of large multimodal language models:

  1. Representation: How can one represent different modalities in a meaningful and comprehensive manner?

    Fusion, i.e., integrating two or more modalities and reducing the number of separate representations, is a closely related challenge. Fusion can happen after unimodal encoders capture unique representations of different modalities or directly using raw modalities, which is more challenging as data is heterogeneous.

    Representation coordination aims to organize different modalities in a shared coordinate space using a similarity measure such as Euclidean or cosine distance. The objective is to position similar modalities close together and dissimilar ones far apart. For instance, the representation of the text “a bike” and an image of a bike should be close together in cosine distance but far away from an image of a cat.

    Human cognition offers valuable insights into developing and further improving multimodal models. Understanding how the brain processes different modalities and combining them can be a promising direction for proposing new approaches to multimodal learning and enabling more effective analysis of complex data.

  2. Alignment: Another challenge is identifying cross-modal connections and interactions between elements of different modalities. For instance, how can we align gestures with speech when a person is talking? Or how can we align an image with a description?

    When the elements of multiple modalities are discrete (i.e., there is a clear segmentation between elements, like words in a text) and supervised data exists, contrastive learning is used. It matches the representations of the same concepts expressed in different modalities (e.g., the word “car” with an image of a car).

    If ground truth is unavailable, alignment is done across all the elements of the modalities to learn the necessary connections and matchings between them. For example, aligning video clips with text descriptions when there are no ground truth labels linking descriptions to clips requires comparing each video embedding with each text embedding. A similarity score (e.g., cosine similarity) is calculated for all pairs and used to align the modalities (a minimal sketch of this similarity computation follows this list).

    Alignment is more challenging when elements of a modality are continuous (like time-series data) or data does not contain clear semantic boundaries (e.g., MRI images). Clustering can be used to group continuous data based on semantic similarity to achieve modality alignment.

    Further, current multimodal models struggle with long-range sequences and cannot learn interactions over long periods. For instance, aligning the text “After 25 minutes in the oven, the cupcakes are golden brown” with the correct scene in a video requires understanding that “25 minutes in the oven” corresponds to a specific scene later in the video. Capturing and aligning long-term interactions that happen very far in time and space is challenging and complex, but it is an important and promising future direction that needs to be explored.

  3. Reasoning: Reasoning is a complex process that involves drawing conclusions from knowledge through multiple logical steps and observations.

    One reasoning-related challenge in MLLMs is structure modeling, which involves learning and representing the relationships over which reasoning happens. Understanding hierarchical relationships where smaller components (atoms) are combined to create larger ones (molecules) is essential for complex reasoning. 

    Another challenge is encoding or representing multimodal concepts during reasoning so that they are interpretable and effective using attention mechanisms, language, or symbols. It is very important to understand how to go from low-level representations (e.g., pixels of an image or words) to high-level concepts (e.g., “What color is the jacket?”) while still being interpretable by humans.

    Understanding the reasoning process of the trained models and how they combine elements from different modalities (i.e., text, vision, audio) is very important for their transparency, reliability, and performance. This will help to discover potential biases and limitations in the reasoning process of MLLMs, enabling the development of robust models to overcome these challenges.

  4. Generation: Research is ongoing on generating meaningful outputs that reflect cross-modal interaction and are structured and coherent.

    Generative models focus on generating raw modalities (text, images, or videos) and capturing the relationships and interactions between different modalities. For instance, guided text summarization uses input modalities such as images, video, or audio to compress the data and summarize the most relevant and important information from the original content.

    Multimodal translation maps one modality to another while respecting semantic connections and information content. Generating novel high-dimensional data conditioned on initial inputs is extremely challenging. It has to preserve semantics, be meaningful and coherent, and capture many possible generations (different styles, colors, and shapes of the same scene).

    One of the main challenges of multimodal generation is the difficulty of evaluating the generated content, primarily when ethical issues (e.g., generating deepfakes, hate speech, and fake news) are involved. Evaluation through user studies is time-consuming, costly, and prone to bias.

    An insightful direction for future work is to study whether the risk of the above ethical issues is reduced or increased when using a multimodal dataset, and whether there are ethical issues specific to multimodal generation. Multimodal datasets may reduce ethical issues as they are more diverse and contextually complete and may improve model fairness. On the other hand, the biases from one modality can interact with and amplify biases in other modalities, leading to complex ethical issues (e.g., combining video with text metadata may reveal sensitive information).

  5. Transference: In multimodal modeling, transference refers to transferring knowledge from a secondary modality to a primary modality whose resources are limited (e.g., lack of annotated data, unreliable labels, noisy inputs). By leveraging the information from the secondary modality, the primary modality can enhance performance and learn new capabilities that would not be possible without the shared information.

    In cross-modal transfer settings, large-scale pre-trained models are fine-tuned for specific downstream tasks with a focus on the primary modality, for example, fine-tuning pre-trained frozen large language models for image captioning. On the other hand, multimodal co-learning aims to transfer the learned information by sharing intermediate spaces between modalities. In this case, a single joint model is used across all modalities, for instance, training on both image and text modalities and using the model for image classification. In contrast, model induction, exemplified by co-training, promotes independent training of models and only exchanges their predictions (outputs) to enable information transfer while maintaining separation.

    Learning from many modalities increases the data heterogeneity and complexity challenges during data processing. Dealing with modalities that are not all present simultaneously is a direction that needs further exploration to enhance multimodal models’ performance.

  6. Quantification: Quantification aims to better understand and improve multimodal models’ reliability, interpretability, and robustness. Understanding the dimensions of heterogeneity and their effect on multimodal learning and modeling is very important. Exploring the interactions and connections between modalities enhances the understanding of how trained models combine them. Improving how multimodal models are trained and optimized is crucial to achieving better generalization, usability, and efficiency.

    Having formal guidelines and theories for evaluating which modalities are beneficial or harmful (e.g., under adversarial attacks) is a critical challenge. Understanding which modalities to select and comparing them systematically is very important for improving multimodal models. Furthermore, it is essential to interpret and explain the complex relationships and patterns of multimodal models before employing them in real-world applications. For instance, recognizing social biases in the data (text or image) is key to ensuring fairness while guaranteeing the robustness of the model against noisy or out-of-distribution modalities. These unresolved core challenges require thorough analysis to ensure that multimodal models can be reliably applied across different domains.
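
As a minimal illustration of the alignment idea discussed above (matching elements across modalities via a similarity score), the sketch below computes a cosine-similarity matrix between image and text embeddings. The embeddings here are random placeholders; in practice, they would come from trained encoders.

```python
import torch
import torch.nn.functional as F

def alignment_scores(image_embeddings, text_embeddings):
    """Cosine-similarity matrix between every image and every text embedding.
    In contrastive training (CLIP-style), matching pairs are pushed toward high
    similarity and non-matching pairs toward low similarity."""
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(text_embeddings, dim=-1)
    return img @ txt.T  # shape: (num_images, num_texts)

# Dummy example: 4 video/image clips vs. 4 candidate text descriptions.
scores = alignment_scores(torch.randn(4, 512), torch.randn(4, 512))
best_match = scores.argmax(dim=1)  # most similar description for each clip
```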

As this extensive list of open research questions and practical challenges shows, multimodal LLMs are still in their early stages. The LLaVA GitHub repository and the unit on multi-modal models in the Hugging Face Community Computer Vision Course are excellent resources to dive deeper and get hands-on experience training and fine-tuning MLLMs.
