VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models

[Submitted on 23 Sep 2024 (v1), last revised 15 Nov 2024 (this version, v2)]

Authors: Jingtao Cao and 2 other authors

Abstract: Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models’ ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model’s generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
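The divergence computation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the input is a hypothetical matrix of image-text similarity scores (row i holds the scores of the image generated from prompt i against every sampled prompt), each row is turned into a conditional distribution over prompts with a softmax, the marginal is the average of those conditionals, and the score is the mean KL divergence from each conditional to the marginal. The function names, the `temperature` parameter, and the direction of the divergence are not taken from the paper.

```python
import math


def softmax(row, temperature=1.0):
    """Turn a row of similarity scores into a probability distribution."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_kl_to_marginal(similarity, temperature=1.0):
    """Mean KL(conditional || marginal) over all generated images.

    similarity[i][j] is an assumed image-text score between the image
    generated from prompt i and prompt j (for example, a CLIP cosine
    similarity). This is an illustrative sketch, not the paper's formula.
    """
    # Conditional distribution over prompts for each generated image.
    conditionals = [softmax(row, temperature) for row in similarity]
    n_images = len(conditionals)
    n_prompts = len(conditionals[0])

    # Marginal distribution over prompts: average of the conditionals.
    marginal = [
        sum(cond[j] for cond in conditionals) / n_images
        for j in range(n_prompts)
    ]

    # KL divergence of each conditional from the marginal, then the mean.
    total_kl = 0.0
    for cond in conditionals:
        total_kl += sum(
            p * math.log(p / q) for p, q in zip(cond, marginal) if p > 0
        )
    return total_kl / n_images
```

Under these assumptions, a model whose images all match every prompt equally well scores 0, while a model whose images each match only their own prompt approaches log(N) for N prompts, so a higher value would indicate that the model responds distinctly to a wider range of prompts.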