NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

stp2y, January 13, 2025


[Submitted on 27 May 2024 (v1), last revised 9 Jan 2025 (this version, v2)]

By Chankyu Lee and 6 other authors


Abstract: Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLMs as versatile embedding models, while maintaining simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For the training algorithm, we introduce a two-stage contrastive instruction-tuning method. The first stage applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. The second stage blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize hard-negative mining, synthetic data generation, and existing publicly available datasets to boost the performance of the embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024 and August 30, 2024, respectively) across 56 embedding tasks, demonstrating the sustained effectiveness of the proposed methods over time. Additionally, it achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB.
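The abstract mentions contrastive training with in-batch negatives and curated hard negatives but does not spell out the loss. As a rough illustration only, the sketch below uses an InfoNCE-style objective, which is a common choice for this setup and is an assumption here, not something the abstract confirms. The function names (`cosine`, `contrastive_loss`), the use of cosine similarity, the temperature of 0.05, and the choice to give each query only its own hard negatives are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(queries, positives, hard_negatives, temperature=0.05):
    """Mean InfoNCE-style loss over a batch (illustrative assumption).

    queries[i], positives[i]: embedding vectors for the i-th training pair.
    hard_negatives[i]: list of mined negative embeddings for query i.
    For query i, positives[j] with j != i serve as in-batch negatives.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Score the query against every positive in the batch;
        # index i is the true match, the rest are in-batch negatives.
        logits = [cosine(q, p) / temperature for p in positives]
        # Append this query's own hard negatives as extra candidates.
        logits += [cosine(q, n) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_z - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]
    hard_negatives = [[[-1.0, 0.0]], [[0.0, -1.0]]]
    print(contrastive_loss(queries, positives, hard_negatives))
```

In this toy batch each query matches its own positive exactly, so the loss is close to zero; swapping the two positives makes each query's true target look like a negative and the loss rises sharply. In real training the vectors would come from the embedding model, and the loss would be computed with a tensor library so that gradients can flow back into the model.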

Submission history

From: Wei Ping
[v1] Mon, 27 May 2024 17:59:45 UTC (99 KB)
[v2] Thu, 9 Jan 2025 22:27:06 UTC (315 KB)

