Training Improved Text Embeddings with Large Language Models

Text embeddings are vector representations of words, sentences, paragraphs or documents that capture their semantic meaning. They serve as a core building block in many natural language processing (NLP) applications today, including information retrieval, question answering, semantic search and more.
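
As a minimal illustration of how embeddings are used, the sketch below compares vectors by cosine similarity, a common choice for semantic search. The three-dimensional vectors are made-up toy values, not outputs of any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values chosen for illustration only).
query = [0.9, 0.1, 0.0]
doc_relevant = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]

# A semantically closer document should score higher than an unrelated one.
print(cosine_similarity(query, doc_relevant) > cosine_similarity(query, doc_unrelated))
```

A retrieval system would rank all candidate documents by this score and return the top few.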

Recent advances in large language models (LLMs) like GPT-3 have shown impressive capabilities in few-shot learning and natural language generation. Can we leverage LLMs to also advance the state of text embeddings? In their paper “Improving Text Embeddings with Large Language Models”, researchers from Microsoft propose a novel method that achieves superior results by generating synthetic training data with LLMs and fine-tuning on it.

Challenges with Existing Methods

Traditional text embedding techniques like weighted averages of word vectors or TF-IDF fail to adequately capture the rich contextual information in text. Newer methods based on pre-trained language models like BERT obtain significantly better context-aware embeddings.
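
To see why averaging word vectors loses context, consider the toy sketch below. The two-dimensional word vectors are invented for illustration, and the average is unweighted for simplicity; a per-word weighting such as IDF would have the same blind spot, because the result does not depend on word order.

```python
# Toy 2-d word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def average_embedding(sentence):
    """Embed a sentence as the unweighted mean of its word vectors."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# The two sentences mean different things but get the same embedding.
print(a == b)
```

A context-aware encoder, by contrast, can assign different vectors to the two sentences.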

Nevertheless, they require complex multi-stage training pipelines:

  • Pre-train on billions of weakly labeled or synthetic text pairs
  • Fine-tune on limited hand-curated datasets

This demands massive compute resources and human effort for data collection. The training data can also be limited in diversity and language coverage. For instance, the BEIR benchmark comprises datasets for fewer than 15 retrieval tasks in English.

Existing methods predominantly use smaller BERT-style architectures as the backbone model. They are unable to take advantage of more advanced LLMs and related techniques.

Methodology: Synthetic Data Generation with LLMs

To overcome these limitations, the researchers propose a novel single-stage training approach that leverages LLMs like GPT-3.5 and GPT-4 to generate diverse synthetic training data.

The key steps are:

  1. Task Taxonomy: Define a taxonomy that categorizes text embedding tasks into:
    • Asymmetric tasks (query and document are not paraphrases, e.g. search)
    • Symmetric tasks (query and document are paraphrases, e.g. semantic similarity)
  2. Prompt Design: Create prompt templates tailored to each task type that guide the LLM to generate relevant training examples.
  3. Synthetic Data Generation: Prompt the LLM with the designed prompts to generate hundreds of thousands of (query, document) pairs covering a wide range of semantic tasks across 93 languages.
  4. Model Training: Fine-tune a powerful open-source LLM such as Mistral on the synthetic data using contrastive loss.
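
The contrastive loss in step 4 can be sketched as follows. This is a minimal pure-Python version of the widely used InfoNCE objective for a single query, not the authors' training code; the temperature value and the toy vectors are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax score of the positive.

    All vectors are assumed to be L2-normalized, so dot products are
    cosine similarities. `temperature` is an assumed hyperparameter.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy unit vectors (hypothetical): the positive is closest to the query.
q = [1.0, 0.0]
pos = [0.8, 0.6]
hard_neg = [0.6, 0.8]
rand_neg = [0.0, 1.0]

good = info_nce_loss(q, pos, [hard_neg, rand_neg])
bad = info_nce_loss(q, hard_neg, [pos, rand_neg])

# The loss is lower when the true positive is the closest candidate.
print(good < bad)
```

Minimizing this loss pulls the query toward its positive document and pushes it away from the negatives, which is what shapes the embedding space during fine-tuning.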

This method allows creating ample training data for diverse tasks in multiple languages without any human labeling effort. By leveraging the knowledge already embedded in LLMs through pre-training on web-scale corpora, we can synthesize high-quality data precisely tailored for text embeddings.

The researchers demonstrate this with a 2-step prompting strategy:

  • Prompt GPT-4 to suggest potential retrieval tasks
    (Figure: prompt for generating high-level retrieval tasks)
  • Prompt it again to generate (query, document) samples based on the suggested tasks
    (Figure: prompt to generate (query, positive, hard negative) triplets)
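
The two-step strategy can be sketched as a small pipeline. Everything here is hypothetical: the prompt wording is a paraphrase rather than the paper's template, the JSON keys are illustrative names, and `fake_llm` is a stand-in that returns canned text so the sketch runs offline. A real setup would pass in a function that calls an actual LLM.

```python
import json

def brainstorm_tasks(llm, n_tasks):
    """Step 1: ask the LLM for a JSON list of retrieval task descriptions."""
    prompt = (
        f"Brainstorm {n_tasks} potentially useful text retrieval tasks. "
        "Return a JSON list of strings, one task description per item."
    )
    return json.loads(llm(prompt))

def generate_triplet(llm, task):
    """Step 2: ask the LLM for one training example for the given task."""
    prompt = (
        f"You have been assigned a retrieval task: {task}\n"
        "Return a JSON object with the keys "
        '"user_query", "positive_document" and "hard_negative_document".'
    )
    example = json.loads(llm(prompt))
    return (
        example["user_query"],
        example["positive_document"],
        example["hard_negative_document"],
    )

def fake_llm(prompt):
    """Stand-in for a real LLM call (returns canned responses)."""
    if prompt.startswith("Brainstorm"):
        return json.dumps(["Find documentation that answers a coding question."])
    return json.dumps({
        "user_query": "how to reverse a list in python",
        "positive_document": "Use list.reverse() or slicing with [::-1].",
        "hard_negative_document": "Use sorted() to sort a list in Python.",
    })

tasks = brainstorm_tasks(fake_llm, n_tasks=1)
triplets = [generate_triplet(fake_llm, task) for task in tasks]
print(triplets[0][0])
```

Separating task brainstorming from example generation lets the first step drive task diversity while the second focuses on producing one well-formed example per task.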

Some key elements of the prompt design:

  • Natural language prompts for intuitive human-like instructions
  • Placeholders to encourage diversity (e.g. query length, clarity, document length)
  • Combining data from multiple templates for the same task type
  • Weighting languages based on resource availability
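
The placeholder and language-weighting ideas above can be sketched as follows. The placeholder options, the language weights and the template text are all invented for illustration and are not the values used in the paper.

```python
import random

# Hypothetical placeholder values and language weights, for illustration only.
PLACEHOLDERS = {
    "query_length": ["less than 5 words", "5 to 15 words", "at least 10 words"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "num_words": ["50", "100", "200", "300"],
}
LANGUAGE_WEIGHTS = {"English": 0.5, "Polish": 0.2, "Japanese": 0.2, "Italian": 0.1}

TEMPLATE = (
    "Task: {task}\n"
    "Write a query of {query_length} that is {clarity}, and a relevant "
    "document of about {num_words} words. Write both in {language}."
)

def build_prompt(task, rng):
    """Fill the template with randomly sampled placeholder values."""
    values = {name: rng.choice(options) for name, options in PLACEHOLDERS.items()}
    languages = list(LANGUAGE_WEIGHTS)
    weights = list(LANGUAGE_WEIGHTS.values())
    # Sample the language in proportion to its (assumed) weight.
    values["language"] = rng.choices(languages, weights=weights, k=1)[0]
    return TEMPLATE.format(task=task, **values)

rng = random.Random(0)  # fixed seed so runs are reproducible
task = "Retrieve recipes for a given ingredient."
prompts = [build_prompt(task, rng) for _ in range(3)]
for p in prompts:
    print(p, end="\n\n")
```

Because each call samples fresh placeholder values, repeated prompts for the same task request queries and documents of different lengths, clarity levels and languages.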

In total, they were able to generate 500k text embedding examples at a compute cost of 180M tokens. The dominant language was English (43%) followed by Polish, Japanese, Italian and others.

For model training, they opted for fine-tuning the open-source 7B parameter Mistral model instead of smaller BERT-style architectures. Since Mistral was already pre-trained on massive text corpora, no additional contrastive pre-training was needed. Adding it provided negligible improvements.

The whole fine-tuning took fewer than 1k steps, using a combination of synthetic and human-labeled data. This demonstrates the sample efficiency of the proposed approach.


The researchers evaluated their model on the MTEB benchmark, which covers diverse tasks across classification, clustering, semantic similarity, summarization and retrieval.

Their model outperformed the previous state of the art by 2.4 points in average score, setting new records for nearly every category:

Category                   Previous SOTA   Proposed Model
Classification             76.0            78.5
Clustering                 46.1            50.3
Pairwise Classification    87.1            88.3
Reranking                  60.0            60.2
Retrieval                  54.3            56.9
STS                        83.1            84.6
Summarization              31.6            31.4
Average                    64.2            66.6
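
The per-category gains can be checked with a few lines of arithmetic. The scores below are copied from the table above; the variable names are just illustrative.

```python
# Scores copied from the table above: (previous SOTA, proposed model).
scores = {
    "Classification": (76.0, 78.5),
    "Clustering": (46.1, 50.3),
    "Pairwise Classification": (87.1, 88.3),
    "Reranking": (60.0, 60.2),
    "Retrieval": (54.3, 56.9),
    "STS": (83.1, 84.6),
    "Summarization": (31.6, 31.4),
    "Average": (64.2, 66.6),
}

# Round to one decimal to match the precision of the reported scores.
deltas = {name: round(new - old, 1) for name, (old, new) in scores.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:<26}{delta:+.1f}")
print("Categories without a gain:", regressions)
```

Clustering shows the largest gain (+4.2), and summarization is the only category where the proposed model does not improve on the previous best.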

Remarkably, even when trained solely on synthetic data without any labeled data, the model achieved competitive performance, only 3.5 points behind the fully supervised model. This demonstrates the viability of training text embeddings using only LLM-generated data, without human annotation effort.

The researchers also evaluated on the multilingual MIRACL benchmark covering 18 languages. Their model outperformed the previous best on high-resource languages but was weaker on low-resource ones. They hypothesize this could be mitigated by pre-training LLMs more extensively on low-resource languages.

In summary, text embeddings trained on LLM-generated synthetic data establish new state-of-the-art results, while using simpler and more efficient training compared to prior multi-stage approaches. With further research into prompt engineering and synthetic data quality, this approach could greatly advance multilingual text embeddings.


This work offers several useful takeaways:

  • LLMs like GPT-3 and GPT-4 have an impressive ability to generate high-quality synthetic training data for diverse NLP tasks when prompted appropriately. This can reduce reliance on human-labeled data.
  • For text embeddings, contrastive pre-training provides negligible gains over just fine-tuning models like Mistral that already have trillion-scale pre-training. This is an important insight into training efficiency.
  • Retrieval-augmented generation methods are enabling LLMs to dynamically access external knowledge. Hence improving text embeddings is useful for enhancing these LLMs.
  • There is significant room for improvement in low-resource languages. Multilingual LLMs pre-trained on more representative data could help close this gap.
  • Conceptually, language modeling and text embeddings are two sides of the same coin: understanding language semantics. With synthetic data prompting, LLMs can be fine-tuned into embedders without complex pipelines.

Some promising directions for future work include:

  • Leveraging open-source LLMs like GPT-NeoX to generate synthetic data
  • Exploring lightweight post-training to adapt embedders to longer contexts
  • Developing prompt engineering techniques to control quality and task coverage
  • Methods to improve inference latency and storage costs for industrial usage

Beyond beating benchmarks, employing large language models to improve text embeddings opens up intriguing possibilities for the future. As LLMs continue to advance in their mastery of natural language, their aptitude for generating high-fidelity synthetic data is likely to improve as well.

Nevertheless, critical research directions remain to translate this potential into real-world impact.

Customization and Control

A key advantage of synthetic data is the ability to programmatically generate examples tailored to specific needs. As the paper demonstrated, prompt engineering allows creating training data for hundreds of thousands of embedding tasks.

Yet current prompt design practices remain more an art than a science. Developing systematic, reproducible methods to precisely control the properties of generated data would expand the applicability of this technique.

For instance, techniques to modulate aspects like the complexity, ambiguity and novelty of examples could help address robustness issues in downstream tasks. Dynamic prompt generation to match evolving real-world distributions is another open challenge.

Training at Scale

While pre-trained LLMs already encode substantial linguistic knowledge, their data generation skills are likely to improve further with additional scale. Models like GPT-4 trained on trillions of tokens of web text exhibit strong few-shot learning, but have not been optimized specifically for synthesizing training data.

Architectures and objectives tailored to bootstrapping self-supervised data generation at web scale could substantially advance the quality and efficiency of this technique. Efficient integration of retrieved knowledge to complement learned knowledge is another promising direction.

Multitask and Multilingual

As the paper noted, improving performance on low-resource languages remains a challenge. Rather than pre-training a single massive LLM, an alternative is to train a fleet of smaller expert models that specialize in particular data modalities or language domains.

Such an ensemble approach could help improve coverage over rare tasks and languages by sharing representations learned across experts. Continual learning to expand language and task expertise over time is also an exciting prospect.

In conclusion, this paper introduces an innovative approach: synthesizing training data from LLMs to create performant text embeddings. Its results demonstrate the effectiveness of this technique, outperforming the previous state of the art on standard benchmarks. As LLMs and synthetic data techniques progress, tapping into their knowledge to train embedders could become a highly promising direction.

