Text Embeddings: Comprehensive Guide
Evolution of Embeddings
Calculating embeddings
Distance between vectors
Visualising embeddings
Practical applications
Summary
Reference

As human beings, we can read and understand texts (at least some of them). Computers, in contrast, "think in numbers", so they can't automatically grasp the meaning of words and sentences. If we want computers to understand natural language, we need to convert this information into a format that computers can work with — vectors of numbers.

People learned how to convert texts into a machine-understandable format many years ago (one of the first versions was ASCII). Such an approach helps render and transfer texts but doesn't encode the meaning of the words. Back then, the standard search technique was a keyword search, when you were just looking for all the documents that contained specific words or N-grams.

Then, after many years, embeddings emerged. We can calculate embeddings for words, sentences, and even images. Embeddings are also vectors of numbers, but they can capture meaning. So, you can use them to do a semantic search and even work with documents in different languages.

In this article, I would like to dive deeper into the topic of embeddings and discuss all the details:

  • what preceded embeddings and how they evolved,
  • how to calculate embeddings using OpenAI tools,
  • how to define whether sentences are close to each other,
  • how to visualise embeddings,
  • the most exciting part — how to use embeddings in practice.

Let's move on and learn about the evolution of embeddings.

We'll start our journey with a brief tour into the history of text representations.

Bag of Words

The most basic approach to converting texts into vectors is a bag of words. Let's look at one of the famous quotes of Richard P. Feynman: "We are lucky to live in an age in which we are still making discoveries". We will use it to illustrate the bag of words approach.

The first step to get a bag of words vector is to split the text into words (tokens) and then reduce the words to their base forms. For example, "running" will transform into "run". This process is called stemming. We will use the NLTK Python package for it.

from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
# run nltk.download('punkt') once if the tokenizer data is missing

text = 'We are lucky to live in an age in which we are still making discoveries'

# tokenization - splitting text into words
words = word_tokenize(text)
print(words)
# ['We', 'are', 'lucky', 'to', 'live', 'in', 'an', 'age', 'in', 'which',
# 'we', 'are', 'still', 'making', 'discoveries']

stemmer = SnowballStemmer(language = "english")
stemmed_words = list(map(lambda x: stemmer.stem(x), words))
print(stemmed_words)
# ['we', 'are', 'lucki', 'to', 'live', 'in', 'an', 'age', 'in', 'which',
# 'we', 'are', 'still', 'make', 'discoveri']

Now, we have a list of the base forms of all our words. The next step is to calculate their frequencies to create a vector.

import collections
bag_of_words = collections.Counter(stemmed_words)
print(bag_of_words)
# Counter({'we': 2, 'are': 2, 'in': 2, 'lucki': 1, 'to': 1, 'live': 1,
# 'an': 1, 'age': 1, 'which': 1, 'still': 1, 'make': 1, 'discoveri': 1})

Actually, if we wanted to convert our text into a vector, we would have to take into account not only the words we have in the text but the whole vocabulary. Let's assume we also have "i", "you" and "study" in our vocabulary, and let's create a vector from Feynman's quote.

Graph by author
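Just to make this concrete, here is a minimal sketch of how such a vector could be assembled over a fixed vocabulary (the extra vocabulary words and variable names are assumptions for this illustration):

# full vocabulary: the stemmed words from the quote plus assumed extra words
# ('study' is stemmed to 'studi')
vocabulary = sorted(set(stemmed_words) | {'i', 'you', 'studi'})

# bag-of-words vector: one count per vocabulary word (0 for absent words)
bow_vector = [bag_of_words[word] for word in vocabulary]
print(dict(zip(vocabulary, bow_vector)))
# {'age': 1, 'an': 1, 'are': 2, ..., 'i': 0, 'studi': 0, 'you': 0}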

This approach is quite basic, and it doesn't take into account the semantic meaning of the words, so the sentences "the girl is studying data science" and "the young woman is learning AI and ML" won't be close to each other.

TF-IDF

A slightly improved version of the bag of words approach is TF-IDF (Term Frequency — Inverse Document Frequency). It's the multiplication of two metrics.

  • Term Frequency shows the frequency of the word in the document. The most common way to calculate it is to divide the raw count of the term in this document (as in the bag of words) by the total number of terms (words) in the document. However, there are many other approaches, such as raw count, boolean "frequencies", and different approaches to normalisation. You can learn more about the different approaches on Wikipedia.
  • Inverse Document Frequency denotes how much information the word provides. For example, the words "a" or "that" don't give you any additional information about the document's topic. In contrast, words like "ChatGPT" or "bioinformatics" can help you define the domain (but not for this sentence). It's calculated as the logarithm of the ratio of the total number of documents to those containing the word. The closer IDF is to 0, the more common the word is and the less information it provides.

So, in the end, we will get vectors where common words (like "I" or "you") have low weights, while rare words that occur in the document multiple times have higher weights. This strategy gives slightly better results, but it still can't capture semantic meaning.
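As a quick illustration of the idea (not the pipeline we use later in this article), here is a minimal sketch with sklearn's TfidfVectorizer on a tiny made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical mini-corpus of three documents, just for illustration
corpus = [
    'we are lucky to live in an age in which we are still making discoveries',
    'the girl is studying data science',
    'the young woman is learning AI and ML',
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(tfidf_matrix.shape)
# TF-IDF weights of the first document
print(dict(zip(vectorizer.get_feature_names_out(), tfidf_matrix.toarray()[0].round(2))))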

The other challenge with this approach is that it produces pretty sparse vectors. The length of the vectors equals the corpus size. There are about 470K unique words in English (source), so we will have huge vectors. Since a sentence won't have more than 50 unique words, 99.99% of the values in the vectors will be 0, not encoding any information. Looking at this, scientists started to think about dense vector representations.

Word2Vec

One of the most famous approaches to dense representation is word2vec, proposed by Google in 2013 in the paper "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al.

There are two different word2vec approaches mentioned in the paper: Continuous Bag of Words (when we predict the word based on the surrounding words) and Skip-gram (the opposite task — when we predict the context based on the word).

Figure from the paper by Mikolov et al. 2013 | source

The high-level idea of dense vector representation is to train two models: an encoder and a decoder. For example, in the case of skip-gram, we might pass the word "christmas" to the encoder. Then, the encoder will produce a vector that we pass to the decoder, expecting to get the words "merry", "to", and "you".

Scheme by author
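If you want to get a feeling for it yourself, here is a minimal sketch of training a skip-gram word2vec model with the gensim library (the toy corpus and the parameter values are illustrative assumptions, not the setup used in the paper):

from gensim.models import Word2Vec

# toy corpus of tokenised sentences (illustrative only)
sentences = [
    ['merry', 'christmas', 'to', 'you'],
    ['happy', 'new', 'year', 'to', 'you'],
    ['we', 'wish', 'you', 'a', 'merry', 'christmas'],
]

# sg=1 selects the skip-gram architecture; window defines the context size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv['christmas'].shape)         # dense 50-dimensional vector
print(model.wv.most_similar('christmas'))  # nearest words by cosine similarity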

This model started to take into account the meaning of words since it's trained on the context of the words. However, it ignores morphology (information we can get from word parts, for example, that "-less" means the lack of something). This drawback was addressed later by looking at subword skip-grams in FastText.

Also, word2vec was able to work only with words, but we would like to encode whole sentences. So, let's move on to the next evolutionary step with transformers.

Transformers and Sentence Embeddings

The next evolution was related to the transformer approach introduced in the "Attention Is All You Need" paper by Vaswani et al. Transformers were able to produce information-rich dense vectors and became the dominant technology for modern language models.

I won't cover the details of the transformer architecture since it's not so relevant to our topic and would take a lot of time. If you're interested in learning more, there are a lot of materials about transformers, for example, "Transformers, Explained" or "The Illustrated Transformer".

Transformers allow you to use the same "core" model and fine-tune it for different use cases without retraining the core model (which takes a lot of time and is quite costly). It led to the rise of pre-trained models. One of the first popular models was BERT (Bidirectional Encoder Representations from Transformers) by Google AI.

Internally, BERT still operates on a token level, similar to word2vec, but we still want to get sentence embeddings. So, the naive approach could be to take an average of all the tokens' vectors. Unfortunately, this approach doesn't show good performance.

This problem was solved in 2019 when Sentence-BERT was released. It outperformed all previous approaches on semantic textual similarity tasks and allowed the calculation of sentence embeddings.

It's a huge topic, so we won't be able to cover it all in this article. If you're really interested, you can learn more about sentence embeddings in this article.
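If you'd like to try Sentence-BERT style models yourself, a minimal sketch with the sentence-transformers library could look like this (the model name is an illustrative choice, not the model we use later in this article):

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is a popular lightweight Sentence-BERT style model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'the girl is studying data science',
    'the young woman is learning AI and ML',
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384) - one dense vector per sentence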

We've briefly covered the evolution of embeddings and got a high-level understanding of the theory. Now, it's time to move on to practice and learn how to calculate embeddings using OpenAI tools.

In this article, we will be using OpenAI embeddings. We will try the new model text-embedding-3-small that was released just recently. The new model shows better performance compared to text-embedding-ada-002:

  • The average score on a widely used multi-language retrieval (MIRACL) benchmark has risen from 31.4% to 44.0%.
  • The average performance on a frequently used benchmark for English tasks (MTEB) has also improved, rising from 61.0% to 62.3%.

OpenAI also released a new, larger model, text-embedding-3-large. It's now their best-performing embedding model.

As a data source, we will be working with a small sample of the Stack Exchange Data Dump — an anonymised dump of all user-contributed content on the Stack Exchange network. I've chosen a bunch of topics that look interesting to me and sampled 100 questions from each of them. Topics range from Generative AI to coffee or bicycles, so we will see quite a wide variety of topics.

First, we need to calculate embeddings for all our Stack Exchange questions. It's worth doing it once and storing the results locally (in a file or vector storage). We can generate embeddings using the OpenAI Python package.

from openai import OpenAI
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    # replace newlines, which can negatively affect performance
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model = model).data[0].embedding

get_embedding("We are lucky to live in an age in which we are still making discoveries.")

As a result, we get a 1536-dimensional vector of float numbers. We can now repeat this for all our data and start analysing the values.

The first question you might have is how close the sentences are to each other in meaning. To uncover the answer, let's discuss the concept of distance between vectors.

Embeddings are actually vectors. So, if we want to understand how close two sentences are to each other, we can calculate the distance between their vectors. A smaller distance corresponds to a closer semantic meaning.

Different metrics can be used to measure the distance between two vectors:

  • Euclidean distance (L2),
  • Manhattan distance (L1),
  • Dot product,
  • Cosine distance.

Let's discuss them. As a simple example, we will be using two 2D vectors.

vector1 = [1, 4]
vector2 = [2, 2]

Euclidean distance (L2)

The most standard way to define the distance between two points (or vectors) is Euclidean distance, or the L2 norm. This metric is the most commonly used in day-to-day life, for example, when we are talking about the distance between two towns.

Here's a visual representation and the formula for L2 distance.

Image by author

We can calculate this metric using vanilla Python or by leveraging the numpy function.

import numpy as np

sum(list(map(lambda x, y: (x - y) ** 2, vector1, vector2))) ** 0.5
# 2.2361

np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 2)
# 2.2361

Manhattan distance (L1)

The other commonly used distance is the L1 norm, or Manhattan distance. This distance was named after the island of Manhattan (New York). This island has a grid layout of streets, and the shortest route between two points in Manhattan is the L1 distance, since you need to follow the grid.

Image by author

We can also implement it from scratch or use the numpy function.

sum(list(map(lambda x, y: abs(x - y), vector1, vector2)))
# 3

np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 1)
# 3.0

Dot product

Another way to look at the distance between vectors is to calculate the dot (or scalar) product. Here's the formula, and we can easily implement it.

Image by author

sum(list(map(lambda x, y: x*y, vector1, vector2)))
# 10

np.dot(vector1, vector2)
# 10

This metric is a bit tricky to interpret. On the one hand, it shows you whether the vectors are pointing in one direction. On the other hand, the results highly depend on the magnitudes of the vectors. For example, let's calculate the dot products between two pairs of vectors:

  • (1, 1) vs (1, 1)
  • (1, 1) vs (10, 10).

In both cases, the vectors are collinear, but the dot product is ten times bigger in the second case: 2 vs 20.
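Here's a quick numpy check of these two pairs:

import numpy as np

# collinear vectors: same direction, different magnitudes
print(np.dot([1, 1], [1, 1]))    # 2
print(np.dot([1, 1], [10, 10]))  # 20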

Cosine similarity

Very often, cosine similarity is used. Cosine similarity is the dot product normalised by the vectors' magnitudes (or norms).

Image by author

We can either calculate everything ourselves (as above) or use the function from sklearn.

dot_product = sum(list(map(lambda x, y: x*y, vector1, vector2)))
norm_vector1 = sum(list(map(lambda x: x ** 2, vector1))) ** 0.5
norm_vector2 = sum(list(map(lambda x: x ** 2, vector2))) ** 0.5

dot_product/norm_vector1/norm_vector2

# 0.8575

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(
    np.array(vector1).reshape(1, -1),
    np.array(vector2).reshape(1, -1))[0][0]

# 0.8575

The function cosine_similarity expects 2D arrays. That's why we need to reshape the numpy arrays.

Let's talk a bit about the physical meaning of this metric. Cosine similarity is equal to the cosine of the angle between the two vectors. The closer the vectors are, the higher the metric value.

Image by author

We can even calculate the exact angle between our vectors in degrees. We get a result of around 30 degrees, and it looks pretty reasonable.

import math
math.degrees(math.acos(0.8575))

# 30.96

What metric to use?

We've discussed different ways to calculate the distance between two vectors, and you might start thinking about which one to use.

You can use any distance to compare the embeddings you have. For example, I calculated the average distances between the different clusters. Both L2 distance and cosine similarity show us similar pictures:

  • Objects within a cluster are closer to each other than to other clusters. It's a bit tricky to interpret our results since for L2 distance, closer means a lower distance, while for cosine similarity, the metric is higher for closer objects. Don't get confused.
  • We can spot that some topics are really close to each other, for example, "politics" and "economics" or "ai" and "datascience".
Image by author
Image by author

However, for NLP tasks, the best practice is usually to use cosine similarity. Some reasons behind it:

  • Cosine similarity is between -1 and 1, while L1 and L2 are unbounded, so it's easier to interpret.
  • From a practical perspective, it's more efficient to calculate dot products than square roots for Euclidean distance.
  • Cosine similarity is less affected by the curse of dimensionality (we will discuss it in a second).

OpenAI embeddings are already normed, so the dot product and cosine similarity are equal in this case.
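You can verify this with a quick sanity check (a sketch reusing the get_embedding function defined above):

import numpy as np

# the norm of an OpenAI embedding should be ~1.0,
# so dot product and cosine similarity coincide
embedding = get_embedding("We are lucky to live in an age in which we are still making discoveries.")
print(np.linalg.norm(embedding))  # ~1.0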

You might notice in the results above that the difference between inter- and intra-cluster distances is not so big. The root cause is the high dimensionality of our vectors. This effect is called "the curse of dimensionality": the higher the dimension, the narrower the distribution of distances between vectors. You can learn more about it in this article.

I would like to briefly show you how it works so that you get some intuition. I calculated the distribution of OpenAI embedding values and generated sets of 300 vectors with different dimensionalities. Then, I calculated the distances between all the vectors and drew a histogram. You can easily see that the increase in vector dimensionality makes the distribution narrower.

Graph by author
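If you'd like to reproduce this intuition, here is a simplified sketch: it uses random unit vectors instead of the empirical distribution of OpenAI embedding values, so the exact shapes will differ from the graph above, but the narrowing effect is the same.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# for each dimensionality, generate 300 random unit vectors and plot
# the distribution of pairwise L2 distances
for dim in [2, 10, 100, 1536]:
    vectors = rng.normal(size=(300, dim))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = [np.linalg.norm(vectors[i] - vectors[j])
                 for i in range(300) for j in range(i + 1, 300)]
    plt.hist(distances, bins=50, alpha=0.5, density=True, label=f'dim = {dim}')

plt.legend()
plt.title('Pairwise distances between random unit vectors')
plt.show()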

We've learned how to measure the similarity between embeddings. With that, we've finished the theoretical part and are moving to the more practical part (visualisations and practical applications). Let's start with visualisations, since it's always better to see your data first.

The best way to understand data is to visualise it. Unfortunately, embeddings have 1536 dimensions, so it's pretty challenging to look at the data. However, there's a way: we could use dimensionality reduction techniques to project the vectors onto a two-dimensional space.

PCA

The most basic dimensionality reduction technique is PCA (Principal Component Analysis). Let's try to use it.

First, we need to convert our embeddings into a 2D numpy array to pass it to sklearn.

import numpy as np
embeddings_array = np.array(df.embedding.values.tolist())
print(embeddings_array.shape)
# (1400, 1536)

Then, we need to initialise a PCA model with n_components = 2 (because we want to create a 2D visualisation), train the model on the whole dataset and predict new values.

from sklearn.decomposition import PCA

pca_model = PCA(n_components = 2)
pca_model.fit(embeddings_array)

pca_embeddings_values = pca_model.transform(embeddings_array)
print(pca_embeddings_values.shape)
# (1400, 2)

As a result, we get a matrix with just two features for each question, so we can easily visualise it on a scatter plot.

import plotly
import plotly.express as px

fig = px.scatter(
    x = pca_embeddings_values[:,0],
    y = pca_embeddings_values[:,1],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 'PCA embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)

fig.update_layout(
    xaxis_title = 'first component',
    yaxis_title = 'second component')
fig.show()

Image by author

We can see that questions from each topic are pretty close to each other, which is good. However, all the clusters are mixed, so there's room for improvement.

t-SNE

PCA is a linear algorithm, while most of the relations in real life are non-linear. So, we may not be able to separate the clusters because of non-linearity. Let's try to use a non-linear algorithm, t-SNE, and see whether it will be able to show better results.

The code is almost identical. I just used the t-SNE model instead of PCA.

from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, random_state=42)
tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)

fig = px.scatter(
    x = tsne_embeddings_values[:,0],
    y = tsne_embeddings_values[:,1],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 't-SNE embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)

fig.update_layout(
    xaxis_title = 'first component',
    yaxis_title = 'second component')
fig.show()

The t-SNE result looks way better. Most of the clusters are separated except for "genai", "datascience" and "ai". However, it's pretty expected — I doubt I could separate these topics myself.

Looking at this visualisation, we can see that embeddings are pretty good at encoding semantic meaning.

Also, you can make a projection onto a three-dimensional space and visualise it. I'm not sure whether it would be practical, but it can be insightful and engaging to play with the data in 3D.

tsne_model_3d = TSNE(n_components=3, random_state=42)
tsne_3d_embeddings_values = tsne_model_3d.fit_transform(embeddings_array)

fig = px.scatter_3d(
    x = tsne_3d_embeddings_values[:,0],
    y = tsne_3d_embeddings_values[:,1],
    z = tsne_3d_embeddings_values[:,2],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 't-SNE embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r,
    opacity = 0.7
)
fig.update_layout(xaxis_title = 'first component', yaxis_title = 'second component')
fig.show()

Barcodes

Another way to understand the embeddings is to visualise a few of them as bar codes and see the correlations. I picked three examples of embeddings: two are closest to each other, and the other one is the farthest example in our dataset.

embedding1 = df.loc[1].embedding
embedding2 = df.loc[616].embedding
embedding3 = df.loc[749].embedding

import seaborn as sns
import matplotlib.pyplot as plt
embed_len_thr = 1536

sns.heatmap(np.array(embedding1[:embed_len_thr]).reshape(-1, embed_len_thr),
            cmap = "Greys", center = 0, square = False,
            xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15, 1)
plt.yticks([0.5], labels = ['AI'])
plt.show()

sns.heatmap(np.array(embedding3[:embed_len_thr]).reshape(-1, embed_len_thr),
            cmap = "Greys", center = 0, square = False,
            xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15, 1)
plt.yticks([0.5], labels = ['AI'])
plt.show()

sns.heatmap(np.array(embedding2[:embed_len_thr]).reshape(-1, embed_len_thr),
            cmap = "Greys", center = 0, square = False,
            xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15, 1)
plt.yticks([0.5], labels = ['Bioinformatics'])
plt.show()

Graph by author

It's tough to see whether the vectors are close to each other in our case because of the high dimensionality. However, I still like this visualisation. It might be helpful in some cases, so I'm sharing the idea with you.

We've learned how to visualise embeddings and have no doubts left about their ability to grasp the meaning of the text. Now, it's time to move on to the most interesting and engaging part and discuss how you can leverage embeddings in practice.

Of course, embeddings' primary goal is not to encode texts as vectors of numbers or visualise them just for the sake of it. We can benefit a lot from our ability to capture the texts' meanings. Let's go through a bunch of more practical examples.

Clustering

Let's start with clustering. Clustering is an unsupervised learning technique that allows you to split your data into groups without any initial labels. Clustering can help you understand the internal structural patterns in your data.

We will use one of the most basic clustering algorithms — K-means. For the K-means algorithm, we need to specify the number of clusters. We can define the optimal number of clusters using silhouette scores.

Let's try k (the number of clusters) between 2 and 50. For each k, we will train a model and calculate silhouette scores. The higher the silhouette score, the better the clustering.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd
import tqdm

silhouette_scores = []
for k in tqdm.tqdm(range(2, 51)):
    kmeans = KMeans(n_clusters=k,
                    random_state=42,
                    n_init = 'auto').fit(embeddings_array)
    kmeans_labels = kmeans.labels_
    silhouette_scores.append(
        {
            'k': k,
            'silhouette_score': silhouette_score(embeddings_array,
                kmeans_labels, metric = 'cosine')
        }
    )

fig = px.line(pd.DataFrame(silhouette_scores).set_index('k'),
              title = 'Silhouette scores for K-means clustering',
              labels = {'value': 'silhouette score'},
              color_discrete_sequence = plotly.colors.qualitative.Alphabet)
fig.update_layout(showlegend = False)

In our case, the silhouette score reaches a maximum at k = 11. So, let's use this number of clusters for our final model.

Graph by author

Let's visualise the clusters using t-SNE for dimensionality reduction, as we already did before.
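One caveat: after the loop above, kmeans_labels corresponds to the last k tried (50), so we first refit K-means with the chosen k = 11 — a small step assumed by the listing below.

# refit the final model with the optimal number of clusters
kmeans = KMeans(n_clusters=11, random_state=42, n_init='auto').fit(embeddings_array)
kmeans_labels = kmeans.labels_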

tsne_model = TSNE(n_components=2, random_state=42)
tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)

fig = px.scatter(
    x = tsne_embeddings_values[:,0],
    y = tsne_embeddings_values[:,1],
    color = list(map(lambda x: 'cluster %s' % x, kmeans_labels)),
    hover_name = df.full_text.values,
    title = 't-SNE embeddings for clustering', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)
fig.update_layout(
    xaxis_title = 'first component',
    yaxis_title = 'second component')
fig.show()

Visually, we can see that the algorithm was able to define the clusters quite well — they are separated pretty well.

We have the actual topic labels, so we can even assess how good the clusterisation is. Let's look at the topic mixture for each cluster.

df['cluster'] = list(map(lambda x: 'cluster %s' % x, kmeans_labels))
cluster_stats_df = df.reset_index().pivot_table(
    index = 'cluster', values = 'id',
    aggfunc = 'count', columns = 'topic').fillna(0).applymap(int)

cluster_stats_df = cluster_stats_df.apply(
    lambda x: 100*x/cluster_stats_df.sum(axis = 1))

fig = px.imshow(
    cluster_stats_df.values,
    x = cluster_stats_df.columns,
    y = cluster_stats_df.index,
    text_auto = '.2f', aspect = "auto",
    labels=dict(x="fact topic", y="cluster", color="share, %"),
    color_continuous_scale='pubugn',
    title = 'Share of topics in each cluster', height = 550)

fig.show()

Generally, the clusterisation worked well. For example, cluster 5 contains almost only questions about bicycles, while cluster 6 is about coffee. However, it wasn't able to distinguish close topics:

  • "ai", "genai" and "datascience" are all in one cluster,
  • the same story with "economics" and "politics".

We used only embeddings as the features in this example, but if you have any additional information (for example, the age, gender or country of the user who asked the question), you can include it in the model, too.

Classification

We can use embeddings for classification or regression tasks. For example, you can do it to predict the sentiment of customer reviews (classification) or NPS score (regression).

Since classification and regression are supervised learning, you need to have labels. Luckily, we know the topics for our questions and can fit a model to predict them.

I will use a Random Forest Classifier. If you need a quick refresher on Random Forests, you can find it here. To assess the classification model's performance correctly, we will split our dataset into train and test sets (80% vs 20%). Then, we can train our model on the train set and measure the quality on the test set (questions that the model hasn't seen before).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class_model = RandomForestClassifier(max_depth = 10)

# defining features and target
X = embeddings_array
y = df.topic

# splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 42, test_size=0.2, stratify=y
)

# fit & predict
class_model.fit(X_train, y_train)
y_pred = class_model.predict(X_test)

To estimate the model's performance, let's calculate a confusion matrix. In an ideal situation, all non-diagonal elements should be 0.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

fig = px.imshow(
    cm, x = class_model.classes_,
    y = class_model.classes_, text_auto='d',
    aspect="auto",
    labels=dict(
        x="predicted label", y="true label",
        color="cases"),
    color_continuous_scale='pubugn',
    title = 'Confusion matrix', height = 550)

fig.show()

We can see results similar to the clusterisation: some topics are easy to classify with 100% accuracy, for example, "bicycles" or "travel", while some others are difficult to distinguish (especially "ai").

However, we achieved 91.8% overall accuracy, which is quite good.
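For reference, the overall accuracy can be computed like this (the exact number depends on your data sample):

from sklearn.metrics import accuracy_score

# share of correctly predicted topics on the test set
print(accuracy_score(y_test, y_pred))
# ~0.918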

Finding anomalies

We can also use embeddings to find anomalies in our data. For example, on the t-SNE graph, we saw that some questions are pretty far away from their clusters, for instance, for the "travel" topic. Let's look at this topic and try to find anomalies. We will use the Isolation Forest algorithm for it.

from sklearn.ensemble import IsolationForest

# .copy() avoids pandas warnings when adding a column to the filtered frame
topic_df = df[df.topic == 'travel'].copy()
topic_embeddings_array = np.array(topic_df.embedding.values.tolist())

clf = IsolationForest(contamination = 0.03, random_state = 42)
topic_df['is_anomaly'] = clf.fit_predict(topic_embeddings_array)

topic_df[topic_df.is_anomaly == -1][['full_text']]

So, here we are. We've found the most unusual comment for the travel topic (source).

Is it safe to drink the water from the fountains found throughout
the older parts of Rome?

When I visited Rome and walked around the older sections, I saw many
different types of fountains that were constantly running with water.
Some went into the ground, some collected in basins, etc.

Is the water coming out of these fountains potable? Safe for visitors
to drink from? Any etiquette regarding their use that a visitor
should know about?

Since it talks about water, the embedding of this comment is close to the coffee topic, where people also discuss the water used to brew coffee. So, the embedding representation is quite reasonable.

We can find it on our t-SNE visualisation and see that it's actually close to the coffee cluster.

Graph by author

RAG — Retrieval Augmented Generation

With the recently increased popularity of LLMs, embeddings have been broadly used in RAG use cases.

We need Retrieval Augmented Generation when we have a lot of documents (for example, all the questions from Stack Exchange) and we can't pass them all to an LLM, because:

  • LLMs have limits on the context size (right now, it's 128K for GPT-4 Turbo).
  • We pay for tokens, so it's more expensive to pass all the information all the time.
  • LLMs show worse performance with a bigger context. You can check Needle In A Haystack — Pressure Testing LLMs to learn more details.

To be able to work with an extensive knowledge base, we can leverage the RAG approach:

  • Compute embeddings for all the documents and store them in vector storage.
  • When we get a user request, we can calculate its embedding and retrieve the relevant documents from the storage for this request (see the sketch after this list).
  • Pass only the relevant documents to the LLM to get the final answer.
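Here is a minimal sketch of the retrieval step, assuming the document embeddings live in our dataframe and reusing the get_embedding function from above (the retrieve helper and its parameters are illustrative, not a specific library API):

import numpy as np

def retrieve(query, doc_embeddings, documents, top_k=3):
    # embed the query and rank documents by cosine similarity
    query_embedding = np.array(get_embedding(query))
    doc_matrix = np.array(doc_embeddings)
    similarities = doc_matrix @ query_embedding / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_embedding))
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents[i] for i in top_indices]

# relevant_docs = retrieve('how to brew pour-over coffee',
#                          df.embedding.tolist(), df.full_text.tolist())
# the retrieved documents are then passed to the LLM as context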

To learn more about RAG, don't hesitate to read my article with much more detail here.

In this article, we've discussed text embeddings in much detail. Hopefully, now you have a complete and deep understanding of this topic. Here's a quick recap of our journey:

  • First, we went through the evolution of approaches to working with texts.
  • Then, we discussed how to understand whether texts have similar meanings to each other.
  • After that, we saw different approaches to text embedding visualisation.
  • Finally, we tried to use embeddings as features in different practical tasks such as clustering, classification, anomaly detection and RAG.

Thank you a lot for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

In this article, I used a dataset from the Stack Exchange Data Dump, which is available under the Creative Commons license.

This article was inspired by the following courses:
