
Semantic Search with PostgreSQL and OpenAI Embeddings

Image by Igor Omilaev on Unsplash

Implementing semantic search inside corporate databases can be difficult and require significant effort. But does it have to be that way? In this article, I demonstrate how you can use PostgreSQL together with OpenAI Embeddings to implement semantic search on your data. If you prefer not to use the OpenAI Embeddings API, I'll give you links to free embedding models.

At a very high level, vector databases combined with LLMs let you run semantic search over available data (stored in databases, documents, etc.). Thanks to the "Efficient Estimation of Word Representations in Vector Space" paper (also known as the "Word2Vec paper"), co-authored by the legendary Jeff Dean, we know how to represent words as real-valued vectors. Word embeddings are dense vector representations of words in a vector space where words with similar meanings are closer to one another. Word embeddings capture semantic relationships between words, and there are several ways to create them.
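To make "closer in vector space" concrete, here is a minimal sketch of cosine similarity, the distance measure used throughout this article. The three-dimensional vectors below are made up purely for illustration; real embedding models produce hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- purely illustrative values.
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, kitten))  # high similarity: close in vector space
print(cosine_similarity(cat, car))     # lower similarity: far apart in vector space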

Image by the author

Let's get practical and use OpenAI's text-embedding-ada-002 model! The choice of distance function typically doesn't matter much; OpenAI recommends cosine similarity. If you don't want to use OpenAI embeddings and prefer running a different model locally instead of making API calls, I suggest considering one of the SentenceTransformers pretrained models (see the sketch after the code block below). Choose your model wisely.

import os

import openai
from openai.embeddings_utils import cosine_similarity

openai.api_key = os.getenv("OPENAI_API_KEY")

def get_embedding(text: str) -> list:
    # Request an embedding for the given text from the OpenAI API.
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

good_ride = "good ride"
good_ride_embedding = get_embedding(good_ride)
print(good_ride_embedding)
# [0.0010935445316135883, -0.01159335020929575, 0.014949149452149868, -0.029251709580421448, -0.022591838613152504, 0.006514389533549547, -0.014793967828154564, -0.048364896327257156, -0.006336577236652374, -0.027027441188693047, ...]
len(good_ride_embedding)
# 1536
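For comparison, here is a minimal sketch of the local alternative mentioned above. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (one choice among many pretrained models), which produces 384-dimensional vectors instead of 1536.

from sentence_transformers import SentenceTransformer, util

# Download and run a small pretrained model locally; no API key required.
model = SentenceTransformer("all-MiniLM-L6-v2")

local_embedding = model.encode("good ride")
print(len(local_embedding))
# 384 (this model's dimensionality)

print(util.cos_sim(local_embedding, model.encode("I really enjoyed the trip!")))

If you go this route, remember that the vector column created later needs to match the model's dimensionality (384 rather than 1536).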

Now that we have developed an understanding of what an embedding is, let's use it to sort some reviews.

good_ride_review_1 = "I actually enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop."
good_ride_review_1_embedding = get_embedding(good_ride_review_1)
cosine_similarity(good_ride_review_1_embedding, good_ride_embedding)
# 0.8300454513797334

good_ride_review_2 = "The drive was exceptionally comfortable. I felt secure throughout the journey and greatly appreciated the on-board entertainment, which allowed me to have some fun while the automobile was in motion."
good_ride_review_2_embedding = get_embedding(good_ride_review_2)
cosine_similarity(good_ride_review_2_embedding, good_ride_embedding)
# 0.821774476808789

bad_ride_review = "A sudden hard brake on the intersection really caught me off guard and stressed me out. I wasn't prepared for it. Moreover, I noticed some trash left within the cabin from a previous rider."
bad_ride_review_embedding = get_embedding(bad_ride_review)
cosine_similarity(bad_ride_review_embedding, good_ride_embedding)
# 0.7950041130579355

While the absolute difference may appear small, consider a sorting function working over many thousands of reviews. In such cases, we can prioritize surfacing only the positive ones at the top.
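As a rough sketch of that idea, reusing get_embedding and cosine_similarity from above (the reviews list here contains just our three examples, but it could be any collection):

reviews = [good_ride_review_1, good_ride_review_2, bad_ride_review]

# Embed each review once, score it against the "good ride" embedding, and sort descending.
scored = [
    (review, cosine_similarity(get_embedding(review), good_ride_embedding))
    for review in reviews
]
scored.sort(key=lambda pair: pair[1], reverse=True)

for review, score in scored:
    print(f"{score:.3f}  {review[:60]}")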

Once a word or a document has been transformed into an embedding, it can be stored in a database. Storing embeddings, however, doesn't automatically make the database a vector database. It's only when the database supports fast operations on vectors that we can rightfully call it a vector database.

There are many commercial and open-source vector databases, which makes this a highly discussed topic. I'll demonstrate how a vector database works using pgvector, an open-source PostgreSQL extension that enables vector similarity search in arguably the most popular database.

Let’s run the PostgreSQL container with pgvector:

docker pull ankane/pgvector

docker run --env "POSTGRES_PASSWORD=postgres" --name "postgres-with-pgvector" --publish 5432:5432 --detach ankane/pgvector

Let's start pgcli to connect to the database (pgcli postgres://postgres:postgres@localhost:5432), create a table, insert the embeddings we computed above, and then select similar items:

-- Enable pgvector extension.
CREATE EXTENSION vector;

-- Create a vector column with 1536 dimensions.
-- The `text-embedding-ada-002` model has 1536 dimensions.
CREATE TABLE reviews (text TEXT, embedding vector(1536));

-- Insert the three reviews from above. The embedding values are truncated here for brevity.
INSERT INTO reviews (text, embedding) VALUES ('I actually enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop.', '[-0.00533589581027627, -0.01026702206581831, 0.021472081542015076, -0.04132508486509323, ...');
INSERT INTO reviews (text, embedding) VALUES ('The drive was exceptionally comfortable. I felt secure throughout the journey and greatly appreciated the on-board entertainment, which allowed me to have some fun while the automobile was in motion.', '[0.0001858668401837349, -0.004922827705740929, 0.012813017703592777, -0.041855424642562866, ...');
INSERT INTO reviews (text, embedding) VALUES ('A sudden hard brake on the intersection really caught me off guard and stressed me out. I was not prepared for it. Moreover, I noticed some trash left within the cabin from a previous rider.', '[0.00191772251855582, -0.004589076619595289, 0.004269456025213003, -0.0225954819470644, ...');

-- sanity check
select count(1) from reviews;
-- +-------+
-- | count |
-- |-------|
-- | 3 |
-- +-------+

We are now ready to search for similar documents. I have shortened the embedding for "good ride" again because printing 1536 dimensions is excessive.

--- The embedding we use here is for "good ride"
SELECT substring(text, 0, 80) FROM reviews ORDER BY embedding <-> '[0.0010935445316135883, -0.01159335020929575, 0.014949149452149868, -0.029251709580421448, ...';

-- +--------------------------------------------------------------------------+
-- | substring |
-- |--------------------------------------------------------------------------|
-- | I actually enjoyed the trip! The ride was incredibly smooth, the pick-u... |
-- | The drive was exceptionally comfortable. I felt secure throughout the... |
-- | A sudden hard brake on the intersection really caught me off guard an... |
-- +--------------------------------------------------------------------------+
SELECT 3
Time: 0.024s
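The same workflow can also be driven from application code. Below is a rough sketch using psycopg2; the connection string, sample review, and to_vector_literal helper are assumptions for illustration. Note that pgvector's <-> operator computes Euclidean distance, while <=> computes cosine distance, which matches OpenAI's recommendation.

import psycopg2

def to_vector_literal(embedding: list) -> str:
    # pgvector accepts a bracketed, comma-separated list as its text representation.
    return "[" + ",".join(str(x) for x in embedding) + "]"

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")

with conn.cursor() as cur:
    # Store one more review together with its embedding.
    review = "Smooth trip and a friendly driver."
    cur.execute(
        "INSERT INTO reviews (text, embedding) VALUES (%s, %s)",
        (review, to_vector_literal(get_embedding(review))),
    )
    # Fetch the three reviews closest to "good ride" by cosine distance.
    cur.execute(
        "SELECT text FROM reviews ORDER BY embedding <=> %s LIMIT 3",
        (to_vector_literal(get_embedding("good ride")),),
    )
    for (text,) in cur.fetchall():
        print(text[:80])

conn.commit()

For larger tables, pgvector also supports approximate indexes such as ivfflat, which keep these queries fast as the number of rows grows.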

Done! As you can see, we have computed embeddings for several documents, stored them in the database, and run vector similarity searches. The potential applications are vast, ranging from corporate search to features in medical record systems for identifying patients with similar symptoms. Moreover, this method isn't restricted to text; similarity can also be calculated for other kinds of data such as sound, video, and images.

Enjoy!
