Getting Started with Weaviate: A Beginner’s Guide to Search with Vector Databases
What Is Weaviate?
Prerequisites
Setup
How to Create and Populate a Weaviate Vector Database
How to Query the Weaviate Vector Database
Summary
Enjoyed This Story?
References

How to use vector databases for semantic search, question answering, and generative search in Python with OpenAI and Weaviate

Towards Data Science

If you landed on this article, I assume you have been playing around with building an app with a large language model (LLM) and came across the term vector database.

The tool landscape around building apps with LLMs is growing rapidly, with tools such as LangChain or LlamaIndex gaining popularity.

In a recent article, I described how to get started with LangChain, and in this article, I want to continue exploring the LLM tool landscape by playing around with Weaviate.

Weaviate is an open-source vector database. It lets you store data objects and vector embeddings and query them based on similarity measures.

Vector databases have been getting a lot of attention since the rise of media attention on LLMs. Probably the most popular use case of vector databases in the context of LLMs is to “provide LLMs with long-term memory”.

If you need a refresher on the concept of vector databases, have a look at my previous article:

In this tutorial, we will walk through how to populate a Weaviate vector database with embeddings of your dataset. Then we will go over three different ways you can retrieve information from it: semantic search, question answering, and generative search.

To follow along with this tutorial, you will need the following:

  • Python 3 environment
  • OpenAI API key (or alternatively, an API key for Hugging Face, Cohere, or PaLM)

A note on the API key: In this tutorial, we will generate embeddings from text via an inference service (in this case, OpenAI). Depending on which inference service you use, make sure to check the provider’s pricing page to avoid unexpected costs. E.g., the used Ada model (version 2) costs $0.0001 per 1,000 tokens at the time of writing and resulted in less than 1 cent in inference costs for this tutorial.

You can run Weaviate either on your own instances (using Docker, Kubernetes, or Embedded Weaviate) or as a managed service using Weaviate Cloud Services (WCS). For this tutorial, we will run a Weaviate instance with WCS, as this is the recommended and easiest way.

How to Create a Cluster with Weaviate Cloud Services (WCS)

To be able to use the service, you first need to register with WCS.

Once you are registered, you can create a new Weaviate Cluster by clicking the “Create cluster” button.

Screenshot of Weaviate Cloud Services

For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days. (You won’t have to add any payment information. Instead, the sandbox simply expires after the trial period. But you can create a new free trial sandbox anytime.)

Under the “Free sandbox” tab, configure the following settings:

  1. Enter a cluster name
  2. Enable Authentication (set to “YES”)
Screenshot of Weaviate Cloud Services plans

Finally, click “Create” to create your sandbox instance.

How to Install Weaviate in Python

Last but not least, add the weaviate-client to your Python environment with pip

$ pip install weaviate-client

and import the library:

import weaviate

How To Access a Weaviate Cluster Through a Client

For the next step, you will need the following two pieces of information to access your cluster:

  • The cluster URL
  • Weaviate API key (under “Enabled — Authentication”)
Screenshot of Weaviate Cloud Services sandbox

Now, you can instantiate a Weaviate client to access your Weaviate cluster as follows.

auth_config = weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY")  # Replace w/ your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="https://your-cluster-url.weaviate.network",  # Replace w/ your Weaviate cluster URL
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",  # Replace w/ your OpenAI key
    }
)

As you can see, we are using the OpenAI API key under additional_headers to access the embedding model later. If you are using a different provider than OpenAI, change the key parameter to whichever of the following applies: X-Cohere-Api-Key, X-HuggingFace-Api-Key, or X-Palm-Api-Key.

To check if everything is set up correctly, run:

client.is_ready()

If it returns True, you are all set for the next steps.

How to Create and Populate a Weaviate Vector Database

Now, we are ready to create a vector database in Weaviate and populate it with some data.

For this tutorial, we will use the first 100 rows of the 200,000+ Jeopardy Questions dataset [1] from Kaggle.

import pandas as pd

df = pd.read_csv("your_file_path.csv", nrows=100)

First few rows of the 200,000+ Jeopardy Questions dataset [1] from Kaggle.

A note on the number of tokens and related costs: In the following example, we will embed the columns “category”, “question”, and “answer” for the first 100 rows. Based on a calculation with the tiktoken library, this will result in roughly 3,000 tokens to embed, which amounts to roughly $0.0003 in inference costs with OpenAI’s Ada model (version 2) as of July 2023.
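If you want a quick back-of-the-envelope estimate before calling the API, the sketch below uses the common rule of thumb of roughly 4 characters per token for English text (use tiktoken when you need an exact count). The function name and the sample row are for illustration only:

```python
def estimate_embedding_cost(texts, usd_per_1k_tokens=0.0001):
    """Rough estimate: English text averages ~4 characters per token.
    Use the tiktoken library when you need an exact count."""
    est_tokens = sum(len(t) for t in texts) / 4
    est_cost = est_tokens / 1000 * usd_per_1k_tokens
    return est_tokens, est_cost

# One string per row, concatenating the columns we plan to embed
rows = [
    ("HISTORY",
     "For the last 8 years of his life, Galileo was under house arrest "
     "for espousing this man's theory",
     "Copernicus"),
]
texts = [" ".join(fields) for fields in rows]
tokens, cost = estimate_embedding_cost(texts)
print(f"~{tokens:.0f} tokens, ~${cost:.6f}")
```

Scaled to 100 rows of comparable length, this lands in the same sub-cent ballpark as the tiktoken-based calculation above.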

Step 1: Create a Schema

First, we need to define the underlying data structure and some configurations:

  • class: What will the collection of objects in this vector space be called?
  • properties: The properties of an object, including the property name and data type. In the Pandas DataFrame analogy, these would be the columns in the DataFrame.
  • vectorizer: The model that generates the embeddings. For text objects, you will typically select one of the text2vec modules (text2vec-cohere, text2vec-huggingface, text2vec-openai, or text2vec-palm) according to the provider you are using.
  • moduleConfig: Here, you can define the details of the used modules. E.g., the vectorizer is a module for which you can define which model and version to use.
class_obj = {
    # Class definition
    "class": "JeopardyQuestion",

    # Property definitions
    "properties": [
        {
            "name": "category",
            "dataType": ["text"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
    },
}

In the above schema, you can see that we will create a class called "JeopardyQuestion" with the three text properties "category", "question", and "answer". The vectorizer we are using is OpenAI’s Ada model (version 2). All properties will be vectorized, but not the class name ("vectorizeClassName": False). If you have properties you don’t want to embed, you can specify this (see the docs).
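As a sketch of what that looks like: setting "skip": True in a property’s moduleConfig stores the property without embedding it. The "airDate" property below is a hypothetical extra column, not part of our actual schema:

```python
# Hypothetical property definition that is stored and filterable,
# but excluded from vectorization. The "airDate" name is made up for
# illustration; our schema only has "category", "question", "answer".
skipped_property = {
    "name": "airDate",
    "dataType": ["text"],
    "moduleConfig": {
        "text2vec-openai": {
            "skip": True,  # do not include this property in the vector
        }
    },
}
```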

Once you have defined the schema, you can create the class with the create_class() method.

client.schema.create_class(class_obj)
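A side note in case you re-run your notebook: creating a class that already exists raises an error. Under that assumption, one way to start fresh is the sketch below (requires the live client from the setup section; note that deleting a class also irreversibly deletes all of its objects):

```python
# Delete the class if it already exists, then re-create it.
# Careful: delete_class also removes all objects stored in the class!
if client.schema.exists("JeopardyQuestion"):
    client.schema.delete_class("JeopardyQuestion")
client.schema.create_class(class_obj)
```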

To check if the class has been created successfully, you can review its schema as follows:

client.schema.get("JeopardyQuestion")

The created schema looks as shown below:

{
  "class": "JeopardyQuestion",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text",
      "vectorizeClassName": false
    }
  },
  "properties": [
    {
      "dataType": ["text"],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "category",
      "tokenization": "word"
    },
    {
      "dataType": ["text"],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "question",
      "tokenization": "word"
    },
    {
      "dataType": ["text"],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "answer",
      "tokenization": "word"
    }
  ],
  "replicationConfig": {
    "factor": 1
  },
  "shardingConfig": {
    "virtualPerPhysical": 128,
    "desiredCount": 1,
    "actualCount": 1,
    "desiredVirtualCount": 128,
    "actualVirtualCount": 128,
    "key": "_id",
    "strategy": "hash",
    "function": "murmur3"
  },
  "vectorIndexConfig": {
    "skip": false,
    "cleanupIntervalSeconds": 300,
    "maxConnections": 64,
    "efConstruction": 128,
    "ef": -1,
    "dynamicEfMin": 100,
    "dynamicEfMax": 500,
    "dynamicEfFactor": 8,
    "vectorCacheMaxObjects": 1000000000000,
    "flatSearchCutoff": 40000,
    "distance": "cosine",
    "pq": {
      "enabled": false,
      "bitCompression": false,
      "segments": 0,
      "centroids": 256,
      "encoder": {
        "type": "kmeans",
        "distribution": "log-normal"
      }
    }
  },
  "vectorIndexType": "hnsw",
  "vectorizer": "text2vec-openai"
}

Step 2: Import data into Weaviate

At this stage, the vector database has a schema but is still empty. So, let’s populate it with our dataset. This process is also called “upserting”.

We will upsert the data in batches of 200. If you paid attention, you know this isn’t strictly necessary here because we only have 100 rows of data. But once you are ready to upsert larger amounts of data, you will want to do this in batches. That’s why I’ll leave the code for batching here:

from weaviate.util import generate_uuid5

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for _, row in df.iterrows():
        question_object = {
            "category": row.category,
            "question": row.question,
            "answer": row.answer,
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )

Although Weaviate will generate a universally unique identifier (UUID) automatically, we will manually generate the uuid with the generate_uuid5() function from the question_object to avoid importing duplicate items.
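The point of a deterministic UUID is that re-importing the same object yields the same identifier, so duplicates overwrite instead of accumulating. The idea can be sketched with Python’s standard library (Weaviate’s generate_uuid5 is likewise based on a version-5 UUID; the exact serialization below is an assumption for illustration):

```python
import json
import uuid

def content_uuid(obj: dict) -> str:
    """Derive a deterministic version-5 UUID from an object's content,
    so identical objects always map to the same identifier."""
    serialized = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, serialized))

a = {"category": "HISTORY", "question": "Who was the first US president?", "answer": "George Washington"}
b = dict(a)  # same content, different dict instance

assert content_uuid(a) == content_uuid(b)  # same content -> same UUID
assert content_uuid(a) != content_uuid({**a, "answer": "John Adams"})
```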

For a sanity check, you can review the number of imported objects with the following code snippet:

client.query.aggregate("JeopardyQuestion").with_meta_count().do()

{'data': {'Aggregate': {'JeopardyQuestion': [{'meta': {'count': 100}}]}}}

How to Query the Weaviate Vector Database

The most common operation you will do with a vector database is to retrieve objects. To retrieve objects, you query the Weaviate vector database with the get() function:

client.query.get(
    <Class>,
    [<properties>]
).<arguments>.do()
  • Class: specifies the name of the class of objects to be retrieved. Here: "JeopardyQuestion"
  • properties: specifies the properties of the objects to be retrieved. Here: one or more of "category", "question", and "answer".
  • arguments: specifies the search criteria to retrieve the objects, such as limits or aggregations. We will cover some of these in the following examples.
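As a concrete instance of this pattern, a minimal semantic search could look like the sketch below (it assumes the live client from the setup section; with_near_text relies on the vectorizer module we configured, and the "animals" concept is just an example query):

```python
response = (
    client.query
    .get("JeopardyQuestion", ["question", "answer"])
    .with_near_text({"concepts": ["animals"]})  # example search concept
    .with_limit(2)                              # return the two closest objects
    .do()
)
```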
