Integrate the newly added vector index into LangChain to enhance your RAG applications
Since the advent of ChatGPT six months ago, the technology landscape has undergone a transformative shift. ChatGPT's exceptional capacity for generalization has reduced the need for specialized deep learning teams and extensive training datasets to create custom NLP models. This has democratized access to a variety of NLP tasks, such as summarization and information extraction, making them more accessible than ever before. However, we soon realized the limitations of ChatGPT-like models, such as the knowledge cutoff date and the lack of access to private information. In my opinion, what followed was the second wave of generative AI transformation with the rise of Retrieval Augmented Generation (RAG) applications, where you feed relevant information to the model at query time to construct better and more accurate answers.
As mentioned, RAG applications require a smart search tool that is capable of retrieving additional information based on the user input, which allows the LLMs to produce more accurate and up-to-date answers. At first, the focus was mostly on retrieving information from unstructured text using semantic search. However, it soon became evident that a combination of structured and unstructured data is the best approach for RAG applications if you want to move beyond "Chat with your PDF" applications.
Neo4j was and is a great fit for handling structured information, but it struggled a bit with semantic search due to its brute-force approach. However, the struggle is in the past as Neo4j has introduced a new vector index in version 5.11 designed to efficiently perform semantic search over unstructured text or other embedded data modalities. The newly added vector index makes Neo4j a great fit for most RAG applications, as it now works well with both structured and unstructured data.
In this blog post, I'll show you how to set up a vector index in Neo4j and integrate it into the LangChain ecosystem. The code is available on GitHub.
Neo4j Environment setup
You need to set up a Neo4j 5.11 or greater instance to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can also set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.
After you have instantiated the Neo4j database, you can use the LangChain library to connect to it.
from langchain.graphs import Neo4jGraph

NEO4J_URI = "neo4j+s://1234.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "-"

graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
)
Setting up the Vector Index
The Neo4j vector index is powered by Lucene, which implements a Hierarchical Navigable Small World (HNSW) graph to perform an approximate nearest neighbor (ANN) query over the vector space.
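For intuition, it helps to contrast the index with the brute-force approach mentioned earlier: an exact search must score every stored vector against the query and sort, while HNSW navigates a layered proximity graph to find approximate neighbors much faster. A minimal sketch of the exact baseline (illustrative only, not Neo4j's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k=3):
    """Exact k-nearest neighbors: score every vector, sort, take the top k.
    This O(n) scan is what an HNSW index approximates far more cheaply."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
result = brute_force_knn([1.0, 0.2], vectors, k=2)
print([i for i, _ in result])  # → [0, 2]
```

The trade-off is that ANN search may occasionally miss a true nearest neighbor in exchange for sublinear query time, which is usually acceptable for semantic retrieval.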
Neo4j's implementation of the vector index is designed to index a single node property of a node label. For example, if you wanted to index nodes with the label Chunk on their node property embedding, you would use the following Cypher procedure.
CALL db.index.vector.createNodeIndex(
'wikipedia', // index name
'Chunk', // node label
'embedding', // node property
1536, // vector size
'cosine' // similarity metric
)
Along with the index name, node label, and property, you must specify the vector size (embedding dimension) and the similarity metric. We will be using OpenAI's text-embedding-ada-002 embedding model, which uses a vector size of 1536 to represent text in the embedding space. At the moment, only the cosine and Euclidean similarity metrics are available. OpenAI suggests using the cosine similarity metric with their embedding model.
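A side note on the choice of metric: OpenAI's embeddings are normalized to unit length, and for unit vectors cosine similarity and Euclidean distance produce the same ranking, because the squared distance satisfies ||a − b||² = 2 − 2·cos(a, b). A quick sketch verifying that identity on hand-made unit vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 1.0, 2.0])
b = normalize([1.0, 2.0, 2.0])

# For unit-length vectors: ||a - b||^2 == 2 - 2 * cos(a, b),
# so sorting by distance ascending equals sorting by cosine descending.
assert abs(euclidean(a, b) ** 2 - (2 - 2 * cosine(a, b))) < 1e-9
```

In other words, with normalized embeddings the two available metrics agree on which neighbors are closest; cosine is simply the conventional choice.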
Populating the Vector index
Neo4j is schema-less by design, which means it doesn't enforce any restrictions on what goes into a node property. For example, the embedding property of the Chunk node could store integers, lists of integers, or even strings. Let's try this out.
WITH [1, [1,2,3], ["2","5"], [x in range(0, 1535) | toFloat(x)]] AS exampleValues
UNWIND range(0, size(exampleValues) - 1) as index
CREATE (:Chunk {embedding: exampleValues[index], index: index})
This query creates a Chunk node for each element in the list and uses the element as the embedding property value. For example, the first Chunk node will have the embedding property value 1, the second node [1,2,3], and so on. Neo4j doesn't enforce any rules on what you can store under node properties. However, the vector index has clear restrictions on the type of values and the embedding dimension it will index.
We can test which values were indexed by performing a vector index search.
CALL db.index.vector.queryNodes(
'wikipedia', // index name
3, // topK neighbors to return
[x in range(0,1535) | toFloat(x) / 2] // input vector
)
YIELD node, score
RETURN node.index AS index, score
If you run this query, you'll get only a single node returned, even though you requested the top 3 neighbors. Why is that so? The vector index only indexes property values where the value is a list of floats with the specified size. In this example, only one embedding property value was a list of floats with the configured length of 1536.
A node is indexed by the vector index if all the following are true:
- The node contains the configured label.
- The node contains the configured property key.
- The respective property value is of type LIST.
- The size() of the respective value is the same as the configured dimensionality.
- The value is a valid vector for the configured similarity function.
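These rules can also be mirrored client-side before writing data, for example with a small validation helper like the sketch below (the constant and the helper function are my own, not part of any library; Neo4j applies the actual checks internally when indexing):

```python
import math

# Must match the dimension the index was created with (assumed 1536 here,
# the size of OpenAI's text-embedding-ada-002 vectors).
EXPECTED_DIMENSION = 1536

def is_indexable_vector(value, dimension=EXPECTED_DIMENSION):
    """Check whether a property value would plausibly be picked up by the
    vector index: a list of finite numbers of the configured dimension.
    A zero vector is rejected because it is not a valid vector for the
    cosine similarity function (its norm is zero)."""
    if not isinstance(value, list):
        return False  # wrong type: scalar, string, etc.
    if len(value) != dimension:
        return False  # wrong dimensionality
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in value):
        return False  # non-numeric or NaN/inf entries
    if all(x == 0 for x in value):
        return False  # zero vector is invalid for cosine similarity
    return True

print(is_indexable_vector([x / 2.0 for x in range(1536)]))  # → True
print(is_indexable_vector([1, 2, 3]))                       # → False
```

This mirrors why only one of the four example values above was indexed: the others failed the type or dimensionality checks.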
Integrating the vector index into the LangChain ecosystem
Now we'll implement a simple custom LangChain class that will use the Neo4j vector index to retrieve relevant information and generate accurate and up-to-date answers. But first, we have to populate the vector index.
The task will consist of the following steps:
- Retrieve a Wikipedia article
- Chunk the text
- Store the text together with its vector representation in Neo4j
- Implement a custom LangChain class to support RAG applications
In this example, we'll fetch only a single Wikipedia article. I have decided to use the Baldur's Gate 3 page.
import wikipedia
bg3 = wikipedia.page(pageid=60979422)
Next, we need to chunk and embed the text. We'll split the text by section using the double newline delimiter and then use OpenAI's embedding model to represent each section with an appropriate vector.
import os
from langchain.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "API_KEY"
embeddings = OpenAIEmbeddings()
chunks = [{'text': el, 'embedding': embeddings.embed_query(el)}
          for el in bg3.content.split("\n\n") if len(el) > 50]
Before we move on to the LangChain class, we need to import the text chunks into Neo4j.
graph.query("""
UNWIND $data AS row
CREATE (c:Chunk {text: row.text})
WITH c, row
CALL db.create.setVectorProperty(c, 'embedding', row.embedding)
YIELD node
RETURN distinct 'done'
""", {'data': chunks})
One thing you may notice is that I used the db.create.setVectorProperty procedure to store the vectors in Neo4j. This procedure verifies that the property value is indeed a list of floats. Additionally, it has the added benefit of reducing the storage space of the vector property by roughly 50%. Therefore, it is recommended to always use this procedure to store vectors in Neo4j.
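A back-of-the-envelope sketch of where a roughly 50% saving could come from, under the assumption that the procedure stores the vector as a packed array of 32-bit floats rather than 64-bit doubles (the exact on-disk layout is a Neo4j internal detail, so treat these numbers as illustrative):

```python
DIMENSION = 1536  # text-embedding-ada-002 vector size

bytes_as_doubles = DIMENSION * 8  # 64-bit floats: 8 bytes each
bytes_as_floats = DIMENSION * 4   # 32-bit floats: 4 bytes each

print(bytes_as_doubles)  # → 12288
print(bytes_as_floats)   # → 6144
print(f"saving: {1 - bytes_as_floats / bytes_as_doubles:.0%}")  # → saving: 50%
```

For a single article this is negligible, but across millions of chunk embeddings the halved footprint adds up quickly.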
Now we can implement the custom LangChain class used to retrieve information from the Neo4j vector index and use it to generate answers. First, we'll define the Cypher statement used to retrieve the information.
vector_search = """
WITH $embedding AS e
CALL db.index.vector.queryNodes('wikipedia',3, e) yield node, rating
RETURN node.text AS result
ORDER BY rating DESC
LIMIT 3
"""
As you can see, I have hardcoded the index name and the number k of neighbors to retrieve. You can make these dynamic by adding appropriate parameters if you wish.
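If you do want them dynamic, one option is to pass the index name and k as query parameters, as in this sketch (the helper function and parameter names are my own):

```python
def build_vector_search(index_name: str, k: int, embedding: list):
    """Build a parameterized vector search query plus its parameter map.
    Procedure arguments in Cypher can be supplied as query parameters,
    so the index name and k no longer need to be hardcoded."""
    query = """
    CALL db.index.vector.queryNodes($index, $k, $embedding)
    YIELD node, score
    RETURN node.text AS result
    ORDER BY score DESC
    """
    return query, {"index": index_name, "k": k, "embedding": embedding}

query, params = build_vector_search("wikipedia", 3, [0.0] * 1536)
# graph.query(query, params)  # would run against the connected Neo4j instance
```

This keeps the Cypher text constant while letting callers choose the index and neighbor count per request.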
The custom LangChain class implementation is pretty straightforward.
class Neo4jVectorChain(Chain):
    """Chain for question-answering against a Neo4j vector index."""

    graph: Neo4jGraph = Field(exclude=True)
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    embeddings: OpenAIEmbeddings = OpenAIEmbeddings()
    qa_chain: LLMChain = LLMChain(
        llm=ChatOpenAI(temperature=0), prompt=CHAT_PROMPT)

    def _call(self, inputs: Dict[str, str], run_manager) -> Dict[str, Any]:
        """Embed a question and do a vector search."""
        question = inputs[self.input_key]
        # Embed the question
        embedding = self.embeddings.embed_query(question)
        # Retrieve relevant information from the vector index
        context = self.graph.query(
            vector_search, {'embedding': embedding})
        context = [el['result'] for el in context]
        # Generate the answer
        result = self.qa_chain(
            {"question": question, "context": context},
        )
        final_result = result[self.qa_chain.output_key]
        return {self.output_key: final_result}
I have omitted some boilerplate code to make it more readable. Essentially, when you call the Neo4jVectorChain, the following steps are executed:
- Embed the question using the relevant embedding model
- Use the question embedding to retrieve the most similar content from the vector index
- Use the provided context from the similar content to generate the answer
We can now test our implementation.
vector_qa = Neo4jVectorChain(graph=graph, embeddings=embeddings, verbose=True)
vector_qa.run("What is the gameplay of Baldur's Gate 3 like?")
Response
By using the verbose option, you can also evaluate the retrieved context from the vector index that was used to generate the answer.
Summary
Leveraging Neo4j's new vector indexing capabilities, you can create a unified data source that powers Retrieval Augmented Generation applications effectively. This allows you not only to implement "Chat with your PDF or documentation" solutions but also to conduct real-time analytics, all from a single, robust data source. This multi-purpose utility can streamline your operations and enhance data synergy, making Neo4j a great solution for managing both structured and unstructured data.
As always, the code is available on GitHub.