A beginner’s guide to building a Retrieval Augmented Generation (RAG) application from scratch
The issue with learning in a fast-paced space
Introducing our concept: Retrieval Augmented Generation
The High Level Components of our RAG System
The ordered steps of a querying RAG system
A note from the paper itself
Working through an example — the simplest RAG system
Getting a set of documents
Defining and performing the similarity measure
Adding in an LLM
Areas for improvement
References

Learn critical knowledge for building AI apps, in plain English


Retrieval Augmented Generation, or RAG, is all the rage these days because it introduces a serious capability to large language models like OpenAI’s GPT-4: the ability to use and leverage your own data.

This post will teach you the fundamental intuition behind RAG while providing a simple tutorial to help you get started.

There’s a lot of noise in the AI space, particularly about RAG. Vendors are trying to overcomplicate it. They’re trying to inject their tools, their ecosystems, their vision.

It’s making RAG far more complicated than it needs to be. This tutorial is designed to help beginners learn how to build RAG applications from scratch. No fluff, no (okay, minimal) jargon, no libraries, just a simple step-by-step RAG application.

Jerry from LlamaIndex advocates for building things from scratch to really understand the pieces. Once you do, using a library like LlamaIndex makes more sense.

Build from scratch to learn, then build with libraries to scale.

Let’s start!

You may or may not have heard of Retrieval Augmented Generation, or RAG.

Here’s the definition from the blog post in which Facebook introduced the concept:

Building a model that researches and contextualizes is more challenging, but it’s essential for future advancements. We recently made substantial progress in this realm with our Retrieval Augmented Generation (RAG) architecture, an end-to-end differentiable model that combines an information retrieval component (Facebook AI’s dense-passage retrieval system) with a seq2seq generator (our Bidirectional and Auto-Regressive Transformers [BART] model). RAG can be fine-tuned on knowledge-intensive downstream tasks to achieve state-of-the-art results compared with even the largest pretrained seq2seq language models. And unlike these pretrained models, RAG’s internal knowledge can be easily altered or even supplemented on the fly, enabling researchers and engineers to control what RAG knows and doesn’t know without wasting time or compute power retraining the entire model.

Wow, that’s a mouthful.

To simplify the technique for beginners, we can say that the essence of RAG is adding your own data (via a retrieval tool) to the prompt that you pass into a large language model. As a result, you get an output. That gives you several benefits:

  1. You can include facts in the prompt to help the LLM avoid hallucinations.
  2. You can (manually) refer to sources of truth when responding to a user query, helping to double check any potential issues.
  3. You can leverage data that the LLM might not have been trained on.

At a high level, the components of our RAG system are just:

  1. a collection of documents (formally called a corpus)
  2. an input from the user
  3. a similarity measure between the collection of documents and the user input

Yes, it’s that straightforward.

To start learning and understanding RAG-based systems, you don’t need a vector store; you don’t even need an LLM (at least not to learn and understand the concept).

While it is often portrayed as complicated, it doesn’t have to be.

We’ll perform the following steps in sequence.

  1. Receive a user input
  2. Perform our similarity measure
  3. Post-process the user input and the fetched document(s).

The post-processing is done with an LLM.
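
Before we write the real code, here’s a minimal sketch of that flow. The names similarity, llm, and answer_query are just placeholders, not anything from a library; we’ll build the real versions (Jaccard similarity and a call to llama2 via ollama) over the rest of the post.

def similarity(query, document):
    # Stand-in: count shared words. We'll replace this with Jaccard similarity below.
    return len(set(query.lower().split()) & set(document.lower().split()))

def llm(prompt):
    # Stand-in: echo the prompt back. We'll replace this with a call to llama2 via ollama.
    return prompt

def answer_query(user_input, corpus):
    # 1. receive a user input, 2. fetch the most similar document,
    # 3. post-process the user input and the fetched document with an LLM
    best_document = max(corpus, key=lambda doc: similarity(user_input, doc))
    return llm(f"The user said: {user_input}. The recommended activity is: {best_document}")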

The actual RAG paper is obviously the resource. The problem is that it assumes a LOT of context. It’s more complicated than we need it to be.

For instance, here’s the overview of the RAG system as proposed in the paper.

An overview of RAG from the RAG paper by Lewis, et al.

That’s dense.

It’s great for researchers, but for the rest of us it’s going to be a lot easier to learn step by step by building the system ourselves.

Let’s get back to building RAG from scratch, step by step. Here are the simplified steps that we’ll be working through. While this isn’t technically “RAG”, it’s a good simplified model to learn with that lets us progress to more complicated variations.

Below you can see that we have a simple corpus of ‘documents’ (please be generous 😉).

corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

Now we need a way of measuring the similarity between the user input we’re going to receive and the collection of documents that we’ve organized. Arguably the simplest similarity measure is Jaccard similarity. I’ve written about it in the past (see this post), but the short answer is that Jaccard similarity is the size of the intersection divided by the size of the union of the “sets” of words.

This lets us compare our user input with the source documents.

Side note: preprocessing

A challenge is that if we have a plain string like "Take a leisurely walk in the park and enjoy the fresh air.", we’ll have to pre-process it into a set so that we can perform these comparisons. We’ll do this in the simplest way possible: lower case and split by " ".

def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
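
As a quick sanity check (this little snippet is just illustrative, it isn’t part of the application itself), here’s what the measure gives us for a short query against the hiking document:

# "I like to hike" -> {"i", "like", "to", "hike"}
# The hiking document shares only the word "hike" with the query, so the
# intersection has 1 element and the union has 12, giving 1/12 ≈ 0.083.
print(jaccard_similarity("I like to hike",
                         "Go for a hike and admire the natural scenery."))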

Now we need to define a function that takes in the exact query and our corpus and selects the ‘best’ document to return to the user.

def return_response(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus[similarities.index(max(similarities))]

Now we can run it. We’ll start with a simple prompt.

user_prompt = "What is a leisure activity that you like?"

And a simple user input…

user_input = "I like to hike"

Now we can return our response.

return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'

Congratulations, you’ve built a basic RAG application.

I got 99 problems and bad similarity is one

Now, we’ve opted for a simple similarity measure for learning. But this is going to be problematic because it’s so simple. It has no notion of semantics. It just looks at which words appear in both documents. That means that if we provide a negative example, we’re going to get the same “result”, because it’s still the closest document.

user_input = "I don't like to hike"
return_response(user_input, corpus_of_documents)
'Go for a hike and admire the natural scenery.'

This is a topic that’s going to come up a lot with “RAG”, but for now, rest assured that we’ll address this problem later.

At this point, we have not done any post-processing of the “document” to which we’re responding. So far, we’ve implemented only the “retrieval” part of “Retrieval-Augmented Generation”. The next step is to augment generation by incorporating a large language model (LLM).

To do this, we’re going to use ollama to get up and running with an open source LLM on our local machine. We could just as easily use OpenAI’s gpt-4 or Anthropic’s Claude, but for now we’ll start with the open source llama2 from Meta AI.

This post is going to assume some basic knowledge of large language models, so let’s get right to querying this model.

import requests
import json

First we’re going to define the inputs. To work with this model, we’re going to:

  1. take the user input,
  2. fetch the most similar document (as measured by our similarity measure),
  3. pass that into a prompt to the language model,
  4. then return the result to the user

That introduces a new term, the prompt. In short, it’s the instructions that you provide to the LLM.

When you run this code, you’ll see the streaming result. Streaming is important for user experience.
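
For context on why the code below decodes each line as JSON: ollama’s /api/generate endpoint streams newline-delimited JSON objects, each carrying a chunk of text in a "response" field and a "done" flag. The sketch below is illustrative only; the exact set of fields can vary by ollama version.

import json

# Roughly what the streamed lines look like once decoded (illustrative example only):
example_lines = [
    '{"model": "llama2", "response": " Great", "done": false}',
    '{"model": "llama2", "response": "!", "done": false}',
    '{"model": "llama2", "response": "", "done": true}',
]
for line in example_lines:
    chunk = json.loads(line)
    print(chunk["response"], end="")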

user_input = "I like to hike"
relevant_document = return_response(user_input, corpus_of_documents)
full_response = []
prompt = """
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.
"""

Having defined that, let’s now make the API call to ollama (and llama2). An important step is to make sure ollama is already running on your local machine by running ollama serve (and that you’ve pulled the model beforehand, e.g. with ollama pull llama2).

Note: this might be slow on your machine; it’s certainly slow on mine. Be patient, young grasshopper.

url = 'http://localhost:11434/api/generate'
data = {
    "model": "llama2",
    "prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
    count = 0
    for line in response.iter_lines():
        # filter out keep-alive new lines
        if line:
            decoded_line = json.loads(line.decode('utf-8'))
            full_response.append(decoded_line['response'])
            # count += 1
            # if count % 5 == 0:
            #     print(decoded_line['response'])  # print every fifth token
finally:
    response.close()
print(''.join(full_response))

Great! Based on your interest in hiking, I recommend trying out the nearby trails for a challenging and rewarding experience with breathtaking views Great! Based on your interest in hiking, I recommend checking out the nearby trails for a fun and challenging adventure.

This gives us a complete RAG application, from scratch, with no providers and no services: all of the components of a Retrieval-Augmented Generation application. Visually, here’s what we’ve built.

The LLM (if you’re lucky) will handle user input that goes against the recommended document. We can see that below.

user_input = "I don't like to hike"
relevant_document = return_response(user_input, corpus_of_documents)
# https://github.com/jmorganca/ollama/blob/main/docs/api.md
full_response = []
prompt = """
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input.
"""
url = 'http://localhost:11434/api/generate'
data = {
    "model": "llama2",
    "prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
}
headers = {'Content-Type': 'application/json'}
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
try:
    for line in response.iter_lines():
        # filter out keep-alive new lines
        if line:
            decoded_line = json.loads(line.decode('utf-8'))
            # print(decoded_line['response'])  # uncomment to print the results, token by token
            full_response.append(decoded_line['response'])
finally:
    response.close()
print(''.join(full_response))
Sure, here is my response:
Sure, here is my response:

Try kayaking instead! It's a great way to enjoy nature without having to hike.
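
Since we’ve now written the same request-and-stream loop twice, here’s an optional tidy-up that wraps retrieval plus generation into one function. It’s just a repackaging of the code above (reusing the prompt, return_response, requests, and json we already defined), not anything new:

def rag_respond(user_input, corpus):
    # Retrieve the most similar document, format the prompt, and stream llama2's answer from ollama.
    relevant_document = return_response(user_input, corpus)
    data = {
        "model": "llama2",
        "prompt": prompt.format(user_input=user_input, relevant_document=relevant_document)
    }
    response = requests.post('http://localhost:11434/api/generate',
                             data=json.dumps(data),
                             headers={'Content-Type': 'application/json'},
                             stream=True)
    full_response = []
    try:
        for line in response.iter_lines():
            if line:
                full_response.append(json.loads(line.decode('utf-8'))['response'])
    finally:
        response.close()
    return ''.join(full_response)

print(rag_respond("I don't like to hike", corpus_of_documents))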

If we go back to our diagram of the RAG application and think about what we’ve just built, we’ll see various opportunities for improvement. These opportunities are where tools like vector stores, embeddings, and prompt ‘engineering’ get involved.

Here are some potential areas where we could improve the current setup:

  1. The number of documents 👉 more documents might mean more recommendations.
  2. The depth/size of documents 👉 higher quality content and longer documents with more information might be better.
  3. The number of documents we give to the LLM 👉 Right now, we’re only giving the LLM one document. We could feed in several as ‘context’ and allow the model to provide a more personalized recommendation based on the user input.
  4. The parts of documents that we give to the LLM 👉 If we have bigger or more thorough documents, we might just want to add in parts of those documents, parts of various documents, or some variation thereof. In the lexicon, this is called chunking.
  5. Our document storage tool 👉 We might store our documents differently or in a different database. In particular, if we have a lot of documents, we might explore storing them in a data lake or a vector store.
  6. The similarity measure 👉 How we measure similarity is of consequence; we might need to trade off performance and thoroughness (e.g., looking at every individual document).
  7. The pre-processing of the documents & user input 👉 We might perform some extra preprocessing or augmentation of the user input before we pass it into the similarity measure. For instance, we might use an embedding to convert that input to a vector (see the sketch after this list).
  8. The similarity measure 👉 We can change the similarity measure to fetch better or more relevant documents.
  9. The model 👉 We can change the final model that we use. We’re using llama2 above, but we could just as easily swap in another model, like Anthropic’s Claude.
  10. The prompt 👉 We could use a different prompt for the LLM/model and tune it according to the output we want, in order to get the output we want.
  11. If you’re worried about harmful or toxic output 👉 We could implement a “circuit breaker” of sorts that runs the user input to see if there are toxic, harmful, or dangerous discussions. For instance, in a healthcare context you could check whether the information contains unsafe language and respond accordingly, outside of the typical flow.
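
To make point 7 a little more concrete, here’s a sketch of what an embedding-based retrieval step could look like. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, neither of which is used anywhere else in this tutorial; treat it as one possible drop-in for our Jaccard-based return_response, not the definitive approach.

# A sketch of swapping Jaccard similarity for embedding-based cosine similarity.
# Assumes: pip install sentence-transformers (not used elsewhere in this post).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_return_response(query, corpus):
    # Encode the query and every document into vectors, then return the document
    # whose vector has the highest cosine similarity with the query's vector.
    query_embedding = model.encode(query, convert_to_tensor=True)
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    return corpus[int(scores.argmax())]

print(embedding_return_response("I like to hike", corpus_of_documents))

Whether this handles the negation example better is worth testing yourself; dense embeddings capture far more semantics than word overlap, but negation is still a notoriously tricky case.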

The scope for improvement isn’t limited to these points; the possibilities are vast, and we’ll delve into them in future tutorials. Until then, don’t hesitate to reach out on Twitter if you have any questions. Happy RAGing :).

This post was originally published on learnbybuilding.ai. I’m running a course on How to Build Generative AI Products for Product Managers in the coming months, sign up here.
