And how you can do the same with your docs
For the past six months, I've been working at series A startup Voxel51, creator of the open source computer vision toolkit FiftyOne. As a machine learning engineer and developer evangelist, my job is to listen to our open source community and bring them what they need: new features, integrations, tutorials, workshops, you name it.
A few weeks ago, we added native support for vector search engines and text similarity queries to FiftyOne, so that users can find the most relevant images in their datasets (which are often massive, containing millions or tens of millions of samples) via simple natural language queries.
This put us in a curious position: it was now possible for people using open source FiftyOne to readily search datasets with natural language queries, but using our documentation still required traditional keyword search.
We have a lot of documentation, which has its pros and cons. As a user myself, I sometimes find that given the sheer quantity of documentation, finding exactly what I'm looking for takes more time than I'd like.
I was not going to let this fly… so I built this in my spare time:
So, here's how I turned our docs into a semantically searchable vector database:
You can find all of the code for this post in the voxel51/fiftyone-docs-search repo, and it's easy to install the package locally in edit mode with pip install -e .
Better yet, if you want to implement semantic search for your own website using this method, you can follow along! Here are the ingredients you'll need:
- Install the openai Python package and create an account: you'll use this account to send your docs and queries to an inference endpoint, which will return an embedding vector for each piece of text.
- Install the qdrant-client Python package and launch a Qdrant server via Docker: you'll use Qdrant to create a locally hosted vector index for the docs, against which queries will be run. The Qdrant service will run inside a Docker container.
My company's docs are all hosted as HTML documents at https://docs.voxel51.com. A natural starting point would have been to download these docs with Python's requests library and parse them with Beautiful Soup.
As a developer (and author of many of our docs), however, I thought I could do better. I already had a working clone of the GitHub repository on my local computer that contained all of the raw files used to generate the HTML docs. Some of our docs are written in Sphinx reStructuredText (RST), whereas others, like tutorials, are converted to HTML from Jupyter notebooks.
I figured (mistakenly) that the closer I could get to the raw text of the RST and Jupyter files, the simpler things would be.
RST
In RST documents, sections are delineated by lines consisting only of strings of =, -, or _. For example, here's a document from the FiftyOne User Guide which contains all three delineators:
I could then remove all of the RST keywords, such as toctree, code-block, and button_link (there were many more), as well as the :, ::, and .. that accompanied a keyword, the start of a new block, or block descriptors.
Links were easy to take care of too:
no_links_section = re.sub(r"<[^>]+>_?","", section)
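The RST handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the function names split_rst_sections and strip_links are hypothetical, and it assumes every section title is a non-empty line immediately followed by an underline made of =, -, or _ characters. Any text before the first title is dropped.

```python
import re

# A line made up only of '=', '-' or '_' characters (assumed underline styles)
UNDERLINE = re.compile(r"^(=+|-+|_+)\s*$")


def split_rst_sections(text):
    """Return a list of (title, body) pairs, one per underlined title."""
    lines = text.splitlines()
    sections = []  # each entry is [title, list_of_body_lines]
    i = 0
    while i < len(lines):
        is_title = (
            i + 1 < len(lines)
            and lines[i].strip()
            and UNDERLINE.match(lines[i + 1])
        )
        if is_title:
            sections.append([lines[i].strip(), []])
            i += 2  # skip the title line and its underline
            continue
        if sections:
            sections[-1][1].append(lines[i])
        i += 1
    return [(title, "\n".join(body).strip()) for title, body in sections]


def strip_links(section):
    """Drop <target> link targets, using the same pattern as the post."""
    return re.sub(r"<[^>]+>_?", "", section)
```

Overlined titles and directive options are not handled here, which is part of why raw RST turned out to be awkward to work with.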
Things started to get dicey when I wanted to extract the section anchors from RST files. Many of our sections had anchors specified explicitly, whereas others were left to be inferred during the conversion to HTML.
Here is an example:
.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() ` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel `, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel `, and vice versa.

.. image:: /images/brain/brain-mnist.png
   :alt: mnist
   :align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
-----------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() `
support a variety of ways to generate embeddings for your data:
In the brain.rst file in our User Guide docs (a portion of which is reproduced above), the Visualizing embeddings section has an anchor #brain-embeddings-visualization specified by .. _brain-embeddings-visualization:. The Embedding methods subsection which immediately follows, however, is given an auto-generated anchor.
Another difficulty that soon reared its head was how to deal with tables in RST. List tables were fairly straightforward. For instance, here's a list table from our View Stages cheat sheet:
.. list-table::

   * - :meth:`match() `
   * - :meth:`match_frames() `
   * - :meth:`match_labels() `
   * - :meth:`match_tags() `
Grid tables, on the other hand, can get messy fast. They give docs writers great flexibility, but this same flexibility makes parsing them a pain. Take this table from our Filtering cheat sheet:
+-----------------------------------------+-----------------------------------------------------------------------+
| Operation                               | Command                                                               |
+=========================================+=======================================================================+
| Filepath starts with "/Users"           | .. code-block::                                                       |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").starts_with("/Users"))                     |
+-----------------------------------------+-----------------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png" | .. code-block::                                                       |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").ends_with(("10.jpg", "10.png")))           |
+-----------------------------------------+-----------------------------------------------------------------------+
| Label contains string "be"              | .. code-block::                                                       |
|                                         |                                                                       |
|                                         |     ds.filter_labels(                                                 |
|                                         |         "predictions",                                                |
|                                         |         F("label").contains_str("be"),                                |
|                                         |     )                                                                 |
+-----------------------------------------+-----------------------------------------------------------------------+
| Filepath contains "088" and is JPEG     | .. code-block::                                                       |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").re_match("088*.jpg"))                      |
+-----------------------------------------+-----------------------------------------------------------------------+
Within a table, rows can take up arbitrary numbers of lines, and columns can vary in width. Code blocks inside grid table cells are also difficult to parse, as they occupy space on multiple lines, so their content is interspersed with content from other columns. This means that code blocks in these tables need to be effectively reconstructed during the parsing process.
Not the end of the world. But also not ideal.
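To make the reconstruction problem concrete, here is a minimal sketch of one way to rebuild the cells of a grid table. It is not the approach used in the repo. It assumes a simple table with no row or column spans, where the column boundaries are the positions of the + characters in the first border line; parse_grid_table is a hypothetical helper name.

```python
import textwrap


def parse_grid_table(table_text):
    """Return a list of rows, each row being a list of cell strings."""
    lines = [line.rstrip() for line in table_text.strip().splitlines()]
    # Column boundaries: positions of '+' in the first border line
    bounds = [i for i, ch in enumerate(lines[0]) if ch == "+"]
    spans = list(zip(bounds[:-1], bounds[1:]))

    rows, current = [], None
    for line in lines:
        if line.startswith("+"):
            # A border line closes the row collected so far
            if current is not None:
                rows.append(
                    [textwrap.dedent("\n".join(col)).strip() for col in current]
                )
            current = None
            continue
        if current is None:
            current = [[] for _ in spans]
        # Slice each physical line by column so multi-line cells stay together
        for col, (start, end) in zip(current, spans):
            col.append(line[start + 1:end].rstrip())
    return rows
```

Each cell comes back as a single string with its internal line breaks and relative indentation preserved, so a code block spread over several table lines is stitched back together.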
Jupyter
Jupyter notebooks turned out to be relatively easy to parse. I was able to read the contents of a Jupyter notebook into a list of tuples, with one (source, cell type) tuple per cell:
import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()

# Keep one (source, cell_type) tuple per notebook cell
contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]
Moreover, the sections were delineated by Markdown cells starting with #.
Nevertheless, given the challenges posed by RST, I decided to turn to HTML and treat all of our docs on equal footing.
HTML
I built the HTML docs from my local install with bash generate_docs.bash, and began parsing them with Beautiful Soup. However, I soon realized that when RST code blocks and tables with inline code were converted to HTML, although they rendered correctly, the HTML itself was incredibly unwieldy. Take our filtering cheat sheet, for example.
When rendered in a browser, the code block preceding the Dates and times section of our filtering cheat sheet looks like this:
The raw HTML, however, looks like this:
This is not impossible to parse, but it is also far from ideal.
Markdown
Fortunately, I was able to overcome these issues by converting all of the HTML files to Markdown with markdownify. Markdown had a few key advantages that made it the best fit for this job.
- Cleaner than HTML: code formatting was simplified from the spaghetti strings of span elements to inline code snippets marked with a single ` before and after, and blocks of code were marked by triple backticks ``` before and after. This also made it easy to split into text and code.
- Still contained anchors: unlike raw RST, this Markdown included section heading anchors, as the implicit anchors had already been generated. This way, I could link not just to the page containing the result, but to the specific section or subsection of that page.
- Standardization: Markdown provided a mostly uniform formatting for the initial RST and Jupyter documents, allowing us to give their content consistent treatment in the vector search application.
Note on LangChain
Some of you may know about the open source library LangChain for building applications with LLMs, and may be wondering why I didn't just use LangChain's Document Loaders and Text Splitters. The answer: I needed more control!
Once the documents had been converted to Markdown, I proceeded to clean the contents and split them into smaller segments.
Cleaning
Cleaning mostly consisted of removing unnecessary elements, including:
- Headers and footers
- Table row and column scaffolding, e.g. the |'s in |select()| select_by()|
- Extra newlines
- Links
- Images
- Unicode characters
- Bolding, i.e. **text** → text
I also removed the escape characters in front of characters which have special meaning in our docs: _ and *. The former is used in many method names, and the latter, as usual, is used in multiplication, patterns, and many other places:
document = document.replace("\\_", "_").replace("\\*", "*")
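A few of the cleaning steps listed above can be sketched with regular expressions. This is a minimal illustration under assumptions, not the cleaning code from the repo: the patterns and the name clean_markdown are hypothetical, and heading lines are deliberately left untouched on the assumption that their permalink anchors are still needed later.

```python
import re

IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")   # ![alt](src)
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")    # [text](target)
BOLD = re.compile(r"\*\*(.+?)\*\*")            # **text**


def clean_markdown(document):
    cleaned = []
    for line in document.split("\n"):
        if not line.startswith("#"):
            line = IMAGE.sub("", line)     # drop images entirely
            line = LINK.sub(r"\1", line)   # keep link text, drop the target
            line = BOLD.sub(r"\1", line)   # **text** -> text
        # Unescape characters that carry meaning in code and method names
        line = line.replace("\\_", "_").replace("\\*", "*")
        cleaned.append(line.rstrip())
    # Collapse runs of blank lines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip()
```

Images are removed before links because the link pattern would otherwise match the bracketed part of an image and leave a stray ! behind.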
Splitting documents into semantic blocks
With the contents of our docs cleaned, I proceeded to split the docs into bite-sized blocks.
First, I split each document into sections. At first glance, it seems like this can be done by finding any line that starts with a # character. In my application, I didn't differentiate between h1, h2, h3, and so on (#, ##, ###), so checking the first character is sufficient. However, this logic gets us in trouble when we realize that # is also used for comments in Python code.
To bypass this problem, I split the document into text blocks and code blocks:
text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]
Then I identified the start of a new section as a line in a text block that begins with #. I extracted the section title and anchor from this line:
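def extract_title_and_anchor(header):
    # Drop the leading run of "#" characters
    header = " ".join(header.split(" ")[1:])
    # The title is everything before the permalink "[...]"
    title = header.split("[")[0]
    # The anchor is the first token inside the permalink "(...)"
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor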
And assigned each block of text or code to the appropriate section.
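Putting the two steps together, the assignment of blocks to sections might look roughly like the sketch below. It is an illustration rather than the repo's implementation: split_into_sections and parse_header are hypothetical names, and the header format ('## Title[¶](#anchor "Permalink")') is an assumption about what the converted Markdown looks like.

```python
import re

# Matches the "#anchor" inside a permalink such as [¶](#anchor "Permalink")
ANCHOR = re.compile(r"\]\((#[^\s)]+)")


def parse_header(line):
    """Return (title, anchor); anchor is None if the line has no permalink."""
    text = line.lstrip("#").strip()
    match = ANCHOR.search(text)
    title = text.split("[")[0].strip()
    return title, (match.group(1) if match else None)


def split_into_sections(page_md):
    """Map each section anchor to a list of (block_type, content) tuples."""
    sections = {}
    current = None  # key of the section currently being filled

    def add(block_type, content):
        content = content.strip()
        if current is not None and content:
            sections[current].append((block_type, content))

    for i, chunk in enumerate(page_md.split("```")):
        if i % 2 == 1:
            # Odd chunks sit between fences: code, so '#' lines are comments
            add("code", chunk)
            continue
        buffer = []
        for line in chunk.split("\n"):
            if line.startswith("#"):
                add("text", "\n".join(buffer))  # close the previous text block
                buffer = []
                title, anchor = parse_header(line)
                current = anchor or title
                sections.setdefault(current, [])
            else:
                buffer.append(line)
        add("text", "\n".join(buffer))
    return sections
```

Because headers are only looked for in the even-numbered chunks, a Python comment inside a code block never starts a new section.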
Initially, I also tried splitting the text blocks into paragraphs, hypothesizing that because a section may contain information about many different topics, the embedding for that entire section might not be similar to an embedding for a text prompt concerned with just one of those topics. This approach, however, resulted in top matches for most search queries disproportionately being single-line paragraphs, which turned out not to be terribly informative as search results.
Check out the accompanying GitHub repo for the implementation of these methods, which you can try out on your own docs!
With documents converted, processed, and split into strings, I generated an embedding vector for each of these blocks. Because large language models are flexible and generally capable by nature, I decided to treat both text blocks and code blocks on the same footing as pieces of text, and to embed them with the same model.
I used OpenAI's text-embedding-ada-002 model because it is easy to work with, achieves the highest performance out of all of OpenAI's embedding models (on the BEIR benchmark), and is also the cheapest. It's so cheap in fact ($0.0004/1K tokens) that generating all of the embeddings for the FiftyOne docs only cost a few cents! As OpenAI themselves put it, "We recommend using text-embedding-ada-002 for nearly all use cases. It's better, cheaper, and simpler to use."
With this embedding model, you can generate a 1536-dimensional vector representing any input prompt, up to 8,191 tokens (roughly 30,000 characters).
To get started, you need to create an OpenAI account, generate an API key at https://platform.openai.com/account/api-keys, and export this API key as an environment variable with:
export OPENAI_API_KEY=""
You will also need to install the openai Python library:
pip install openai
I wrote a wrapper around OpenAI’s API that takes in a text prompt and returns an embedding vector:
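import openai  # this snippet uses the pre-1.0 openai package interface

MODEL = "text-embedding-ada-002"

def embed_text(text):
    # Send one string to the embeddings endpoint and return its vector
    response = openai.Embedding.create(
        input=text,
        model=MODEL
    )
    embeddings = response['data'][0]['embedding']
    return embeddings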
To generate embeddings for all of our docs, we just apply this function to each of the subsections (text and code blocks) across all of our docs.
With embeddings in hand, I created a vector index to search against. I chose to use Qdrant for the same reasons we chose to add native Qdrant support to FiftyOne: it's open source, free, and easy to use.
To get started with Qdrant, you can pull a pre-built Docker image and run the container:
docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant
Additionally, you will need to install the Qdrant Python client:
pip install qdrant-client
I created the Qdrant collection:
import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"

def create_index():
    # (Re)create the collection that holds one vector per subsection
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        )
    )
I then created a vector for every subsection (text or code block):
import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type
):
    vector = embed_text(subsection_content)
    id = str(uuid.uuid1().int)[:32]
    # The payload stores everything needed to display and filter a result
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type
    }
    return id, vector, payload
For each vector, you can provide additional context as part of the payload. In this case, I included the URL (and anchor) where the result can be found, the type of document (so the user can specify whether they want to search through all of the docs or just certain types of docs), and the contents of the string which generated the embedding vector. I also added the block type (text or code), so if the user is looking for a code snippet, they can tailor their search to that purpose.
Then I added these vectors to the index, one page at a time:
def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []

    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)

    # Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads
        ),
    )
Once the index has been created, running a search on the indexed documents can be accomplished by embedding the query text with the same embedding model, and then searching the index for similar embedding vectors. With a Qdrant vector index, a basic query can be performed with the Qdrant client's search() command.
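Conceptually, this search step ranks the stored vectors by their similarity to the query vector and returns the best matches. The brute-force sketch below illustrates the idea with the dot product, the metric chosen for the collection above. It is only an illustration: Qdrant uses an approximate nearest neighbor index rather than a full scan, and top_k_by_dot_product is a hypothetical helper, not part of the Qdrant client.

```python
def top_k_by_dot_product(query_vector, points, top_k=3):
    """Rank (id, vector, payload) tuples by dot product with the query.

    Returns a list of (score, id, payload) tuples, best match first.
    """
    scored = [
        (sum(q * v for q, v in zip(query_vector, vector)), point_id, payload)
        for point_id, vector, payload in points
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

With an in-memory list of a few thousand blocks this would already work as a search backend; a vector database earns its keep once the collection grows or filtering is needed.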
To make my company's docs searchable, I wanted to allow users to filter by section of the docs, as well as by the type of block that was encoded. In the parlance of vector search, filtering results while still ensuring that a predetermined number of results (specified by the top_k argument) will be returned is known as pre-filtering.
To achieve this, I wrote a programmatic filter:
from qdrant_client.http import models

def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.
    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.
    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    # Both groups must match; within a group, any one value may match
    _filter = models.Filter(
        must=[
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="doc_type",
                        match=models.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="block_type",
                        match=models.MatchValue(value=bt),
                    )
                    for bt in block_types
                ]
            )
        ]
    )
    return _filter
The internal _parse_doc_types() and _parse_block_types() functions handle cases where the argument is string or list-valued, or is None.
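The post does not show these helpers, but the normalization they perform might look something like the sketch below. Everything here is an assumption for illustration: the shared helper parse_types, the choice to treat None as "search all types", and the validation step are hypothetical. Only the "text" and "code" block types come from the post.

```python
# Block types named in the post; document types would be handled the same way
BLOCK_TYPES = ("text", "code")


def parse_types(value, allowed):
    """Normalize None, a single string, or a list of strings to a list."""
    if value is None:
        # No restriction given: search every allowed type (assumed behavior)
        return list(allowed)
    if isinstance(value, str):
        value = [value]
    unknown = [v for v in value if v not in allowed]
    if unknown:
        raise ValueError(f"Unknown types: {unknown}")
    return list(value)
```

For example, parse_types("code", BLOCK_TYPES) gives ["code"], so the filter-building code can always iterate over a list.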
Then I wrote a function query_index() that takes the user's text query, pre-filters, searches the index, and extracts relevant information from the payload. The function returns a list of tuples of the form (url, contents, score), where the score indicates how good of a match the result is to the query text.
def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    # _search_params is not shown in this post; it is defined elsewhere
    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
        search_params=_search_params,
    )

    # Build (url with anchor, contents, score) tuples from the payloads
    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]
    return results
The ultimate step was providing a clean interface for the user to semantically search against these “vectorized” docs.
I wrote a function print_results(), which takes the query, results from query_index(), and a score argument (whether or not to print the similarity score), and prints the results in an easy to interpret way. I used the rich Python package to format hyperlinks in the terminal so that when working in a terminal that supports hyperlinks, clicking on the hyperlink will open the page in your default browser. I also used webbrowser to automatically open the link for the top result, if desired.
For Python-based searches, I created a class FiftyOneDocsSearch to encapsulate the document search behavior, so that once a FiftyOneDocsSearch object has been instantiated (potentially with default settings for search arguments):
from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)
You can search within Python by calling this object. To query the docs for "How to load a dataset", for instance, you just need to run:
fosearch("How to load a dataset")
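The pattern behind such a class is a callable object that remembers default search arguments. The sketch below shows the general idea only. DocsSearch is a hypothetical stand-in, not the FiftyOneDocsSearch class from the repo, and it takes the search function as an argument so that it runs without a Qdrant server.

```python
class DocsSearch:
    """Callable wrapper that stores default search arguments."""

    def __init__(self, search_fn, top_k=10, doc_types=None, block_types=None):
        # search_fn is expected to accept the same keyword arguments as
        # the query_index() function described in the post
        self.search_fn = search_fn
        self.defaults = {
            "top_k": top_k,
            "doc_types": doc_types,
            "block_types": block_types,
        }

    def __call__(self, query, **overrides):
        # Per-call keyword arguments take precedence over stored defaults
        kwargs = {**self.defaults, **overrides}
        return self.search_fn(query, **kwargs)
```

Defaults are set once at construction, and any single call can still override them, for example by passing top_k=1 for a quick lookup.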
I also used argparse to make this docs search functionality available via the command line. When the package is installed, the docs are CLI searchable with:
fiftyone-docs-search query ""
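A command line interface of this shape can be sketched with argparse subcommands. Only the program name and the query subcommand come from the post. The option names below (--top-k, --doc-types, --block-types, --score) are hypothetical and may not match the flags in the actual package.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="fiftyone-docs-search")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # "query" subcommand: one positional query string plus optional filters
    query = subparsers.add_parser("query", help="Search the docs")
    query.add_argument("text", help="Natural language query")
    query.add_argument("--top-k", type=int, default=10)
    query.add_argument("--doc-types", nargs="*", default=None)
    query.add_argument("--block-types", nargs="*", default=None)
    query.add_argument("--score", action="store_true",
                       help="Print the similarity score for each result")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

The parsed arguments map directly onto the keyword arguments of a search function, so the command line entry point can stay a thin wrapper.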
Just for fun, because fiftyone-docs-search query is a bit cumbersome, I added an alias to my .zshrc file:
alias fosearch='fiftyone-docs-search query'
With this alias, the docs are searchable from the command line with:
fosearch "" args
Coming into this, I already fancied myself a power user of my company's open source Python library, FiftyOne. I had written many of the docs, and I had used (and continue to use) the library on a daily basis. But the process of turning our docs into a searchable database forced me to understand our docs on an even deeper level. It's always great when you're building something for others, and it ends up helping you as well!
Here’s what I learned:
- Sphinx RST is cumbersome: it makes beautiful docs, but it is a bit of a pain to parse
- Don't go crazy with preprocessing: OpenAI's text-embedding-ada-002 model is great at understanding the meaning behind a text string, even if it has slightly atypical formatting. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
- Small semantically meaningful snippets are best: break your documents up into the smallest possible meaningful segments, and retain context. For longer pieces of text, it is more likely that overlap between a search query and a part of the text in your index will be obscured by less relevant text in the segment. If you break the document up too small, you run the risk that many entries in the index will contain very little semantic information.
- Vector search is powerful: with minimal lift, and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. From initial estimates, it appears that this improved docs search is more than twice as likely to return relevant results as the old keyword search approach. Furthermore, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries, and are guaranteed to get the specified number of results.
If you find yourself (or others) constantly digging or sifting through treasure troves of documentation for specific kernels of information, I encourage you to adapt this process for your own use case. You can modify this to work for your personal documents, or your company's archives. And if you do, I guarantee you'll walk away from the experience seeing your documents in a new light!
Here are a few ways you could extend this for your own docs!
- Hybrid search: combine vector search with traditional keyword search
- Go global: use Qdrant Cloud to store and query the collection in the cloud
- Incorporate web data: use requests to download HTML directly from the web
- Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
- Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar
All code used to build the package is open source, and can be found in the voxel51/fiftyone-docs-search repo.