How I Turned My Company’s Docs into a Searchable Database with OpenAI
Converting the docs to a unified format
Processing the documents
Embedding text and code blocks with OpenAI
Making a Qdrant vector index
Querying the index
Writing the search wrapper
Conclusion

Image courtesy of Unsplash.
Semantically search your company’s docs from the command line. Image courtesy of author.
  • Install the openai Python package and create an account: you’ll use this account to send your docs and queries to an inference endpoint, which will return an embedding vector for each piece of text.
  • Install the qdrant-client Python package and launch a Qdrant server via Docker: you’ll use Qdrant to create a locally hosted vector index for the docs, against which queries can be run. The Qdrant service runs inside a Docker container.

RST

RST document from open source FiftyOne Docs. Image courtesy of author.
no_links_section = re.sub(r"<[^>]+>_?","", section)
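As a quick illustration, the substitution above strips the angle-bracketed targets that RST cross-references carry (the input string here is a hypothetical fragment, not taken from the docs):

```python
import re

# Hypothetical RST fragment with a cross-reference target
section = "select points in the :ref:`Embeddings panel <brain-embeddings-panel>`"

# Drop the <target> (and an optional trailing underscore) from RST links
no_links_section = re.sub(r"<[^>]+>_?", "", section)
print(no_links_section)
# -> "select points in the :ref:`Embeddings panel `"
```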
.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() ` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel `, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel `, and vice versa.

.. image:: /images/brain/brain-mnist.png
:alt: mnist
:align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
-----------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() `
support a variety of ways to generate embeddings for your data:

.. list-table::

* - :meth:`match() `
* - :meth:`match_frames() `
* - :meth:`match_labels() `
* - :meth:`match_tags() `

+-----------------------------------------+-------------------------------------------------------------+
| Operation                               | Command                                                     |
+=========================================+=============================================================+
| Filepath starts with "/Users"           | .. code-block::                                             |
|                                         |                                                             |
|                                         |    ds.match(F("filepath").starts_with("/Users"))            |
+-----------------------------------------+-------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png" | .. code-block::                                             |
|                                         |                                                             |
|                                         |    ds.match(F("filepath").ends_with(("10.jpg", "10.png")))  |
+-----------------------------------------+-------------------------------------------------------------+
| Label contains string "be"              | .. code-block::                                             |
|                                         |                                                             |
|                                         |    ds.filter_labels(                                        |
|                                         |        "predictions",                                       |
|                                         |        F("label").contains_str("be"),                       |
|                                         |    )                                                        |
+-----------------------------------------+-------------------------------------------------------------+
| Filepath contains "088" and is JPEG     | .. code-block::                                             |
|                                         |                                                             |
|                                         |    ds.match(F("filepath").re_match("088*.jpg"))             |
+-----------------------------------------+-------------------------------------------------------------+

Jupyter

import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()
contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]

HTML

Screenshot from cheat sheet in open source FiftyOne Docs. Image courtesy of author.
RST cheat sheet converted to HTML. Image courtesy of author.

Markdown

  1. Cleaner than HTML: code formatting was simplified from the spaghetti strings of span elements to inline code snippets marked with single backticks ` before and after, and blocks of code were marked by triple backticks ``` before and after. This also made it easy to separate the content into text and code.
  2. Still contained anchors: unlike raw RST, this Markdown included section heading anchors, because the implicit anchors had already been generated. This way, I could link not only to the page containing a result, but to the specific section or subsection of that page.
  3. Standardization: Markdown provided a mostly uniform format for the initial RST and Jupyter documents, allowing us to give their content consistent treatment in the vector search application.

Note on LangChain

Cleaning

  • Headers and footers
  • Table row and column scaffolding — e.g. the |’s in |select()| select_by()|
  • Extra newlines
  • Links
  • Images
  • Unicode characters
  • Bolding — i.e. **text** → text
document = document.replace("\_", "_").replace("\*", "*")
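Beyond unescaping, the other cleanup steps in the list above (images, links, bolding, extra newlines) can be sketched with a few regexes; these patterns are illustrative, not the exact rules used for the docs:

```python
import re

def clean_markdown(document):
    # Drop images entirely: ![alt](path)
    document = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", document)
    # Unwrap links, keeping the anchor text: [text](url) -> text
    document = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", document)
    # Remove bolding markers: **text** -> text
    document = document.replace("**", "")
    # Collapse runs of blank lines
    document = re.sub(r"\n{3,}", "\n\n", document)
    return document

print(clean_markdown("See **[the docs](https://docs.voxel51.com)**\n\n\n\nfor more."))
# -> "See the docs\n\nfor more."
```

Note that images are stripped before links, since the link pattern would otherwise match the `[alt](url)` tail of an image and leave a stray `!` behind.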

Splitting documents into semantic blocks

text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]

def extract_title_and_anchor(header):
    header = " ".join(header.split(" ")[1:])
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor
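On a toy page, the parity trick and the header parser behave like this (the header format, with its trailing permalink anchor, is an assumption about what the converted Markdown looks like):

```python
# Toy Markdown page: prose and code alternate around ``` fences
page_md = "Intro text\n```\nprint('hi')\n```\nMore text"
text_and_code = page_md.split("```")
text = text_and_code[::2]   # even entries are prose
code = text_and_code[1::2]  # odd entries are code blocks
print(text)  # -> ['Intro text\n', '\nMore text']
print(code)  # -> ["\nprint('hi')\n"]

def extract_title_and_anchor(header):
    header = " ".join(header.split(" ")[1:])
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor

# Hypothetical header line carrying a permalink anchor
header = '## Embedding methods[](#embedding-methods "Permalink to this headline")'
print(extract_title_and_anchor(header))
# -> ('Embedding methods', '#embedding-methods')
```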
export OPENAI_API_KEY=""
pip install openai
import openai

MODEL = "text-embedding-ada-002"

def embed_text(text):
    response = openai.Embedding.create(
        input=text,
        model=MODEL
    )
    embeddings = response['data'][0]['embedding']
    return embeddings

docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant
pip install qdrant-client
import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"
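A note on the metric choice: OpenAI's ada-002 embeddings are returned unit-normalized, so dot product and cosine similarity give identical rankings, and DOT is the cheaper of the two to compute. A quick sanity check with toy vectors:

```python
import math

def normalize(v):
    # Scale a vector to unit length, mimicking ada-002 output
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy stand-ins for unit-normalized embedding vectors
u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
print(abs(dot(u, v) - cosine) < 1e-12)  # -> True: DOT == cosine on unit vectors
```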

def create_index():
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        )
    )

import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type
):
    vector = embed_text(subsection_content)
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type
    }
    return id, vector, payload

def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []

    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)

    ## Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads
        ),
    )

def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.
    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.
    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    _filter = qmodels.Filter(
        must=[
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="doc_type",
                        match=qmodels.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="block_type",
                        match=qmodels.MatchValue(value=bt),
                    )
                    for bt in block_types
                ]
            )
        ]
    )

    return _filter
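The nested filter encodes (doc_type matches any requested type) AND (block_type matches any requested type): the outer `must` is an AND over its clauses, and each inner `should` is an OR over the allowed values. The same predicate in plain Python, with illustrative payload values:

```python
def matches(payload, doc_types, block_types):
    # Outer `must` = AND of the two inner clauses;
    # each inner `should` = OR over the allowed values
    return (
        any(payload["doc_type"] == dt for dt in doc_types)
        and any(payload["block_type"] == bt for bt in block_types)
    )

payload = {"doc_type": "cheat-sheets", "block_type": "code"}
print(matches(payload, ["cheat-sheets", "tutorials"], ["code"]))  # -> True
print(matches(payload, ["guides"], ["code", "text"]))             # -> False
```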

def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
        search_params=_search_params,
    )

    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]

    return results

Display search results with rich hyperlinks. Image courtesy of author.
from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)
fosearch("How to load a dataset")
Semantically search your company’s docs within a Python process. Image courtesy of author.
fiftyone-docs-search query "<my-query>" <args>
alias fosearch='fiftyone-docs-search query'
fosearch "<my-query>" args
  • Sphinx RST is cumbersome: it makes beautiful docs, but it is a bit of a pain to parse
  • Don’t go crazy with preprocessing: OpenAI’s text-embedding-ada-002 model is great at understanding the meaning behind a text string, even when it has slightly atypical formatting. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
  • Small semantically meaningful snippets are best: break your documents up into the smallest possible meaningful segments, and retain context. For longer pieces of text, it is more likely that overlap between a search query and a part of the text in your index will be obscured by less relevant text in the segment. If you break the document up too small, you run the risk that many entries in the index will contain very little semantic information.
  • Vector search is powerful: with minimal lift, and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. From initial estimates, it appears that this improved docs search is more than twice as likely to return relevant results as the old keyword search approach. Additionally, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries, and are guaranteed to get the desired number of results.
  • Hybrid search: combine vector search with traditional keyword search
  • Go global: use Qdrant Cloud to store and query the collection in the cloud
  • Incorporate web data: use requests to download HTML directly from the web
  • Automate updates: use GitHub Actions to trigger recomputation of embeddings whenever the underlying docs change
  • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar
