A tale of taming unruly documents to create the ultimate GPT-based chatbot
Picture this: you’re at a rapidly growing tech company, and you’ve been given the mission to create a state-of-the-art chatbot using the mind-blowing GPT technology. This chatbot is destined to become the company’s crown jewel, a virtual oracle that’ll answer questions based on the treasure trove of information stored in your Confluence spaces. Sounds like a dream job, right?
But as you take a closer look at the Confluence knowledge base, reality hits. It’s a wild jungle of empty or incomplete pages, irrelevant documents, and duplicate content. It’s like someone dumped a thousand jigsaw puzzles into a giant blender and pressed “start.” And now it’s your job to clean up this mess before you can even think about building that incredible chatbot.
Luckily for you, in this article we’ll embark on an exciting journey to conquer the Confluence chaos, using the power of Python and BERTopic to identify and eliminate those annoying outliers. So buckle up and get ready to transform your knowledge base into the perfect training ground for your cutting-edge GPT-based chatbot.
As you face the daunting task of cleaning up your Confluence knowledge base, you might consider diving in manually, sorting through each document one by one. However, the manual approach is slow, labor-intensive, and error-prone. After all, even the most meticulous worker can overlook essential details or misjudge the relevance of a document.
With your knowledge of Python, you might be tempted to build a heuristic-based solution, using a set of predefined rules to identify and eliminate outliers. While this approach is faster than manual cleanup, it has its limitations. Heuristics can be rigid and struggle to adapt to the complex and ever-evolving nature of your Confluence spaces, often leading to suboptimal results.
Enter Python and BERTopic, a powerful combination that can help you tackle the challenge of cleaning up your Confluence knowledge base more effectively. Python is a versatile programming language, while BERTopic is an advanced topic modeling library that can analyze your documents and group them based on their underlying topics.
In the following paragraphs, we’ll explore how Python and BERTopic can work together to automate the process of identifying and eliminating outliers in your Confluence spaces. By harnessing their combined powers, you’ll save time and resources while increasing the accuracy and effectiveness of your cleanup efforts.
Alright, from this point on I’ll walk you through the process of creating a Python script that uses BERTopic to identify and eliminate outliers in your Confluence knowledge base. The goal is to generate a ranked list of documents based on their “unrelatedness” score (which we’ll define later). The final output will consist of the document’s title, a preview of the text (first 100 characters), and the unrelatedness score. It will look like this:
(Title: “AI in Healthcare”, Preview: “Artificial intelligence is transforming…”, Unrelatedness: 0.95)
(Title: “Office Birthday Party Guidelines”, Preview: “To ensure a fun and safe…”, Unrelatedness: 0.8)
The main steps in this process are:
- Connect to Confluence and download documents: establish a connection to your Confluence account and fetch the documents for processing. This section covers setting up the connection, authenticating, and downloading the necessary data.
- HTML processing and text extraction using Beautiful Soup: use Beautiful Soup, a powerful Python library, to parse the HTML content and extract the text from the Confluence documents. This step involves cleaning up the extracted text, removing unwanted elements, and preparing the data for analysis.
- Apply BERTopic and build the ranking: with the cleaned-up text in hand, apply BERTopic to analyze and group the documents based on their underlying topics. After obtaining the topic representations, calculate the “unrelatedness” score for each document and build a ranking to identify and eliminate outliers in your Confluence knowledge base.
Finally, the code. We’ll start by downloading documents from a Confluence space, then process the HTML content and extract the text for the next phase (BERTopic!).
First, we need to connect to Confluence via its API. Thanks to the atlassian-python-api library, that can be done with a few lines of code. If you don’t have an API token for Atlassian, read this guide to set one up.
import os
import re
from atlassian import Confluence
from bs4 import BeautifulSoup

# Set up the Confluence API client
confluence = Confluence(
    url='YOUR_CONFLUENCE_URL',
    username="YOUR_EMAIL",
    password="YOUR_API_KEY",
    cloud=True)

# Replace SPACE_KEY with the desired Confluence space key
space_key = 'YOUR_SPACE'

def get_all_pages_from_space_with_pagination(space_key):
    limit = 50
    start = 0
    all_pages = []
    while True:
        pages = confluence.get_all_pages_from_space(space_key, start=start, limit=limit)
        if not pages:
            break
        all_pages.extend(pages)
        start += limit
    return all_pages

pages = get_all_pages_from_space_with_pagination(space_key)
After fetching the pages, we’ll create a directory for the text files, extract the pages’ content and save the text content to individual files:
# Function to sanitize filenames
def sanitize_filename(filename):
    return "".join(c for c in filename if c.isalnum() or c in (' ', '.', '-', '_')).rstrip()

# Create a directory for the text files if it doesn't exist
if not os.path.exists('txt_files'):
    os.makedirs('txt_files')

# Extract pages and save them to individual text files
for page in pages:
    page_id = page['id']
    page_title = page['title']
    # Fetch the page content
    page_content = confluence.get_page_by_id(page_id, expand='body.storage')
    # Extract the content in the "storage" format
    storage_value = page_content['body']['storage']['value']
    # Clean the HTML tags to get the text content
    text_content = process_html_document(storage_value)
    file_name = f'txt_files/{sanitize_filename(page_title)}_{page_id}.txt'
    with open(file_name, 'w', encoding='utf-8') as txtfile:
        txtfile.write(text_content)
The function process_html_document carries out all the necessary cleaning tasks to extract the text from the downloaded pages while maintaining a coherent format. How far you want to refine this process depends on your specific requirements. In this case, we focus on handling tables and lists to make sure that the resulting text document retains a format similar to the original layout.
import spacy

nlp = spacy.load("en_core_web_sm")

def html_table_to_text(html_table):
    soup = BeautifulSoup(html_table, "html.parser")

    # Extract table rows
    rows = soup.find_all("tr")

    # Determine if the table has headers or not
    has_headers = any(th for th in soup.find_all("th"))

    # Extract table headers, either from the first row or from the <th> elements
    if has_headers:
        headers = [th.get_text(strip=True) for th in soup.find_all("th")]
        row_start_index = 1  # Skip the first row, as it contains headers
    else:
        first_row = rows[0]
        headers = [cell.get_text(strip=True) for cell in first_row.find_all("td")]
        row_start_index = 1

    # Iterate through rows and cells, and use NLP to generate sentences
    text_rows = []
    for row in rows[row_start_index:]:
        cells = row.find_all("td")
        cell_sentences = []
        for header, cell in zip(headers, cells):
            # Generate a sentence using the header and cell value
            doc = nlp(f"{header}: {cell.get_text(strip=True)}")
            sentence = " ".join([token.text for token in doc if not token.is_stop])
            cell_sentences.append(sentence)
        # Combine cell sentences into a single row text
        row_text = ", ".join(cell_sentences)
        text_rows.append(row_text)

    # Combine row texts into a single text
    text = "\n\n".join(text_rows)
    return text

def html_list_to_text(html_list):
    soup = BeautifulSoup(html_list, "html.parser")
    items = soup.find_all("li")
    text_items = []
    for item in items:
        item_text = item.get_text(strip=True)
        text_items.append(f"- {item_text}")
    text = "\n".join(text_items)
    return text

def process_html_document(html_document):
    soup = BeautifulSoup(html_document, "html.parser")

    # Replace tables with text using html_table_to_text
    for table in soup.find_all("table"):
        table_text = html_table_to_text(str(table))
        table.replace_with(BeautifulSoup(table_text, "html.parser"))

    # Replace lists with text using html_list_to_text
    for ul in soup.find_all("ul"):
        ul_text = html_list_to_text(str(ul))
        ul.replace_with(BeautifulSoup(ul_text, "html.parser"))
    for ol in soup.find_all("ol"):
        ol_text = html_list_to_text(str(ol))
        ol.replace_with(BeautifulSoup(ol_text, "html.parser"))

    # Replace all variants of <br> tags with newlines
    br_tags = re.compile('<br>|<br/>|<br />')
    html_with_newlines = br_tags.sub('\n', str(soup))

    # Strip remaining HTML tags to isolate the text
    soup_with_newlines = BeautifulSoup(html_with_newlines, "html.parser")
    return soup_with_newlines.get_text()
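Before pointing this at real Confluence pages, you may want to sanity-check the extraction on a small hand-written snippet. Here is a minimal sketch; the HTML below is just a made-up example:

sample_html = """
<h1>Team Handbook</h1>
<p>Welcome to the team.<br/>Please read the sections below.</p>
<ul><li>Set up your laptop</li><li>Request Confluence access</li></ul>
<table>
  <tr><th>Tool</th><th>Owner</th></tr>
  <tr><td>Confluence</td><td>IT</td></tr>
</table>
"""
print(process_html_document(sample_html))

The printed text should show the list items prefixed with “-” and the table row rendered as “header: value” sentences, which is the kind of roughly layout-preserving plain text we want to feed to BERTopic.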
In this final chapter, we’ll finally leverage BERTopic, a powerful topic modeling technique that builds on BERT embeddings. You can learn more about BERTopic in their GitHub repository and documentation.
Our approach to finding outliers consists of running BERTopic with different values for the number of topics. In each iteration, we’ll collect all documents that fall into the outlier cluster (-1). The more frequently a document lands in the -1 cluster, the more likely it is to be an outlier; this frequency forms the first component of our unrelatedness score. BERTopic also provides a probability value for documents in the -1 cluster; we’ll average those probabilities for each document over all the iterations, and this average forms the second component of our unrelatedness score. Finally, we’ll determine the overall unrelatedness score for each document by averaging the two components (frequency and probability). This combined score will help us identify the most unrelated documents in our dataset.
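To make the scoring concrete, here is a tiny worked example with made-up numbers; the documents and values are purely hypothetical, and the counts are normalized with min-max scaling, just as the script further down does:

# Hypothetical example: 6 BERTopic runs over three documents.
# Raw outlier counts per document:       A=6, B=3, C=0
# Average outlier probability per run:   A=0.9, B=0.4, C=0.0
# Min-max normalization of the counts:   A=1.0, B=0.5, C=0.0
# The unrelatedness score is the mean of the normalized count and the average probability:
score_a = (1.0 + 0.9) / 2  # 0.95 -> strong outlier candidate
score_b = (0.5 + 0.4) / 2  # 0.45
score_c = (0.0 + 0.0) / 2  # 0.00 -> clearly on-topic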
Here is the initial code:
import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.2)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Collect text and filenames from the files in the txt_files directory
documents = []
filenames = []
for file in os.listdir('txt_files'):
    if file.endswith('.txt'):
        with open(os.path.join('txt_files', file), 'r', encoding='utf-8') as f:
            documents.append(f.read())
            filenames.append(file)

In this code block, we set up the necessary tools for BERTopic by importing the required libraries and initializing the models. We define three models that will be used by BERTopic:
- vectorizer_model: the CountVectorizer model tokenizes the documents and creates a document-term matrix in which each entry represents the count of a term in a document. It also removes English stop words from the documents to improve topic modeling performance.
- representation_model: the MaximalMarginalRelevance (MMR) model diversifies the extracted topics by considering both the relevance and the diversity of topics. The diversity parameter controls the trade-off between these two aspects, with higher values leading to more diverse topics.
- ctfidf_model: the ClassTfidfTransformer model adjusts the term frequency-inverse document frequency (TF-IDF) scores of the document-term matrix to better represent topics. It reduces the impact of words that occur frequently across topics and sharpens the distinction between topics.

We then collect the text and filenames of the documents from the txt_files directory so we can process them with BERTopic in the next step.
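Before sweeping over different topic counts, it can be worth a quick sanity check that BERTopic finds sensible topics on your corpus at all. A minimal sketch, where the topic count of 8 is an arbitrary value chosen only for illustration:

# Fit a single BERTopic model just to eyeball the discovered topics
check_model = BERTopic(nr_topics=8, language="english",
                       ctfidf_model=ctfidf_model,
                       representation_model=representation_model,
                       vectorizer_model=vectorizer_model)
check_topics, _ = check_model.fit_transform(documents)

# Topic -1 is BERTopic's outlier cluster; the other rows should read like your Confluence spaces
print(check_model.get_topic_info())

If the topic keywords look like noise, it is usually worth revisiting the HTML cleaning step before trusting any outlier scores.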
def extract_topics(docs, n_topics):
    model = BERTopic(nr_topics=n_topics, calculate_probabilities=True, language="english",
                     ctfidf_model=ctfidf_model, representation_model=representation_model,
                     vectorizer_model=vectorizer_model)
    topics, probabilities = model.fit_transform(docs)
    return model, topics, probabilities

def find_outlier_topic(model):
    # Take the topic in the last row of the topic frequency table as the outlier topic
    topic_sizes = model.get_topic_freq()
    outlier_topic = topic_sizes.iloc[-1]["Topic"]
    return outlier_topic

outlier_counts = np.zeros(len(documents))
outlier_probs = np.zeros(len(documents))

# Define the range of topic counts you want to try
min_topics = 5
max_topics = 10

for n_topics in range(min_topics, max_topics + 1):
    model, topics, probabilities = extract_topics(documents, n_topics)
    outlier_topic = find_outlier_topic(model)
    for i, (topic, prob) in enumerate(zip(topics, probabilities)):
        if topic == outlier_topic:
            outlier_counts[i] += 1
            outlier_probs[i] += prob[outlier_topic]

In the section above, we use BERTopic to identify outlier documents by iterating through a range of topic counts, from a specified minimum to a maximum. For each topic count, BERTopic extracts the topics and their corresponding probabilities. It then identifies the outlier topic and updates outlier_counts and outlier_probs for the documents assigned to that topic. This process iteratively accumulates counts and probabilities, giving us a measure of how often and how ‘strongly’ documents are classified as outliers.

Finally, we can compute our unrelatedness score and print the results:
def normalize(arr):
    min_val, max_val = np.min(arr), np.max(arr)
    return (arr - min_val) / (max_val - min_val)

# Average the probabilities
avg_outlier_probs = np.divide(outlier_probs, outlier_counts, out=np.zeros_like(outlier_probs), where=outlier_counts != 0)

# Normalize the counts
normalized_counts = normalize(outlier_counts)

# Compute the combined unrelatedness score by averaging the normalized counts and probabilities
unrelatedness_scores = [(i, (count + prob) / 2) for i, (count, prob) in enumerate(zip(normalized_counts, avg_outlier_probs))]
unrelatedness_scores.sort(key=lambda x: x[1], reverse=True)

# Print the filtered results
for index, score in unrelatedness_scores:
    if score > 0:
        title = filenames[index]
        preview = documents[index][:100] + "..." if len(documents[index]) > 100 else documents[index]
        print(f"Title: {title}, Preview: {preview}, Unrelatedness: {score:.2f}")
        print("\n")

And that’s it! There you have it: your list of outlier documents ranked by unrelatedness. By cleaning up your Confluence spaces and removing irrelevant content, you can pave the way for a more efficient and valuable chatbot that leverages your organization’s knowledge. Happy cleaning!