Document Topic Extraction with Large Language Models (LLM) and the Latent Dirichlet Allocation (LDA) Algorithm
A guide on how to efficiently extract topics from large documents using Large Language Models (LLM) and the Latent Dirichlet Allocation (LDA) algorithm.


Introduction

I was developing a web application for chatting with PDF files, capable of processing large documents of above 1,000 pages. But before starting a conversation with a document, I wanted the application to give the user a brief summary of the main topics, so it would be easier to start the interaction.

One way to do this is by summarizing the document with LangChain, as shown in its documentation. The problem, however, is the high computational cost and, by extension, the monetary cost. A thousand-page document contains roughly 250,000 words, and every word needs to be fed into the LLM. More than that, the results must be further processed, as with the map-reduce method. A conservative estimate of the cost using gpt-3.5 Turbo with 4k context is above $1 per document, just for the summary. Even when using free resources, such as the Unofficial HuggingChat API, the sheer number of required API calls would be an abuse. So, I needed a different approach.
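For reference, here is a minimal sketch of the map-reduce summarization approach being compared against, assuming the classic (pre-0.1) LangChain API; the file path is hypothetical:

# A sketch of map-reduce summarization with LangChain; the file path is
# hypothetical, and the API shown is the classic (pre-0.1) LangChain one.
from langchain.llms import OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(openai_api_key="sk-...")
docs = PyPDFLoader("./large-document.pdf").load_and_split()  # roughly one chunk per page
chain = load_summarize_chain(llm, chain_type="map_reduce")   # one LLM call per chunk, plus reduce calls
summary = chain.run(docs)  # hundreds of API calls for a 1000-page document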

LDA to the Rescue

The Latent Dirichlet Allocation algorithm was a natural choice for this task. This algorithm takes a set of "documents" (in this context, a "document" refers to a piece of text) and returns, for each "document", a list of topics along with a list of words associated with each topic. What matters for our case is the list of words associated with each topic. These lists of words encode the content of the file, so they can be fed to the LLM to prompt for a summary. I recommend this article for a detailed explanation of the algorithm.
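To make the idea concrete, here is a toy sketch of what gensim's LDA returns; the mini "documents" and the printed output below are made up for illustration:

# Toy LDA example; the mini "documents" are made up for illustration
from gensim import corpora
from gensim.models import LdaModel

docs = [["cat", "dog", "pet", "food"],
        ["stock", "market", "price", "trade"],
        ["dog", "pet", "vet", "food"],
        ["price", "market", "stock", "bank"]]
dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per document
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics(num_words=4))
# Something like: [(0, '0.19*"pet" + 0.19*"dog" + ...'), (1, '0.18*"market" + ...')]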

There are two key considerations to address before we can get a high-quality result: choosing the hyperparameters for the LDA algorithm and determining the format of the output. The most important hyperparameter to consider is the number of topics, as it has the most significant impact on the final result. As for the format of the output, one that worked pretty well is the nested bulleted list. In this format, each topic is represented as a bulleted list with subentries that further describe the topic. As for why this works, I think that, by using this format, the model can focus on extracting content from the lists without the complexity of articulating paragraphs with connectors and relationships.

Implementation

I implemented the code in Google Colab. The main libraries were gensim for LDA, pypdf for PDF processing, nltk for word processing, and LangChain for its prompt templates and its interface with the OpenAI API.

import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pypdf import PdfReader
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain.llms import OpenAI

Next, I defined a utility function, preprocess, to help in processing the input text. It removes stop words and short tokens.

def preprocess(text, stop_words):
    """
    Tokenizes and preprocesses the input text, removing stopwords and short
    tokens.

    Parameters:
        text (str): The input text to preprocess.
        stop_words (set): A set of stopwords to be removed from the text.

    Returns:
        list: A list of preprocessed tokens.
    """
    result = []
    # simple_preprocess lowercases and tokenizes; deacc=True strips accents
    for token in simple_preprocess(text, deacc=True):
        # Keep only informative tokens: not a stopword and longer than 3 characters
        if token not in stop_words and len(token) > 3:
            result.append(token)
    return result
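A quick sanity check of preprocess on a made-up sentence, continuing in the same notebook session (the expected output assumes NLTK's English stopword list):

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print(preprocess("The metamorphosis of Gregor Samsa was very sudden", stop_words))
# ['metamorphosis', 'gregor', 'samsa', 'sudden']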

The second function, get_topic_lists_from_pdf, implements the LDA portion of the code. It accepts the path to the PDF file, the number of topics, and the number of words per topic, and it returns a list. Each element of this list contains a list of words associated with one topic. Here, we consider each page of the PDF file to be a "document".

def get_topic_lists_from_pdf(file, num_topics, words_per_topic):
    """
    Extracts topics and their associated words from a PDF document using the
    Latent Dirichlet Allocation (LDA) algorithm.

    Parameters:
        file (str): The path to the PDF file for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.

    Returns:
        list: A list of num_topics sublists, each containing relevant words
        for a topic.
    """
    # Load the PDF file
    loader = PdfReader(file)

    # Extract the text from each page into a list. Each page is considered a document
    documents = []
    for page in loader.pages:
        documents.append(page.extract_text())

    # Preprocess the documents
    nltk.download('stopwords')
    stop_words = set(stopwords.words(['english', 'spanish']))
    processed_documents = [preprocess(doc, stop_words) for doc in documents]

    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary(processed_documents)
    corpus = [dictionary.doc2bow(doc) for doc in processed_documents]

    # Build the LDA model
    lda_model = LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=15
    )

    # Retrieve the topics and their corresponding words
    topics = lda_model.print_topics(num_words=words_per_topic)

    # Each topic string looks like '0.02*"word" + ...'; keep only the words
    topics_ls = []
    for topic in topics:
        words = topic[1].split("+")
        topic_words = [word.split("*")[1].replace('"', '').strip() for word in words]
        topics_ls.append(topic_words)

    return topics_ls
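As a quick check of the returned structure (the file path is hypothetical here, and the printed words are illustrative, not actual output):

topic_lists = get_topic_lists_from_pdf("./the-metamorphosis.pdf",
                                       num_topics=2, words_per_topic=5)
print(topic_lists)
# Illustrative shape of the result (actual words depend on the run):
# [['gregor', 'room', 'door', 'sister', 'father'],
#  ['family', 'mother', 'work', 'lodgers', 'charwoman']]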

The following function, topics_from_pdf, invokes the LLM. As stated earlier, the model is prompted to format the output as a nested bulleted list.

def topics_from_pdf(llm, file, num_topics, words_per_topic):
    """
    Generates descriptive prompts for the LLM based on topic words extracted
    from a PDF document.

    This function takes the output of the `get_topic_lists_from_pdf` function,
    which consists of a list of topic-related words for each topic, and
    generates an output string in table-of-contents format.

    Parameters:
        llm (LLM): An instance of the Large Language Model (LLM) for
        generating responses.
        file (str): The path to the PDF file for extracting topic-related
        words.
        num_topics (int): The number of topics to consider.
        words_per_topic (int): The number of words per topic to include.

    Returns:
        str: A response generated by the language model based on the provided
        topic words.
    """

    # Extract topics and convert them to a string, one list per line
    list_of_topicwords = get_topic_lists_from_pdf(file, num_topics,
                                                  words_per_topic)
    string_lda = ""
    for topic_words in list_of_topicwords:
        string_lda += str(topic_words) + "\n"

    # Create the template
    template_string = '''Describe the topic of each of the {num_topics}
    double-quote delimited lists in a simple sentence and also write down
    three possible different subthemes. The lists are the result of an
    algorithm for topic discovery.
    Do not provide an introduction or a conclusion, only describe the
    topics. Do not mention the word "topic" when describing the topics.
    Use the following template for the response.

    1: <<<(sentence describing the topic)>>>
    - <<<(Phrase describing the first subtheme)>>>
    - <<<(Phrase describing the second subtheme)>>>
    - <<<(Phrase describing the third subtheme)>>>

    2: <<<(sentence describing the topic)>>>
    - <<<(Phrase describing the first subtheme)>>>
    - <<<(Phrase describing the second subtheme)>>>
    - <<<(Phrase describing the third subtheme)>>>

    ...

    n: <<<(sentence describing the topic)>>>
    - <<<(Phrase describing the first subtheme)>>>
    - <<<(Phrase describing the second subtheme)>>>
    - <<<(Phrase describing the third subtheme)>>>

    Lists: """{string_lda}""" '''

    # LLM call
    prompt_template = ChatPromptTemplate.from_template(template_string)
    chain = LLMChain(llm=llm, prompt=prompt_template)
    response = chain.run({
        "string_lda": string_lda,
        "num_topics": num_topics
    })

    return response

In the previous function, the list of words is converted into a string. Then, a prompt is created using the ChatPromptTemplate object from LangChain; note that the prompt defines the structure of the response. Finally, the function calls the gpt-3.5 Turbo model. The return value is the response given by the LLM.

Now, it's time to call the functions. We first set the API key. This article offers instructions on how to get one.

openai_key = "sk-p..."
llm = OpenAI(openai_api_key=openai_key, max_tokens=-1)

Next, we call the topics_from_pdf function. I chose the values for the number of topics and the number of words per topic. Also, I selected a public domain book, The Metamorphosis by Franz Kafka, for testing. The document is stored in my personal drive and is downloaded using the gdown library.

!gdown https://drive.google.com/uc?id=1mpXUmuLGzkVEqsTicQvBPcpPJW0aPqdL

file = "./the-metamorphosis.pdf"
num_topics = 6
words_per_topic = 30

summary = topics_from_pdf(llm, file, num_topics, words_per_topic)
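To display the response, a plain print is enough (the exact wording will vary from run to run, since the LLM is not deterministic):

print(summary)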

The result is displayed below:

1: Exploring the transformation of Gregor Samsa and the consequences on his family and lodgers
- Understanding Gregor's metamorphosis
- Examining the reactions of Gregor's family and lodgers
- Analyzing the impact of Gregor's transformation on his family

2: Examining the events surrounding the discovery of Gregor's transformation
- Investigating the initial reactions of Gregor's family and lodgers
- Analyzing the behavior of Gregor's family and lodgers
- Exploring the physical changes in Gregor's environment

3: Analyzing the pressures placed on Gregor's family as a result of his transformation
- Examining the financial strain on Gregor's family
- Investigating the emotional and psychological effects on Gregor's family
- Examining the changes in family dynamics as a result of Gregor's metamorphosis

4: Examining the implications of Gregor's transformation
- Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Investigating the emotional and psychological effects on Gregor's family

5: Exploring the impact of Gregor's transformation on his family
- Analyzing the financial strain on Gregor's family
- Examining the changes in family dynamics as a result of Gregor's metamorphosis
- Investigating the emotional and psychological effects on Gregor's family

6: Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Examining the implications of Gregor's transformation
- Exploring the impact of Gregor's transformation on his family

The output is pretty decent, and it took just seconds! It correctly extracted the main ideas from the book.

This approach works with technical books as well. For example, The Foundations of Geometry by David Hilbert (1899) (also in the public domain):

1: Analyzing the properties of geometric shapes and their relationships
- Exploring the axioms of geometry
- Analyzing the congruence of angles and features
- Investigating theorems of geometry

2: Studying the behavior of rational functions and algebraic equations
- Examining the straight lines and points of a problem
- Investigating the coefficients of a function
- Examining the development of a definite integral

3: Investigating the properties of a number system
- Exploring the domain of a real group
- Analyzing the theory of equal segments
- Examining the circle of arbitrary displacement

4: Examining the area of geometric shapes
- Analyzing the parallel lines and points
- Investigating the content of a triangle
- Examining the measures of a polygon

5: Examining the theorems of algebraic geometry
- Exploring the congruence of segments
- Analyzing the system of multiplication
- Investigating the valid theorems of a decision

6: Investigating the properties of a figure
- Examining the parallel lines of a triangle
- Analyzing the equation of joining sides
- Examining the intersection of segments

Conclusion

Combining the LDA algorithm with an LLM for large-document topic extraction produces good results while significantly reducing both cost and processing time. We've gone from hundreds of API calls to just one, and from minutes to seconds.

The quality of the output depends greatly on its format. In this case, a nested bulleted list worked just fine. Also, the number of topics and the number of words per topic are important for the quality of the result. I recommend trying different prompts, numbers of topics, and numbers of words per topic to find what works best for a given document.

The code can be found at this link.
