Build a keyword analysis Python application comprising a frontend user interface and backend pipeline
As the amount of textual data from sources like social media, customer reviews, and online platforms grows exponentially, we need a way to make sense of this unstructured data.
Keyword extraction and analysis are powerful natural language processing (NLP) techniques that help us do exactly that.
Keyword extraction involves automatically identifying and extracting the most relevant words from a given text, while keyword analysis involves analyzing the keywords to gain insights into the underlying patterns.
In this step-by-step guide, we explore building a keyword extraction and analysis pipeline and web app on arXiv abstracts using the powerful tools of KeyBERT and Taipy.
Contents
(1) Context
(2) Tools Overview
(3) Step-by-Step Guide
(4) Wrapping it up
Here is the accompanying GitHub repo for this article.
Given the rapid progress in artificial intelligence (AI) and machine learning research, keeping track of the numerous papers published each day can be difficult.
For such research, arXiv is undoubtedly one of the leading sources of information. arXiv (pronounced ‘archive’) is an open-access archive hosting a vast collection of scientific papers covering various disciplines like computer science, mathematics, and more.
One of the key features of arXiv is that it provides abstracts for every paper uploaded to its platform. These abstracts are an ideal data source as they’re concise, rich in technical vocabulary, and contain domain-specific terminology.
Hence, we’ll use the latest batches of arXiv abstracts as the text data for this project.
The goal is to create a web application (comprising a frontend interface and backend pipeline) where users can view the keywords and key phrases of arXiv abstracts based on specific input values.
There are three main tools that we will use in this project:
- arXiv API Python wrapper
- KeyBERT
- Taipy
(i) arXiv API Python wrapper
The arXiv website offers public API access to maximize its openness and interoperability. For instance, to retrieve the text abstracts as part of our Python workflow, we can use the Python wrapper for the arXiv API.
The arXiv API Python wrapper provides a set of functions for searching the database for papers that match specific criteria, such as author, keyword, category, and more.
It also lets users retrieve detailed metadata about each paper, such as the title, abstract, authors, and publication date.
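To give a feel for what the wrapper does on our behalf, here is a minimal sketch that builds a raw arXiv API request URL using only the standard library. The function name build_arxiv_query_url and the example query are our own illustrations and not part of the wrapper; the endpoint and parameter names are those of the public arXiv API.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (the wrapper talks to this on our behalf).
ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(query: str, max_results: int = 10) -> str:
    """Build a request URL asking for the newest papers matching `query`.

    Hypothetical helper for illustration only.
    """
    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",    # newest submissions first
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


url = build_arxiv_query_url("all:keyword extraction", max_results=5)
print(url)
```

The response to such a request is an Atom XML feed, which is exactly the kind of detail the Python wrapper hides by returning ready-to-use result objects.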
(ii) KeyBERT
KeyBERT (from the terms ‘keyword’ and ‘BERT’) is a Python library that provides an easy-to-use interface for using BERT embeddings and cosine similarity to extract the words in a document that are most representative of the document itself.
The biggest strength of KeyBERT is its flexibility. It allows users to easily modify the underlying settings (e.g., parameters, embeddings, tokenization) to experiment and fine-tune the keywords obtained.
In this project, we will be tuning the following set of parameters:
- Number of top keywords to be returned
- Word n-gram range (i.e., minimum and maximum n-gram length)
- Diversification algorithm (Max Sum Distance or Maximal Marginal Relevance) that determines how the similarity of extracted keywords is defined
- Number of candidates (if Max Sum Distance is set)
- Diversity value (if Maximal Marginal Relevance is set)
Both diversification algorithms (Max Sum Distance and Maximal Marginal Relevance) share the same basic idea of balancing two objectives: retrieve results that are highly relevant to the query yet diverse in content, so as to avoid redundancy among them.
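To make this trade-off concrete, here is a small, self-contained sketch of the Maximal Marginal Relevance idea on toy two-dimensional vectors. It is an illustration under our own assumptions (hand-made vectors and a hypothetical mmr helper), not KeyBERT’s actual code, which operates on real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.5):
    """Pick `top_n` candidates, trading relevance to the document
    against similarity to the candidates already picked."""
    relevance = [cosine(doc_vec, v) for v in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(candidates)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in selected]

        def score(i):
            # Redundancy = closest match among the already selected items.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return [candidates[i] for i in selected]


# Toy data: two near-duplicate phrases and one different phrase.
doc = [1.0, 1.0]
names = ["keyword extraction", "keyword extractor", "web app"]
vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

print(mmr(doc, vecs, names, top_n=2, diversity=0.0))
# ['keyword extraction', 'keyword extractor']
print(mmr(doc, vecs, names, top_n=2, diversity=0.7))
# ['keyword extraction', 'web app']
```

With a diversity of 0, the two near-duplicates win purely on relevance; raising the diversity value penalizes the second near-duplicate and lets the less similar phrase through.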
(iii) Taipy
Taipy is an open-source Python application builder that lets developers and data scientists quickly turn data and machine learning algorithms into complete web applications.
While designed to be a low-code library, Taipy also provides a high level of user customization. Therefore, it’s well-suited for wide-ranging use cases, from simple dashboarding to production-ready industrial applications.
There are two key components of Taipy: Taipy GUI and Taipy Core.
- Taipy GUI: A simple graphical user interface builder enabling us to easily create an interactive frontend app interface.
- Taipy Core: A modern backend framework that lets us efficiently build and execute pipelines and scenarios.
While we can use Taipy GUI or Taipy Core independently, combining both allows us to build powerful applications efficiently.
As mentioned earlier in the Context section, we’ll build a web app that extracts and analyzes keywords of selected arXiv abstracts.
The following diagram illustrates how the data and tools are integrated.
Let us start with the steps to create the above pipeline and web application in Python.
We start by pip installing the necessary Python libraries, with the corresponding versions shown below:
As numerous parameters will be used, saving them in a separate configuration file is ideal. The following YAML file, config.yml, contains the initial set of configuration parameter values.
With the configuration file set up, we can then easily import these parameter values into our other Python scripts with the following code:
    import yaml

    # Load the configuration parameters into a dictionary
    with open('config.yml') as f:
        cfg = yaml.safe_load(f)
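Once loaded, the configuration is an ordinary Python dictionary, so it can be sanity-checked before the pipeline runs. The sketch below uses a hand-written dictionary as a stand-in for the loaded file; every key name in it is hypothetical and chosen only to mirror the KeyBERT parameters listed earlier, so adapt it to the keys your own config.yml defines.

```python
# Stand-in for the dict returned by yaml.safe_load(); all keys are hypothetical.
cfg = {
    "QUERY": "artificial intelligence",
    "MAX_ABSTRACTS": 50,
    "TOP_N": 5,
    "NGRAM_MIN": 1,
    "NGRAM_MAX": 2,
    "DIVERSITY_ALGO": "mmr",   # "mmr" or "maxsum"
    "DIVERSITY": 0.5,          # only used with "mmr"
    "NR_CANDIDATES": 20,       # only used with "maxsum"
}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    if cfg["NGRAM_MIN"] > cfg["NGRAM_MAX"]:
        problems.append("NGRAM_MIN must not exceed NGRAM_MAX")
    if cfg["DIVERSITY_ALGO"] not in {"mmr", "maxsum"}:
        problems.append("DIVERSITY_ALGO must be 'mmr' or 'maxsum'")
    if not 0.0 <= cfg["DIVERSITY"] <= 1.0:
        problems.append("DIVERSITY must lie between 0 and 1")
    if cfg["NR_CANDIDATES"] < cfg["TOP_N"]:
        problems.append("NR_CANDIDATES should be at least TOP_N")
    return problems


print(validate_config(cfg))  # []
```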
In this step, we’ll create a series of Python functions that form vital components of the pipeline. We create a new Python file, functions.py, to store these functions.
(3.1) Retrieve and Save arXiv Abstracts and Metadata
The first function to add to functions.py is one for retrieving text abstracts from the arXiv database using the arXiv API Python wrapper.
Next, we write a function to store the abstract texts and corresponding metadata in a pandas DataFrame.
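As a rough sketch of this second step, the function below flattens result objects into a list of row dictionaries. To keep it runnable without network access, it uses a small FakeResult stand-in of our own that exposes the same attribute names as the wrapper’s result objects (entry_id, title, summary, published); the function name and the output column names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FakeResult:
    """Stand-in for a wrapper result object (illustration only)."""
    entry_id: str
    title: str
    summary: str
    published: datetime


def results_to_rows(results) -> list[dict]:
    """Turn result objects into plain dictionaries, one per paper."""
    rows = []
    for r in results:
        rows.append({
            "entry_id": r.entry_id,
            "title": r.title,
            # Abstracts often contain hard line breaks; collapse them to spaces.
            "abstract": " ".join(r.summary.split()),
            "published": r.published,
        })
    return rows


sample = [
    FakeResult(
        entry_id="http://arxiv.org/abs/0000.00000v1",  # placeholder ID
        title="A Toy Paper",
        summary="Line one\nline two.",
        published=datetime(2023, 4, 1, tzinfo=timezone.utc),
    )
]
rows = results_to_rows(sample)
print(rows[0]["abstract"])  # Line one line two.
```

A list of dictionaries like this can be passed directly to pd.DataFrame(rows) to obtain the DataFrame used in the rest of the pipeline.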
(3.2) Process Data
For the data processing step, we have the following function to parse the abstract publication date into the appropriate format while creating new empty columns to store keywords.
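A minimal sketch of such a processing step, written against plain dictionaries so that it runs without pandas, might look as follows. The column names, the date format, and the function name are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical names for the columns that will later hold the keywords.
KEYWORD_COLUMNS = ["keywords", "keywords_and_scores"]


def process_rows(rows: list[dict], date_format: str = "%Y-%m-%d") -> list[dict]:
    """Format the publication date and add empty keyword columns."""
    processed = []
    for row in rows:
        new_row = dict(row)  # copy, so the raw rows stay untouched
        new_row["published"] = row["published"].strftime(date_format)
        for col in KEYWORD_COLUMNS:
            new_row[col] = None  # filled in by the keyword extraction step
        processed.append(new_row)
    return processed


raw = [{"title": "A Toy Paper",
        "published": datetime(2023, 4, 1, 12, 30, tzinfo=timezone.utc)}]
processed = process_rows(raw)
print(processed[0]["published"])  # 2023-04-01
```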
(3.3) Run KeyBERT
We next create a function to run the KeyBERT class from the KeyBERT library. The KeyBERT class is a minimal method for keyword extraction with BERT and is the easiest way for us to get started.
There are many different methods for generating the BERT embeddings (e.g., Flair, Hugging Face Transformers, and spaCy). In this case, we’ll use sentence-transformers, as recommended by the KeyBERT creator.
In particular, we’ll use the default all-MiniLM-L6-v2 model because it provides a good balance of speed and quality.
The following function extracts the keywords from each abstract iteratively and saves them in the new DataFrame columns created in the previous step.
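As a hedged sketch of how the tuning parameters could be mapped onto KeyBERT’s extract_keywords arguments, consider the two helpers below. The helper names and the "mmr"/"maxsum" labels are our own assumptions; the keyword arguments (keyphrase_ngram_range, stop_words, top_n, use_maxsum, nr_candidates, use_mmr, diversity) are those of KeyBERT’s extract_keywords method. KeyBERT is imported lazily, so the first helper can be tried without the library installed.

```python
def build_extract_kwargs(top_n, ngram_min, ngram_max, diversity_algo,
                         nr_candidates=20, diversity=0.5) -> dict:
    """Translate app parameters into keyword arguments for extract_keywords."""
    kwargs = {
        "keyphrase_ngram_range": (ngram_min, ngram_max),
        "stop_words": "english",
        "top_n": top_n,
    }
    if diversity_algo == "maxsum":
        kwargs.update(use_maxsum=True, nr_candidates=nr_candidates)
    elif diversity_algo == "mmr":
        kwargs.update(use_mmr=True, diversity=diversity)
    else:
        raise ValueError(f"Unknown diversification algorithm: {diversity_algo}")
    return kwargs


def extract_keywords(abstracts, **params):
    """Return one list of (keyword, score) tuples per abstract."""
    from keybert import KeyBERT  # lazy import; requires `pip install keybert`

    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    kwargs = build_extract_kwargs(**params)
    return [kw_model.extract_keywords(text, **kwargs) for text in abstracts]


print(build_extract_kwargs(5, 1, 2, "mmr", diversity=0.3))
```

Keeping the parameter translation in its own function makes it easy to check the mapping in isolation before paying the cost of loading the embedding model.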
(3.4) Get Keywords Value Counts
Finally, we create a function that generates a value count of the keywords so that we can plot the keyword frequencies in a chart later.
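The counting itself can be sketched with the standard library’s Counter. The function name and the lowercasing step are assumptions on our part; with pandas, a similar result could be obtained from a value count on a flattened keyword column.

```python
from collections import Counter


def keyword_value_counts(keyword_lists):
    """Count how often each keyword appears across all abstracts.

    Returns (keyword, count) pairs sorted from most to least frequent.
    """
    counts = Counter(kw.lower() for kws in keyword_lists for kw in kws)
    return counts.most_common()


keywords_per_abstract = [
    ["language model", "transformer"],
    ["transformer", "attention"],
    ["Transformer"],
]
value_counts = keyword_value_counts(keywords_per_abstract)
print(value_counts[0])  # ('transformer', 3)
```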
To orchestrate and link the backend pipeline flow, we’ll leverage the capabilities of Taipy Core.
Taipy Core offers an open-source framework to create, manage, and execute our data pipelines easily and efficiently. It has four fundamental concepts: Data Nodes, Tasks, Pipelines, and Scenarios.
To set up the backend, we’ll use configuration objects (from the Config class) to model and define the characteristics and desired behavior of the abovementioned concepts.
(4.1) Data Nodes
As with most data science projects, we start by handling the data. In Taipy Core, we use Data Nodes to define the data we’ll work with.
We can think of Data Nodes as Taipy’s representation of data variables. However, instead of storing the data directly, Data Nodes contain a set of instructions on how to retrieve the data needed.
Data Nodes can read and write a wide variety of data types, such as Python objects (e.g., str, int, list, dict, DataFrame, etc.), Pickle files, CSVs, SQL databases, and more.
Using the Config.configure_data_node() function, we define the Data Nodes for the keyword parameters based on the values from the configuration file in Step 2.
The id parameter sets the name of the Data Node, while the default_data parameter defines the default values.
We next include the configuration objects for the five sets of data along the pipeline, as illustrated below:
The following code defines the five configuration objects:
(4.2) Tasks
Tasks in Taipy can be thought of as Python functions. We define the configuration object for a Task using Config.configure_task().
We need to set up five Task configuration objects corresponding to the five functions built in Step 3.
The input and output parameters refer to the input and output Data Nodes, respectively.
For instance, in task_process_data_cfg, the input is the Data Node for the raw pandas DataFrame containing the arXiv search results, while the output is the Data Node for the DataFrame storing processed data.
The skippable parameter, when set to True, indicates that the Task can be skipped if no changes have been made to the inputs.
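To build some intuition for what skipping means, here is a toy stand-in written in plain Python. It is emphatically not Taipy’s implementation (the ToyTask class and its attributes are invented for this illustration); it only mimics the observable behavior of re-using the previous output when the inputs have not changed.

```python
class ToyTask:
    """Toy model of a skippable task; not part of Taipy."""

    def __init__(self, function, skippable=False):
        self.function = function
        self.skippable = skippable
        self.run_count = 0          # how many times the function really ran
        self._last_inputs = None
        self._last_output = None

    def run(self, *inputs):
        unchanged = self.run_count > 0 and inputs == self._last_inputs
        if self.skippable and unchanged:
            return self._last_output  # inputs unchanged: skip the work
        self._last_inputs = inputs
        self._last_output = self.function(*inputs)
        self.run_count += 1
        return self._last_output


task = ToyTask(lambda text: text.upper(), skippable=True)
task.run("keybert")   # executed
task.run("keybert")   # skipped, same input as before
task.run("taipy")     # executed again, input changed
print(task.run_count)  # 2
```

In our app, this kind of behavior is what saves us from re-fetching arXiv abstracts when the user changes only a KeyBERT parameter.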
Here is the flowchart of the Data Nodes and Tasks we have defined so far:
(4.3) Pipelines
A Pipeline is a series of Tasks that will be executed automatically by Taipy. It’s a configuration object comprising a sequence of Task configuration objects.
In this case, we’ll allocate the five Tasks into two Pipelines (one for data preparation and one for keyword analysis) as illustrated below:
We use the following code to define our two Pipeline configs:
As with all configuration objects, we assign a name to these Pipeline configurations using the id parameter.
(4.4) Scenarios
In this project, we aim to create an application that reflects the updated set of keywords (and corresponding analysis) based on changes made to input parameters (e.g., n-gram length).
For that to happen, we leverage the powerful concept of Scenarios. Taipy Scenarios provide the framework for running Pipelines under different conditions, such as when the user modifies the input parameters or data.
Scenarios also allow us to save the outputs from the different inputs for easy comparison within the same app interface.
Since we expect to do a simple sequential run of the Pipelines, we can place both Pipeline configs into a single Scenario configuration object.
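The idea of running one pipeline under several sets of inputs and keeping each result for comparison can be sketched in a few lines of plain Python. Everything here (the toy_pipeline function, the scenarios registry, the parameter names) is a made-up illustration of the concept and not Taipy’s API.

```python
def toy_pipeline(params: dict) -> list[str]:
    """Hypothetical pipeline: return the first `top_n` n-grams of a fixed text."""
    words = "keyword extraction with bert embeddings".split()
    n = params["ngram"]
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams[: params["top_n"]]


# Registry of executed scenarios: name -> inputs and outputs.
scenarios = {}


def run_scenario(name: str, params: dict) -> list[str]:
    """Run the pipeline with `params` and remember both inputs and output."""
    scenarios[name] = {"params": dict(params), "output": toy_pipeline(params)}
    return scenarios[name]["output"]


run_scenario("unigrams", {"ngram": 1, "top_n": 2})
run_scenario("bigrams", {"ngram": 2, "top_n": 2})

for name, record in scenarios.items():
    print(name, record["params"], record["output"])
```

Because each run is stored alongside its inputs, earlier results remain available for side-by-side comparison after the parameters change.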
Let us now switch gears and explore the frontend aspects of our application. Taipy GUI provides Python classes that make it easy to create powerful web app interfaces with text and graphical elements.
Pages are the basis of the user interface, and they hold text, images, or controls that display information in the application through visual elements.
There are two pages to create: (i) a keyword analysis dashboard page and (ii) a data viewer page to display the keywords DataFrame.
(5.1) Data Viewer
Taipy GUI can be considered an augmented Markdown, meaning we can use Markdown syntax to build our frontend interface.
We start with the simple frontend page displaying the DataFrame of the extracted arXiv abstract data. The page is set up in a Python script (named data_viewer_md.py), with the Markdown stored in a variable (called data_page).
The basic syntax for creating Taipy constructs in Markdown is to use text fragments in the generic format <|...|...|>.
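As a minimal sketch of this syntax, a page holding a single table element can be written as a plain Python string. The heading text is a made-up example; df is the DataFrame variable that Taipy binds to the table when the page is rendered.

```python
# Markdown for a page with one heading and one table element.
# `{df}` is resolved by Taipy GUI at render time, not by Python here.
data_page = """
# Extracted arXiv Abstracts

<|{df}|table|>
"""

print(data_page)
```

Such a string is typically handed to Taipy GUI, for example via Gui(page=data_page).run(), which requires the taipy package and a df variable in scope.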
In the above Markdown, we pass our DataFrame object df together with table, which indicates a table element. With just these few lines of code, we get an output like the following:
(5.2) Keyword Analysis Dashboard
We now move to the main dashboard page of the application, where we can make changes to the parameters and visualize the keywords obtained. The visual elements will be contained in a Python script (named analysis_md.py).
This page has many components, so let’s take it one step at a time. First, we instantiate the parameter values upon the loading of the application.
Next, we define the input segment of the page where users can make changes to parameters and scenarios. This segment will be saved in a variable called input_page, and will eventually look like this:
We create a seven-column layout in the Markdown so that the input fields (e.g., text input, number input, dropdown menu selector) and buttons can be organized neatly.
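To illustrate what such a layout can look like, here is a hedged sketch of a seven-column block using Taipy’s layout, input, number, selector, and button elements. All variable names, labels, and callback names are hypothetical placeholders rather than the ones used in the app; only the element syntax is the point.

```python
# Seven equally weighted columns, one control per column.
# Variable, label, and callback names are placeholders for illustration.
input_page_sketch = """
<|layout|columns=1 1 1 1 1 1 1|
<|{query}|input|label=Query|>

<|{top_n}|number|label=Top N keywords|on_change=on_param_change|>

<|{ngram_min}|number|label=Min n-gram|>

<|{ngram_max}|number|label=Max n-gram|>

<|{diversity_algo}|selector|lov=mmr;maxsum|dropdown|>

<|{diversity}|number|label=Diversity|>

<|Submit|button|on_action=on_submit|>
|>
"""

# Quick check that the layout declares seven columns.
column_spec = input_page_sketch.split("columns=")[1].split("|")[0]
print(len(column_spec.split()))  # 7
```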
We’ll explain the callback functions in the on_change and on_action parameters of the elements above later, so there is no need to worry about them for now.
After that, we define the output segment, where the frequency table and chart of the keywords based on the input parameters will be displayed.
We’ll define the chart properties in addition to specifying the Markdown of the output segment in the variable output_page.
And in the last line above, we combine both input and output segments into a single variable called analysis_page.
(5.3) Main Landing Page
One last bit before our frontend interface is complete. Now that we have both pages ready, we can display them on our main landing page.
The main page is defined in important.py, which is the script that will be run when the application is launched. The aim is to create a functional menu bar on the main page for users to toggle between the pages.
From the above code, we can see the state functionality of Taipy in action, where the page is rendered based on the selected page in the session state.
At this point, our frontend interface and backend pipeline have been set up successfully. However, we have yet to link them together.
More specifically, we’ll need to create the Scenarios component so that variations in the input parameters are processed in the pipeline, and the output is reflected in the dashboard.
The added advantage of Scenarios is that each input-output set can be saved so that users can refer back to these previous configurations.
We’ll define four functions to set up the Scenarios component, which will be stored in the analysis_md.py script:
(6.1) Update Chart
This function updates the keywords DataFrame, frequency count table, and corresponding bar chart based on the input parameters of the selected Scenario stored in the session state.
(6.2) Submit Scenario
This function registers the updated set of input parameters the user has modified as a scenario and passes the values through the pipeline.
(6.3) Create Scenario
This function saves a scenario that has been executed so that it can be easily recreated and referred to again from the dropdown menu of created Scenarios.
(6.4) Synchronize GUI and Core
This function retrieves input parameters from a Scenario selected from the dropdown menu of saved Scenarios and displays the resulting output in the frontend GUI.
In the last step, we wrap up by completing the code in important.py so that Taipy launches and runs correctly when the script is executed.
The above code performs the following steps:
- Instantiate Taipy Core
- Set up scenario creation and execution
- Retrieve keywords DataFrame and frequency count table
- Launch Taipy GUI (with the required pages)
Finally, we can run python important.py in the command line, and the application we have built will be accessible on localhost:8020.
The keywords associated with a document offer concise and comprehensive indications of its subject matter, highlighting the most important themes, concepts, ideas, or arguments contained therein.
In this article, we explored how to extract and analyze keywords of arXiv abstracts using KeyBERT and Taipy. We also learned how to deliver these capabilities as a web application comprising a frontend user interface and a backend pipeline.
Feel free to check out the code in the accompanying GitHub repo.