arXiv Keyword Extraction and Evaluation Pipeline with KeyBERT and Taipy
(1) Context
(2) Tools Overview
(3) Step-by-Step Guide
Step 1 — Initial Setup
Step 2 — Setup Configuration File
Step 3 — Construct Functions
Step 4 — Setup Taipy Core: Backend Config
Step 5 — Setup Taipy GUI (Frontend)
Step 6 — Linking Backend and Frontend with Scenarios
Step 7 — Launching the Application
(4) Wrapping it up
Before you go

KeyBERT Taipy Kenneth Leung Data Science Machine Learning
Photo by Marylou Fortier on Unsplash

As the quantity of textual data from sources like social media, customer reviews, and online platforms grows exponentially, we need to be able to make sense of this unstructured data.

Keyword extraction and evaluation are powerful natural language processing (NLP) techniques that enable us to do that.

Keyword extraction involves automatically identifying and extracting the most relevant words from a given text, while keyword evaluation involves analyzing the keywords to gain insights into the underlying patterns.

In this step-by-step guide, we explore building a keyword extraction and evaluation pipeline and web app on arXiv abstracts using KeyBERT and Taipy.

Contents

(1) Context
(2) Tools Overview
(3) Step-by-Step Guide
(4) Wrapping it up

Here is the accompanying GitHub repo for this article.

(1) Context

Given the rapid progress in artificial intelligence (AI) and machine learning research, keeping track of the numerous papers published each day can be difficult.

When it comes to such research, arXiv is undoubtedly one of the leading sources of information. arXiv (pronounced ‘archive’) is an open-access archive hosting a vast collection of scientific papers covering various disciplines such as computer science, mathematics, and more.

arXiv screenshot | Image used under CC 2.0 license

One of the key features of arXiv is that it provides abstracts for every paper uploaded to its platform. These abstracts are an ideal data source as they are concise, rich in technical vocabulary, and contain domain-specific terminology.

Hence, we’ll use the latest batches of arXiv abstracts as the text data for this project.

The goal is to create a web application (comprising a frontend interface and backend pipeline) where users can view the keywords and key phrases of arXiv abstracts based on specific input values.

Screenshot of the finished application user interface | Image by author

(2) Tools Overview

There are three main tools that we will use in this project:

  • arXiv API Python wrapper
  • KeyBERT
  • Taipy

(i) arXiv API Python wrapper

The arXiv website offers public API access to maximise its openness and interoperability. For instance, to retrieve the text abstracts as part of our Python workflow, we can use the Python wrapper for the arXiv API.

The arXiv API Python wrapper provides a set of functions for searching the database for papers that match specific criteria, such as author, keyword, category, and more.

It also lets users retrieve detailed metadata about each paper, such as the title, abstract, authors, and publication date.

(ii) KeyBERT

KeyBERT (from the terms ‘keyword’ and ‘BERT’) is a Python library that provides an easy-to-use interface for using BERT embeddings and cosine similarity to extract the words in a document that are most representative of the document itself.

Illustration of how KeyBERT works | Image used under MIT License
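To make the idea concrete, here is a minimal sketch of the same ranking logic in plain Python. It is not KeyBERT's implementation: a bag-of-words count vector stands in for the BERT embedding (an assumption made purely so the example runs without any model), and the function names are hypothetical. What carries over is the structure: generate candidate n-grams, embed both the candidates and the document, and rank candidates by cosine similarity to the document.

```python
import math
import re
from collections import Counter


def candidate_ngrams(text, ngram_range=(1, 2)):
    """Return the unique word n-grams in `text` for every n in the range."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


def embed(text):
    """Toy embedding: word counts (a stand-in for a BERT sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_keywords(doc, top_n=3, ngram_range=(1, 2)):
    """Rank candidate n-grams by similarity to the whole document."""
    doc_vec = embed(doc)
    scored = [
        (phrase, cosine(embed(phrase), doc_vec))
        for phrase in candidate_ngrams(doc, ngram_range)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

With real sentence embeddings, candidates that are semantically close to the abstract score highly even when they share few exact words with it, which is something this toy word-count version cannot capture.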

The biggest strength of KeyBERT is its flexibility. It allows users to easily modify the underlying settings (e.g., parameters, embeddings, tokenization) to experiment and fine-tune the keywords obtained.

In this project, we will be tuning the following set of parameters:

  • Number of top keywords to be returned
  • Word n-gram range (i.e., minimum and maximum n-gram length)
  • Diversification algorithm (Max Sum Distance or Maximal Marginal Relevance) that determines how the similarity of extracted keywords is defined
  • Number of candidates (if Max Sum Distance is set)
  • Diversity value (if Maximal Marginal Relevance is set)

Both diversification algorithms (Max Sum Distance and Maximal Marginal Relevance) share the same basic idea of balancing two objectives: retrieving results that are highly relevant to the query and yet diverse in their content to avoid redundancy among one another.
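The relevance-versus-diversity trade-off can be illustrated with a small sketch of Maximal Marginal Relevance on toy 2-D vectors. This is a simplified illustration under our own assumptions (the function names and vectors are hypothetical, and it is not KeyBERT's code): the most relevant candidate is picked first, and each subsequent pick maximizes relevance to the document minus similarity to what has already been selected, weighted by the diversity value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Return indices of candidates chosen by Maximal Marginal Relevance."""
    if not cand_vecs:
        return []
    relevance = [cosine(vec, doc_vec) for vec in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cand_vecs)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]

        def score(i):
            # Penalize candidates that resemble anything already selected.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
doc = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.2], [0.5, 1.0]]
relevance_only = mmr(doc, candidates, top_n=2, diversity=0.0)
diverse = mmr(doc, candidates, top_n=2, diversity=0.7)
```

With a diversity of 0 the second pick is simply the next most relevant candidate (the near-duplicate), whereas a higher diversity value favors the candidate that differs from the first pick.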

(iii) Taipy

Taipy is an open-source Python application builder that lets developers and data scientists quickly turn data and machine learning algorithms into complete web applications.

While designed to be a low-code library, Taipy also provides a high level of user customization. Therefore, it is well suited to a wide range of use cases, from simple dashboards to production-ready industrial applications.

Taipy components | Image by author

There are two key components of Taipy: Taipy GUI and Taipy Core.

  • Taipy GUI: A simple graphical user interface builder that lets us easily create an interactive frontend app interface.
  • Taipy Core: A modern backend framework that lets us efficiently build and execute pipelines and scenarios.

While we can use Taipy GUI or Taipy Core independently, combining both allows us to build powerful applications efficiently.

(3) Step-by-Step Guide

As mentioned earlier in the Context section, we’ll build a web app that extracts and analyzes keywords of selected arXiv abstracts.

The following diagram illustrates how the data and tools are integrated.

Overview of project | Image by author

Let us start with the steps to create the above pipeline and web application in Python.

Step 1 — Initial Setup

We start by pip installing the necessary Python libraries with the corresponding versions shown below:

Step 2 — Setup Configuration File

As numerous parameters will be used, saving them in a separate configuration file is ideal. The following YAML file config.yml contains the initial set of configuration parameter values.

With the configuration file set up, we can then easily import these parameter values into our other Python scripts with the following code:

import yaml

with open('config.yml') as f:
    cfg = yaml.safe_load(f)

Step 3 — Construct Functions

In this step, we’ll create a series of Python functions that form the key components of the pipeline. We create a new Python file functions.py to store these functions.

(3.1) Retrieve and Save arXiv Abstracts and Metadata

The first function to add to functions.py is one for retrieving text abstracts from the arXiv database using the arXiv API Python wrapper.

Next, we write a function to store the abstract texts and corresponding metadata in a pandas DataFrame.
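As a rough sketch of these two functions (not necessarily the code in the accompanying repo), the example below assumes a recent release of the third-party arxiv package that exposes arxiv.Client, arxiv.Search, and arxiv.SortCriterion. The function names and the output column names are hypothetical. The package is imported lazily so that the row-building helper can be tried offline with a stand-in object.

```python
from datetime import datetime
from types import SimpleNamespace


def fetch_results(query, max_results=10):
    """Query arXiv for the latest papers (needs `arxiv` and network access)."""
    import arxiv  # third-party; imported here so the helper below runs without it

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )
    return list(arxiv.Client().results(search))


def results_to_rows(results):
    """Flatten result objects into dictionaries, one per paper."""
    rows = []
    for result in results:
        rows.append(
            {
                "title": result.title,
                # Abstracts come with hard line breaks; join them into one line.
                "abstract": result.summary.replace("\n", " "),
                "authors": ", ".join(author.name for author in result.authors),
                "date_published": result.published,
            }
        )
    return rows


# Offline demonstration with a stand-in shaped like an arXiv result.
sample = SimpleNamespace(
    title="A Sample Paper",
    summary="First line of the abstract\nsecond line.",
    authors=[SimpleNamespace(name="A. Author"), SimpleNamespace(name="B. Author")],
    published=datetime(2023, 5, 1, 12, 30),
)
rows = results_to_rows([sample])
```

A list of dictionaries like rows can then be passed to pandas.DataFrame(rows) to obtain the DataFrame described above.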

(3.2) Process Data

For the data processing step, we have the following function to parse the abstract publication date into the appropriate format while creating new empty columns to store keywords.
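A minimal sketch of this processing step is shown below. To keep it runnable without pandas, it works on a list of row dictionaries instead of a DataFrame, which is our simplification rather than the article's approach. The column names and the YYYY-MM-DD target format are assumptions.

```python
from datetime import datetime


def process_rows(rows, keyword_columns=("keywords", "keyword_scores")):
    """Format the publication date and add empty keyword columns."""
    processed = []
    for row in rows:
        new_row = dict(row)  # copy so the raw data stays untouched
        published = new_row["date_published"]
        if isinstance(published, str):
            published = datetime.fromisoformat(published)
        new_row["date_published"] = published.strftime("%Y-%m-%d")
        for column in keyword_columns:
            new_row[column] = ""  # placeholder to be filled by the KeyBERT step
        processed.append(new_row)
    return processed
```

With a DataFrame, the same step would typically convert the date column with pandas.to_datetime and assign empty strings to the new columns.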

(3.3) Run KeyBERT

We next create a function to run the KeyBERT class from the KeyBERT library. The KeyBERT class is a minimal method for keyword extraction with BERT and is the easiest way for us to get started.

There are many different methods for generating the BERT embeddings (e.g., Flair, Huggingface Transformers, and spaCy). In this case, we’ll use sentence-transformers as recommended by the KeyBERT creator.

In particular, we’ll use the default all-MiniLM-L6-v2 model as it provides a good balance of speed and quality.

The following function extracts the keywords from each abstract iteratively and saves them in the new DataFrame columns created in the previous step.
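The sketch below shows one way the tuned parameters could be mapped onto KeyBERT's extract_keywords arguments. The helper names and the "maxsum"/"mmr" labels are our own assumptions. The keyword arguments (keyphrase_ngram_range, stop_words, top_n, use_maxsum, nr_candidates, use_mmr, diversity) are those of KeyBERT's extract_keywords method. KeyBERT itself is imported lazily so the argument-building helper can be checked without downloading a model.

```python
def build_keybert_kwargs(top_n, ngram_min, ngram_max, diversity_algo,
                         nr_candidates=20, diversity=0.5):
    """Translate the app's parameters into extract_keywords arguments."""
    kwargs = {
        "keyphrase_ngram_range": (ngram_min, ngram_max),
        "stop_words": "english",
        "top_n": top_n,
    }
    if diversity_algo == "maxsum":
        kwargs.update(use_maxsum=True, nr_candidates=nr_candidates)
    elif diversity_algo == "mmr":
        kwargs.update(use_mmr=True, diversity=diversity)
    return kwargs


def extract_keywords_from_abstracts(abstracts, **params):
    """Run KeyBERT on each abstract (needs the third-party `keybert` package)."""
    from keybert import KeyBERT  # imported lazily; loads a sentence-transformers model

    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    kwargs = build_keybert_kwargs(**params)
    # Each call returns a list of (keyword, score) tuples for one abstract.
    return [kw_model.extract_keywords(doc, **kwargs) for doc in abstracts]
```

Only one diversification flag is set at a time, so the number of candidates is passed only for Max Sum Distance and the diversity value only for Maximal Marginal Relevance.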

(3.4) Get Keywords Value Counts

Finally, we create a function that generates a value count of the keywords so that we can plot the keyword frequencies in a chart later.
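A value count of this kind can be sketched with the standard library alone. The function name and the input shape (one list of keywords per abstract) are assumptions for illustration. With a DataFrame, pandas' value_counts would play the same role.

```python
from collections import Counter


def keyword_value_counts(keywords_per_abstract):
    """Count how often each keyword appears across all abstracts."""
    counts = Counter(
        keyword
        for keywords in keywords_per_abstract
        for keyword in keywords
    )
    # most_common() sorts by frequency, highest first.
    return counts.most_common()


frequencies = keyword_value_counts([["llm", "graph"], ["llm"], ["graph", "llm"]])
```

The resulting list of (keyword, count) pairs is already ordered for a frequency bar chart.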

Step 4 — Setup Taipy Core: Backend Config

To orchestrate and link the backend pipeline flow, we’ll leverage the capabilities of Taipy Core.

Taipy Core offers an open-source framework to create, manage, and execute our data pipelines easily and efficiently. It has four fundamental concepts: Data Nodes, Tasks, Pipelines, and Scenarios.

Four fundamental concepts in Taipy Core | Image by author

To set up the backend, we’ll use configuration objects (from the Config class) to model and define the characteristics and desired behavior of the abovementioned concepts.

(4.1) Data Nodes

As with most data science projects, we start by handling the data. In Taipy Core, we use Data Nodes to define the data we’ll work with.

We can think of Data Nodes as Taipy’s representation of data variables. However, instead of storing the data directly, Data Nodes contain a set of instructions on how to retrieve the data needed.

Data Nodes can read and write a wide range of data types, such as Python objects (e.g., str, int, list, dict, DataFrame, etc.), Pickle files, CSVs, SQL databases, and more.

Using the Config.configure_data_node() function, we define the Data Nodes for the keyword parameters based on the values from the configuration file in Step 2.

The id parameter sets the name of the Data Node, while the default_data parameter defines the default values.

We next include the configuration objects for the five sets of data along the pipeline, as illustrated below:

Illustration of five Data Nodes along the pipeline | Image by author

The next code defines the five configuration objects:

(4.2) Tasks

Tasks in Taipy can be thought of as Python functions. We can define the configuration object for Tasks using Config.configure_task().

We need to set up five Task configuration objects corresponding to the five functions built in Step 3.

Illustration of the five Tasks | Image by author

The input and output parameters refer to the input and output Data Nodes, respectively.

For instance, in task_process_data_cfg, the input is the Data Node for the raw pandas DataFrame containing the arXiv search results, while the output is the Data Node for the DataFrame storing processed data.

The skippable parameter, when set to True, indicates that the Task can be skipped if no changes have been made to the inputs.
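As a hedged sketch of how a Data Node and a Task configuration fit together, the example below uses Config.configure_data_node() and Config.configure_task() with the parameters described above. It assumes a Taipy release in which Config can be imported from the top-level taipy package. The Data Node ids, the default query value, and the placeholder process_data function are hypothetical and stand in for the real Step 3 code.

```python
def process_data(raw_df):
    """Placeholder for the Step 3 processing function."""
    return raw_df


def configure_backend(default_query="natural language processing"):
    """Build a few configuration objects (needs the third-party `taipy` package)."""
    from taipy import Config  # imported lazily; exact API may vary by Taipy version

    # Data Node holding an input parameter, with a default value.
    query_cfg = Config.configure_data_node(id="query", default_data=default_query)

    # Data Nodes for the data before and after processing.
    data_raw_cfg = Config.configure_data_node(id="data_raw")
    data_processed_cfg = Config.configure_data_node(id="data_processed")

    # Task linking the function to its input and output Data Nodes.
    task_process_data_cfg = Config.configure_task(
        id="task_process_data",
        function=process_data,
        input=data_raw_cfg,
        output=data_processed_cfg,
        skippable=True,  # skip re-running when the input has not changed
    )
    return query_cfg, task_process_data_cfg
```

The remaining Tasks would follow the same pattern, each pointing at the Data Nodes it reads from and writes to.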

Here is the flowchart of the Data Nodes and Tasks we have defined so far:

Data Nodes and Tasks flowchart | Image by author

(4.3) Pipelines

A Pipeline is a series of Tasks that will be executed automatically by Taipy. It’s a configuration object comprising a sequence of Task configuration objects.

In this case, we’ll allocate the five Tasks into two Pipelines (one for data preparation and one for keyword evaluation) as illustrated below:

Tasks within the two pipelines | Image by author

We use the following code to define our two Pipeline configs:

As with all configuration objects, we assign a name to these Pipeline configurations using the id parameter.

(4.4) Scenarios

In this project, we aim to create an application that reflects the updated set of keywords (and corresponding evaluation) based on changes made to input parameters (e.g., n-gram length).

For that to happen, we leverage the powerful concept of Scenarios. Taipy Scenarios provide the framework for running Pipelines under different conditions, such as when the user modifies the input parameters or data.

Scenarios also allow us to save the outputs from the different inputs for easy comparison within the same app interface.

Since we expect to do a straightforward sequential run of the Pipelines, we can place both Pipeline configs into a single Scenario configuration object.

Step 5 — Setup Taipy GUI (Frontend)

Let us now switch gears and explore the frontend aspects of our application. Taipy GUI provides Python classes that make it easy to create powerful web app interfaces with text and graphical elements.

Pages are the basis for the user interface, and they hold text, images, or controls that display information in the application through visual elements.

There are two pages to create: (i) a keyword evaluation dashboard page and (ii) a data viewer page to display the keywords DataFrame.

(5.1) Data Viewer

Taipy GUI can be thought of as an augmented Markdown, meaning we can use Markdown syntax to build our frontend interface.

We start with the simple frontend page displaying the DataFrame of the extracted arXiv abstract data. The page is set up in a Python script (named data_viewer_md.py), with the Markdown stored in a variable (called data_page).

The basic syntax for creating Taipy constructs in Markdown uses text fragments in the generic format of <|...|...|>.
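A minimal version of such a page might look like the sketch below. The heading text is a placeholder of our own, and df is assumed to be a DataFrame variable available to the GUI when the page is rendered.

```python
# data_viewer_md.py (sketch)
# The page is plain Markdown plus one Taipy visual element: the fragment
# <|{df}|table|> binds the variable `df` to a table control.
data_page = """
# Data Viewer

<|{df}|table|>
"""
```

The curly braces mark a bound variable, and the word after the first separator names the type of visual element to render.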

In the above Markdown, we pass our DataFrame object df together with table, which indicates a table element. With just these few lines of code, we get an output like the following:

Screenshot of the Data Viewer page | Image by author

(5.2) Keyword Evaluation Dashboard

We now move to the main dashboard page of the application, where we can make changes to the parameters and visualize the keywords obtained. The visual elements will be contained in a Python script (named analysis_md.py).

This page has numerous components, so let’s take it one step at a time. First, we instantiate the parameter values when the application loads.

Next, we define the input segment of the page where users can make changes to parameters and scenarios. This segment will be saved in a variable called input_page, and will eventually look like this:

Input segment of the Keyword Evaluation page | Image by author

We create a seven-column layout in the Markdown so that the input fields (e.g., text input, number input, dropdown menu selector) and buttons can be organized neatly.
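To show the shape of such a layout without reproducing the full page, here is a reduced three-column sketch. The bound variable names (query, top_n), the labels, and the submit_scenario callback name are assumptions. The real page uses seven columns and more controls.

```python
# analysis_md.py (reduced sketch of the input segment)
# `columns=1 1 1` requests three equally weighted columns; each inner
# <| ... |> part fills one column with a single control.
input_page = """
<|layout|columns=1 1 1|
<|
<|{query}|input|label=Search query|>
|>
<|
<|{top_n}|number|label=Top N keywords|>
|>
<|
<|Submit|button|on_action=submit_scenario|>
|>
|>
"""
```

Extending this to seven columns is a matter of listing seven weights in columns and adding one part per control.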

We’ll explain the callback functions in the on_change and on_action parameters of the elements above later, so there is no need to worry about them for now.

After that, we define the output segment, where the frequency table and chart of the keywords based on the input parameters will be displayed.

Output segment of the Keyword Evaluation page | Image by author

We’ll define the chart properties along with specifying the Markdown of the output segment in the variable output_page.

And in the last line above, we combine both input and output segments into a single variable called analysis_page.

(5.3) Main Landing Page

One last bit before our frontend interface is complete. Now that we have both pages ready, we will display them on our main landing page.

The main page is defined in main.py, which is the script that will be run when the application is launched. The aim is to create a functional menu bar on the main page for users to toggle between the pages.

From the above code, we can see the state functionality of Taipy in action, where the page is rendered based on the selected page in the session state.

Step 6 — Linking Backend and Frontend with Scenarios

At this point, our frontend interface and backend pipeline have been set up successfully. However, we have yet to link them together.

More specifically, we’ll need to create the Scenarios component so that variations in the input parameters are processed in the pipeline, and the output is reflected in the dashboard.

The added benefit of Scenarios is that each input-output set can be saved so that users can refer back to these previous configurations.

We’ll define four functions to set up the Scenarios component, which will be stored in the analysis_md.py script:

(6.1) Update Chart

This function updates the keywords DataFrame, frequency count table, and corresponding bar chart based on the input parameters of the selected Scenario stored in the session state.

(6.2) Submit Scenario

This function registers the updated set of input parameters the user has modified as a scenario and passes the values through the pipeline.

(6.3) Create Scenario

This function saves a scenario that has been executed so that it can be easily recreated and referred to again from the dropdown menu of created Scenarios.

(6.4) Synchronize GUI and Core

This function retrieves input parameters from a Scenario selected from the dropdown menu of saved Scenarios and displays the resulting output in the frontend GUI.

Step 7 — Launching the Application

In the last step, we wrap up by completing the code in main.py so that Taipy launches and runs correctly when the script is executed.

The above code performs the following steps:

  • Instantiate Taipy Core
  • Set up scenario creation and execution
  • Retrieve keywords DataFrame and frequency count table
  • Launch Taipy GUI (with the required pages)

Finally, we can run python main.py in the command line, and the application we have built will be accessible on localhost:8020.

Frontend interface of the completed application | Image by author

(4) Wrapping it up

The keywords associated with a document offer concise and comprehensive indications of its subject matter, highlighting the most important themes, concepts, ideas, or arguments contained therein.

In this article, we explored how to extract and analyze keywords of arXiv abstracts using KeyBERT and Taipy. We also discovered how to deliver these capabilities as a web application comprising a frontend user interface and a backend pipeline.

Feel free to check out the code in the accompanying GitHub repo.

Before you go

I welcome you to join me on a data science learning journey! Follow this Medium page and check out my GitHub to stay in the loop for more exciting practical data science content. Meanwhile, have fun building your keyword extraction and evaluation pipeline with KeyBERT and Taipy!
