
Leverage KeyBERT, HDBSCAN and Zephyr-7B-Beta to Construct a Knowledge Graph
Data Preparation
Keyword Extraction with KeyBERT and KeyLLM
Clustering Keywords with HDBSCAN
Extract Cluster Descriptions and Labels
Construct the Knowledge Graph

The work is done in Google Colab Pro with a V100 GPU and a High-RAM setting for the steps involving the LLM. The notebook is divided into self-contained sections, most of which can be executed independently, minimizing the dependency on previous steps. Data is saved after each section, allowing continuation in a new session if needed. Moreover, the parsed dataset and the Python modules are available in this Github repository.

I use a subset of the arXiv Dataset that is openly available on the Kaggle platform and primarily maintained by Cornell University. In a machine-readable format, it contains a repository of 1.7 million scholarly papers across STEM, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. It is updated regularly.

The dataset is clean and in an easy-to-use format, so we can focus on our task without spending too much time on data preprocessing. To further simplify the data preparation process, I built a Python module that performs the relevant steps. It can be found at utils/arxiv_parser.py if you want to take a peek at the code; otherwise follow along with the Google Colab:

  • download the zipped arXiv file (1.2 GB) in a directory of your choice, labelled data_path,
  • download arxiv_parser.py into the directory utils,
  • import and initialize the module in your Google Colab notebook,
  • unzip the file; this extracts a 3.7 GB file: archive-metadata-oai-snapshot.json,
  • specify a general topic (I work with cs, which stands for computer science), so you'll have a more manageable data size,
  • choose the features to keep (there are 14 features in the downloaded dataset),
  • the abstracts can vary quite a bit in length, so I added the option of selecting entries for which the number of tokens in the abstract falls in a given interval, and used this feature to downsize the dataset,
  • although I choose to work with the title feature, there is an option to take the more common approach of concatenating the title and the abstract in a single feature denoted corpus .
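As an illustration of the token-interval filter described above, here is a minimal pandas sketch (the column names and the whitespace tokenizer are my assumptions; the parser module may count tokens differently):

```python
import pandas as pd

# Toy stand-in for the parsed arXiv data (hypothetical values)
df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "abstract": ["short one", "w " * 110, "w " * 300],
})

# Approximate token count by whitespace splitting
df["abs_length"] = df["abstract"].str.split().str.len()

# Keep only abstracts whose token count falls in the chosen interval
selected = df[df["abs_length"].between(100, 120)].drop(columns="abs_length")
print(selected.id.tolist())  # ['a2']
```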
# Import the data parser module
from utils.arxiv_parser import *

# Initialize the data parser
parser = ArXivDataProcessor(data_path)

# Unzip the downloaded file to extract a json file in data_path
# (method name assumed from the module's description)
parser.unzip_file()

# Select a topic and extract the articles on that topic
entries = parser.select_topic('cs')

# Build a pandas dataframe with specified selections
df = parser.select_articles(entries, # extracted articles
            cols=['id', 'title', 'abstract'], # features to keep
            min_length=100, # min tokens an abstract must have
            max_length=120, # max tokens an abstract must have
            keep_abs_length=False, # don't keep the abs_length column
            build_corpus=False) # don't build a corpus column

# Save the selected data to a csv file 'selected_{topic}.csv', uses data_path
# (method name assumed)
parser.save_selected_data(df, topic='cs')

With the options above I extract a dataset of 983 computer science articles. We're ready to move to the next step.

If you want to skip the data processing steps, you may use the cs dataset, available in the Github repository.

The Method

KeyBERT is a method that extracts keywords or keyphrases from text. It uses document and word embeddings to find the sub-phrases that are most similar to the document, via cosine similarity. KeyLLM is another minimal method for keyword extraction, but it relies on LLMs. Both methods are developed and maintained by Maarten Grootendorst.

The two methods can be combined for enhanced results. Keywords extracted with KeyBERT are fine-tuned through KeyLLM. Conversely, candidate keywords identified through traditional NLP techniques help ground the LLM, minimizing the generation of undesired outputs.

For details on alternative ways of using KeyLLM see Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs.

— Diagram by author —

Use KeyBERT [source] to extract keywords from each document — these are the candidate keywords provided to the LLM to fine-tune:

  • documents are embedded using Sentence Transformers to build a document-level representation,
  • word embeddings are extracted for N-gram words/phrases,
  • cosine similarity is used to find the words or phrases that are most similar to each document.
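The idea behind these steps can be sketched with toy vectors standing in for real sentence-transformer embeddings (all values below are made up for illustration):

```python
import numpy as np

# Toy 4-dim "embeddings" standing in for real sentence-transformer vectors
doc_vec = np.array([0.9, 0.1, 0.0, 0.2])
candidates = {
    "logic programs":     np.array([0.8, 0.2, 0.1, 0.1]),
    "dynamic scheduling": np.array([0.1, 0.9, 0.0, 0.3]),
    "cooking recipes":    np.array([0.0, 0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank candidate phrases by similarity to the document embedding
ranked = sorted(candidates, key=lambda k: cosine(doc_vec, candidates[k]), reverse=True)
print(ranked[0])  # logic programs
```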

Use KeyLLM [source] to fine-tune the keywords extracted by KeyBERT via text generation with transformers [source]:

  • the community detection method in Sentence Transformers [source] groups similar documents, so we extract keywords from only one document in each group,
  • the candidate keywords are provided to the LLM, which fine-tunes the keywords for each cluster.

Besides Sentence Transformers, KeyBERT supports other embedding models, see [here].

Sentence Transformers facilitate community detection by using a specified threshold. When documents lack inherent clusters, clear groupings may not emerge. In my case, out of 983 titles, roughly 800 distinct communities were identified. More naturally clustered data tends to yield better-defined communities.
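To make the thresholding concrete, here is a simplified, greedy stand-in for community detection (the actual Sentence Transformers implementation is more sophisticated; the 2-D vectors below are toy examples):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def naive_communities(embeddings, threshold=0.75):
    """Greedy grouping: each vector joins the first community whose
    representative is at least `threshold` similar, else starts a new one."""
    reps, communities = [], []
    for i, v in enumerate(embeddings):
        for j, r in enumerate(reps):
            if cosine_sim(v, r) >= threshold:
                communities[j].append(i)
                break
        else:
            reps.append(v)
            communities.append([i])
    return communities

# Two near-duplicate vectors and one outlier
vecs = [np.array([1.0, 0.0]), np.array([0.98, 0.05]), np.array([0.0, 1.0])]
print(naive_communities(vecs, threshold=0.9))  # [[0, 1], [2]]
```

A higher threshold yields many small, near-duplicate communities (the ~800 I observed); a lower one merges topically related titles.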

The Large Language Model

After experimenting with various smaller LLMs, I selected Zephyr-7B-Beta for this project. This model is based on Mistral-7B, and it is one of the first models fine-tuned with Direct Preference Optimization (DPO). It not only outperforms other models in its class but also surpasses Llama2-70B on some benchmarks. For more insights on this LLM take a look at Benjamin Marie, Zephyr 7B Beta: A Good Teacher is All You Need. Although it is feasible to use the model directly on a Google Colab Pro, I opted to work with a GPTQ quantized version prepared by TheBloke.

Start by downloading the model and its tokenizer following the instructions provided in the model card:

# Required installs
!pip install transformers optimum accelerate
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# Required imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model and the tokenizer
model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"

llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           revision="main") # change revision for a different branch
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=True)

Moreover, build the text generation pipeline (the generation settings below are assumed; tune them as needed):

generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
)

The Keyword Extraction Prompt

Experimentation is key in this step. Finding the optimal prompt requires some trial and error, and the performance depends on the chosen model. Let's not forget that LLMs are probabilistic, so it is not guaranteed that they will return the same output every time. To develop the prompt below, I relied on both experimentation and the following considerations:

prompt = "Tell me about AI"


And here is the prompt I use to fine-tune the keywords extracted with KeyBERT:

prompt_keywords = """
<|system|>
I have the following document:
Semantics and Termination of Simply-Moded Logic Programs with Dynamic Scheduling
and five candidate keywords:
scheduling, logic, semantics, termination, moded

Based on the information above, extract the keywords or the keyphrases that best describe the topic of the text.
Follow the requirements below:
1. Make sure to extract only the keywords or keyphrases that appear in the text.
2. Provide five keywords or keyphrases! Do not number or label the keywords or the keyphrases!
3. Do not include anything besides the keywords or the keyphrases! I repeat do not include any comments!

semantics, termination, simply-moded, logic programs, dynamic scheduling
</s>
<|user|>
I have the following document:
[DOCUMENT]
and five candidate keywords:
[CANDIDATES]

Based on the information above, extract the keywords or the keyphrases that best describe the topic of the text.
Follow the requirements below:
1. Make sure to extract only the keywords or keyphrases that appear in the text.
2. Provide five keywords or keyphrases! Do not number or label the keywords or the keyphrases!
3. Do not include anything besides the keywords or the keyphrases! I repeat do not include any comments!
</s>
<|assistant|>
"""

Keyword Extraction and Parsing

We now have everything needed to proceed with the keyword extraction. Let me remind you that I work with the titles, so the input documents are short, staying well within the token limits for the BERT embeddings.

Start by creating a TextGeneration pipeline wrapper for the LLM and instantiate KeyBERT. Select the embedding model. If no embedding model is specified, the default model is all-MiniLM-L6-v2. In this case, I select the best-performing pretrained model for sentence embeddings; see here for a complete list.

# Install the required packages
!pip install keybert
!pip install sentence-transformers

# The required imports
from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT
from sentence_transformers import SentenceTransformer

# KeyBert TextGeneration pipeline wrapper
llm_tg = TextGeneration(generator, prompt=prompt_keywords)

# Instantiate KeyBERT and specify an embedding model
kw_model= KeyBERT(llm=llm_tg, model = "all-mpnet-base-v2")

Recall that the dataset was prepared and saved as a pandas dataframe df. To process the titles, just call the extract_keywords method:

# Retain the article titles only for analysis
titles_list = df.title.tolist()

# Process the documents and collect the results
titles_keys = kw_model.extract_keywords(titles_list, threshold=0.5)

# Add the results to df
df["titles_keys"] = titles_keys

The threshold parameter determines the minimum similarity required for documents to be grouped into the same community. A higher value will group nearly identical documents, while a lower value will cluster documents covering similar topics.

The choice of embeddings significantly influences the appropriate threshold, so it is advisable to consult the model card for guidance. I'm grateful to Maarten Grootendorst for highlighting this aspect, as can be seen here.

It is important to note that my observations apply exclusively to sentence transformers, as I haven't experimented with other types of embeddings.

Let's take a look at some outputs:


  • In the second example provided here, we observe keywords or keyphrases not present in the original text. If this poses a problem in your case, consider enabling check_vocab=True as done [here]. However, it is important to remember that these results are highly influenced by the choice of LLM, with quantization having a minor effect, as well as by the construction of the prompt.
  • With longer input documents, I noticed more deviations from the required output.
  • One consistent observation is that the number of keywords extracted often deviates from five. It is common to encounter titles with fewer extracted keywords, especially when the input is brief. Conversely, some titles yield as many as 10 extracted keywords. Let's examine the distribution of keyword counts for this run:

These variations complicate the subsequent parsing steps. There are a few options for addressing this: we could investigate these cases in detail, request the model to revise and either trim or reiterate the keywords, or simply overlook these instances and focus solely on titles with exactly five keywords, as I've decided to do for this project.
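Keeping only the titles with exactly five keywords is a one-liner in pandas (a sketch with toy data; the real dataframe is the df built earlier):

```python
import pandas as pd

# Toy results: lists of extracted keywords per title (hypothetical)
df = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "titles_keys": [["a", "b", "c", "d", "e"], ["a", "b"], ["a"] * 7],
})

# Keep only the titles for which exactly five keywords were returned
df5 = df[df["titles_keys"].apply(len) == 5].reset_index(drop=True)
print(len(df5))  # 1
```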

The next step is to cluster the keywords and keyphrases to reveal common topics across articles. To accomplish this I use two algorithms: UMAP for dimensionality reduction and HDBSCAN for clustering.

The Algorithms: HDBSCAN and UMAP

Hierarchical Density-Based Spatial Clustering of Applications with Noise, or HDBSCAN, is a highly performant unsupervised algorithm designed to find patterns in the data. It finds the optimal clusters based on their density and proximity. This is especially useful in cases where the number and shape of the clusters may be unknown or difficult to determine.

The results of the HDBSCAN clustering algorithm can vary if you run the algorithm multiple times with the same hyperparameters. This is because HDBSCAN is a stochastic algorithm, which means that it involves some degree of randomness in the clustering process. Specifically, HDBSCAN uses a random initialization of the cluster hierarchy, which can result in different cluster assignments each time the algorithm is run.

However, the degree of variation between different runs of the algorithm can depend on several factors, such as the dataset, the hyperparameters, and the seed value used for the random number generator. In some cases, the variation may be minimal, while in other cases it can be significant.

There are two clustering options with HDBSCAN.

  • The primary clustering algorithm, denoted hard_clustering, assigns each data point to a cluster or labels it as noise. This is a hard assignment; there are no mixed memberships. This approach might result in one large cluster categorized as noise (cluster labelled -1) and numerous smaller clusters. Fine-tuning the hyperparameters is crucial [see here], as is selecting an embedding model specifically tailored for the domain. Take a look at the associated Google Colab for the results of hard clustering on the project's dataset.
  • Soft clustering, on the other hand, is a more recent feature of the HDBSCAN library. In this approach points are not assigned cluster labels, but instead are assigned a vector of probabilities. The length of the vector is equal to the number of clusters found. The probability value at each entry of the vector is the probability that the point is a member of that cluster. This allows points to potentially be a mix of clusters. If you want to better understand how soft clustering works, please refer to How Soft Clustering for HDBSCAN Works. This approach is better suited to the present project, as it generates a larger set of moderately similar-sized clusters.

While HDBSCAN can perform well on low- to medium-dimensional data, the performance tends to decrease significantly as dimension increases. In general HDBSCAN performs best on up to around 50-dimensional data, [see here].

Documents for clustering are typically embedded using an efficient transformer from the BERT family, resulting in a dataset with several hundred dimensions.

To reduce the dimension of the embedding vectors we use UMAP (Uniform Manifold Approximation and Projection), a non-linear dimension reduction algorithm and the best performing in its class. It seeks to learn the manifold structure of the data and to find a low-dimensional embedding that preserves the essential topological structure of that manifold.

UMAP has been shown to be highly effective at preserving the overall structure of high-dimensional data in lower dimensions, while also providing superior performance to other popular algorithms like t-SNE and PCA.

Keyword Clustering

  • Install and import the required packages and libraries.
# Required installs
!pip install umap-learn
!pip install hdbscan
!pip install -U sentence-transformers

# General imports
import pandas as pd
import numpy as np
import re
import pickle

# Imports needed to generate the BERT embeddings
from sentence_transformers import SentenceTransformer

# Libraries for dimensionality reduction
import umap.umap_ as umap

# Import the clustering algorithm
import hdbscan

  • Prepare the dataset by aggregating all keywords and keyphrases from each title's individual quintet into a single list of unique keywords, and save it as a pandas dataframe.
# Load the data if needed - titles with 5 extracted keywords
df5 = pd.read_csv(data_path + parsed_keys_file)

# Create a list of all sublists of keywords and keyphrases
df5_keys = df5.titles_keys.tolist()

# Flatten the list of sublists
flat_keys = [item for sublist in df5_keys for item in sublist]

# Create a list of unique keywords
flat_keys = list(set(flat_keys))

# Create a dataframe with the distinct keywords
keys_df = pd.DataFrame(flat_keys, columns=['key'])

I obtain almost 3000 unique keywords and keyphrases from the 884 processed titles. Here’s a sample: n-colorable graphs, experiments, constraints, tree structure, complexity, etc.

  • Generate 768-dimensional embeddings with Sentence Transformers.
# Instantiate the embedding model
model = SentenceTransformer('all-mpnet-base-v2')

# Embed the keywords and keyphrases into 768-dim real vector space
keys_df['key_bert'] = keys_df['key'].apply(lambda x: model.encode(x))

  • Perform dimensionality reduction with UMAP.
# Reduce to 10-dimensional vectors and keep the local neighborhood at 15
# (min_dist and metric values are assumed; adjust as needed)
embeddings = umap.UMAP(n_neighbors=15,   # balances local vs. global structure
                       n_components=10,  # dimension of reduced vectors
                       min_dist=0.0,
                       metric='cosine').fit_transform(list(keys_df.key_bert))

# Add the reduced embedding vectors to the dataframe
keys_df['key_umap'] = embeddings.tolist()

  • Cluster the 10-dimensional vectors with HDBSCAN. To keep this blog succinct, I'll omit descriptions of the parameters that pertain more to hard clustering. For detailed information on each parameter, please refer to [Parameter Selection for HDBSCAN*].
# Initialize the clustering model
# (parameters other than cluster_selection_epsilon are assumed;
# prediction_data=True is required for soft clustering)
clusterer = hdbscan.HDBSCAN(algorithm='best',
                            prediction_data=True,
                            min_cluster_size=15,
                            cluster_selection_epsilon=0.1,
                            metric='euclidean',
                            cluster_selection_method='leaf')

# Fit the data
clusterer.fit(np.array(keys_df.key_umap.tolist()))

# Create soft clusters
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

# Add the soft cluster information to the data
closest_clusters = [np.argmax(x) for x in soft_clusters]
keys_df['cluster'] = closest_clusters

Below is the distribution of keywords across clusters. Examination of the spread of keywords and keyphrases into soft clusters reveals a total of 60 clusters, with a fairly even distribution of elements per cluster, varying from about 20 to almost 100.
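The distribution itself can be inspected with a quick value_counts (toy assignments below stand in for the real keys_df):

```python
import pandas as pd

# Toy soft-cluster assignments (hypothetical)
keys_df = pd.DataFrame({"key": list("abcdef"),
                        "cluster": [0, 0, 1, 1, 1, 2]})

# Number of keywords per cluster, smallest first
counts = keys_df["cluster"].value_counts().sort_values()
print(counts.to_dict())
```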

Having clustered the keywords, we are now ready to employ GenAI once again to enhance and refine our findings. At this step, we will use an LLM to analyze each cluster and summarize the keywords and keyphrases while assigning a brief label to the cluster.

While it is not necessary, I choose to continue with the same LLM, Zephyr-7B-Beta. Should you need to download the model, please consult the relevant section. Notably, I'll adjust the prompt to fit the distinct nature of this task.

The following function is designed to extract a label and a description for a cluster, parse the output, and integrate it into a pandas dataframe.

def extract_description(df: pd.DataFrame,
                        n: int
                        ) -> pd.DataFrame:
    """Use a custom prompt to send to a LLM
    to extract labels and descriptions for a list of keywords."""

    one_cluster = df[df['cluster'] == n]
    one_cluster_copy = one_cluster.copy()
    sample = one_cluster_copy.key.tolist()

    prompt_clusters = f"""
<|system|>
I have the following list of keywords and keyphrases:
['encryption','attribute','firewall','security properties',
'network security','reliability','surveillance','distributed risk factors',
'still vulnerable','cryptographic','protocol','signaling','safe',
'adversary','message passing','input-determined guards','secure communication',
'vulnerabilities','value-at-risk','anti-spam','intellectual property rights',
'countermeasures','security implications','privacy','protection',
'mitigation strategies','vulnerability','secure networks','guards']

Based on the information above, first name the domain these keywords or keyphrases
belong to, secondly give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords
or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or
with 'The domain'.

Cybersecurity: Cybersecurity, emphasizing methods and techniques for safeguarding digital information
and networks against unauthorized access and threats.
</s>
<|user|>
I have the following list of keywords and keyphrases:
{sample}
Based on the information above, first name the domain these keywords or keyphrases belong to, secondly
give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.
</s>
<|assistant|>
"""

    # Generate the outputs (generation settings assumed)
    outputs = generator(prompt_clusters,
                        max_new_tokens=120)

    text = outputs[0]["generated_text"]

    # The answer follows the final assistant tag
    pattern = "<|assistant|>\n"

    # Extract the output
    response = text.split(pattern, 1)[1].strip(" ")

    # Check if the output has the desired Label: Description format
    if len(response.split(":", 1)) == 2:
        label = response.split(":", 1)[0].strip(" ")
        description = response.split(":", 1)[1].strip(" ")
    else:
        # Log the full response in both columns for manual inspection
        label = description = response

    # Add the description and the labels to the dataframe
    one_cluster_copy.loc[:, 'description'] = description
    one_cluster_copy.loc[:, 'label'] = label

    return one_cluster_copy

Now we can apply the above function to each cluster and collect the results:

import pandas as pd

# Initialize an empty list to store the cluster dataframes
dataframes = []
clusters = len(set(keys_df.cluster))

# Iterate over the cluster labels
for n in range(clusters):
    df_result = extract_description(keys_df, n)
    dataframes.append(df_result)

# Concatenate the individual dataframes
final_df = pd.concat(dataframes, ignore_index=True)

Let's take a look at a sample of outputs. For the complete list of outputs please refer to the Google Colab.

We must remember that LLMs, with their inherent probabilistic nature, can be unpredictable. While they generally adhere to instructions, their compliance is not absolute. Even slight alterations in the prompt or the input text can lead to substantial differences in the output. In the extract_description() function, I've incorporated a feature to log the response in both the label and description columns in those cases where the Label: Description format is not followed, as illustrated by the irregular output for cluster 7 above. The outputs for the complete set of 60 clusters are available in the accompanying Google Colab notebook.

A second remark is that each cluster is parsed independently by the LLM, so it is possible to get repeated labels. Additionally, there may be instances of recurring keywords extracted from the input list.

The effectiveness of the process is highly reliant on the choice of LLM, and issues are minimal with a highly performant LLM. The output also depends on the quality of the keyword clustering and the presence of an inherent topic within the cluster.

Strategies to mitigate these challenges depend on the cluster count, dataset characteristics, and the accuracy required for the project. Here are two options:

  • Manually rectify each issue, as I did in this project. With only 60 clusters and merely three erroneous outputs, manual adjustments were made to correct the faulty outputs and to ensure unique labels for each cluster.
  • Employ an LLM to make the corrections, although this method doesn't guarantee flawless results.
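For the manual route, clusters sharing a label are easy to surface first (a sketch with hypothetical labels):

```python
import pandas as pd

# Toy cluster summaries (hypothetical labels)
final_df = pd.DataFrame({
    "cluster": [0, 1, 2],
    "label": ["Cryptography", "Semantics", "Cryptography"],
})

# Clusters sharing a label need manual disambiguation
dupes = final_df[final_df.duplicated("label", keep=False)]
print(sorted(dupes.cluster.tolist()))  # [0, 2]
```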

Data to Upload into the Graph

There are two csv files (or pandas dataframes if working in a single session) to extract the data from.

  • articles – it contains a unique id for each article, title , abstract and titles_keys , which is the list of five extracted keywords or keyphrases;
  • keywords – with columns key , cluster , description and label , where key contains a complete list of unique keywords or keyphrases, and the remaining features describe the cluster the keyword belongs to.

Neo4j Connection

To build a knowledge graph, we start by setting up a Neo4j instance, choosing from options like Sandbox, AuraDB, or Neo4j Desktop. For this project, I'm using AuraDB's free version. It is easy to launch a blank instance and download its credentials.

Next, establish a connection to Neo4j. For convenience, I use a custom Python module, which can be found at utils/neo4j_conn.py. This module contains methods for connecting and interacting with the graph database.

# Install neo4j
!pip install neo4j

# Import the connector
from utils.neo4j_conn import *

# Graph DB instance credentials
URI = 'neo4j+ssc://xxxxxx.databases.neo4j.io'
USER = 'neo4j'
PWD = 'your_password_here'

# Establish the connection to the Neo4j instance
graph = Neo4jGraph(url=URI, username=USER, password=PWD)

The graph we’re about to construct has a straightforward schema consisting of three nodes and two relationships:

— Image by author —

Building the graph is now straightforward with just two Cypher queries:

# Load Keyword and Topic nodes, and the relationships HAS_TOPIC
query_keywords_topics = """
UNWIND $rows AS row
MERGE (k:Keyword {name: row.key})
MERGE (t:Topic {cluster: row.cluster, description: row.description, label: row.label})
MERGE (k)-[:HAS_TOPIC]->(t)
"""
graph.load_data(query_keywords_topics, keywords)

# Load Article nodes and the relationships HAS_KEY
query_articles = """
UNWIND $rows AS row
MERGE (a:Article {id: row.id, title: row.title, abstract: row.abstract})
WITH a, row
UNWIND row.titles_keys AS key
MATCH (k:Keyword {name: key})
MERGE (a)-[:HAS_KEY]->(k)
"""
graph.load_data(query_articles, articles)

Query the Graph

Let's check the distribution of the nodes and relationships by type:
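A pair of Cypher queries along these lines does the counting (a sketch against the schema above):

```cypher
// Count nodes by label
MATCH (n)
RETURN labels(n)[0] AS nodeType, count(n) AS total;

// Count relationships by type
MATCH ()-[r]->()
RETURN type(r) AS relType, count(r) AS total;
```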

We can find which individual topics (or clusters) are the most popular among our collection of articles, by counting the cumulative number of articles associated with the keywords they're connected to:
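A query along these lines computes that count (a sketch; property names follow the loading queries above):

```cypher
MATCH (a:Article)-[:HAS_KEY]->(:Keyword)-[:HAS_TOPIC]->(t:Topic)
RETURN t.label AS topic, count(DISTINCT a) AS articles
ORDER BY articles DESC
LIMIT 10;
```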

Here’s a snapshot of the node Semantics that corresponds to cluster 58 and its connected keywords:

— Image by writer —

We can also identify commonly occurring words in titles, using the query below:
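A query along these lines does the job (a sketch; it splits titles on whitespace and keeps words longer than three characters as a crude stop-word filter):

```cypher
MATCH (a:Article)
UNWIND split(toLower(a.title), ' ') AS word
WITH word WHERE size(word) > 3
RETURN word, count(*) AS freq
ORDER BY freq DESC
LIMIT 10;
```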

We saw how we can structure and enrich a collection of seemingly unrelated short text entries. Using traditional NLP and machine learning, we first extract keywords and then we cluster them. These results guide and ground the refinement process performed by Zephyr-7B-Beta. While some oversight of the LLM is still necessary, the initial output is significantly enriched. A knowledge graph is used to reveal the newly discovered connections in the corpus.

Our key takeaway is that no single method is perfect. However, by strategically combining different techniques and acknowledging their strengths and weaknesses, we can achieve superior results.

Google Colab Notebook and Code


Technical Documentation

Blogs and Articles

  • Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs, Towards Data Science, Oct 5, 2023.
  • Benjamin Marie, Zephyr 7B Beta: A Good Teacher Is All You Need, Towards Data Science, Nov 10, 2023.
  • The H4 Team, Zephyr: Direct Distillation of LM Alignment, Technical Report, arXiv: 2310.16944, Oct 25, 2023.

