Methods for creating fine-tuning datasets for text-to-Cypher generation.
Cypher is Neo4j’s graph query language. It was inspired by and bears similarities to SQL, enabling data retrieval from knowledge graphs. Given the rise of generative AI and the widespread availability of large language models (LLMs), it’s natural to ask which LLMs are capable of generating Cypher queries, or how we can fine-tune our own model to generate Cypher from text.
The problem presents considerable challenges, primarily due to the scarcity of fine-tuning datasets and, in my opinion, because such a dataset would significantly depend on the specific graph schema.
In this blog post, I’ll discuss several approaches for building a fine-tuning dataset aimed at generating Cypher queries from text. The first approach is grounded in large language models (LLMs) and uses a predefined graph schema. The second strategy, rooted entirely in Python, offers a versatile way to produce a vast array of questions and Cypher queries, adaptable to any graph schema. For experimentation I created a knowledge graph based on a subset of the ArXiv dataset.
As I was finalizing this blog post, Tomaz Bratanic launched an initiative aimed at developing a comprehensive fine-tuning dataset that encompasses various graph schemas and integrates a human-in-the-loop approach to generate and validate Cypher statements. I hope that the insights discussed here will also be useful to the project.
I like working with the ArXiv dataset of scientific articles because of its clean, easy-to-integrate format for a knowledge graph. Using techniques from my recent Medium blog post, I enhanced this dataset with additional keywords and clusters. Since my primary focus is on building a fine-tuning dataset, I’ll omit the specifics of constructing this graph. For those interested, details can be found in this Github repository.
The graph is of a reasonable size, featuring over 38K nodes and almost 96K relationships, with 9 node labels and 8 relationship types. Its schema is illustrated in the following image:
While this data graph isn’t fully optimized and could be improved, it serves the needs of this blog post quite effectively. If you prefer to just test queries without building the graph, I uploaded the dump file to this Github repository.
The first approach I implemented was inspired by Tomaz Bratanic’s blog posts on building a knowledge graph chatbot and fine-tuning an LLM with H2O Studio. Initially, a selection of sample queries was provided in the prompt. However, some of the recent models have an enhanced capability to generate Cypher queries directly from the graph schema. Therefore, in addition to GPT-4 or GPT-4-turbo, there are now accessible open source alternatives such as Mixtral-8x7B that I anticipate could effectively generate decent quality training data.
In this project, I experimented with two models. For the sake of convenience, I decided to use GPT-4-turbo in combination with ChatGPT, see this Colab Notebook. However, in this notebook I performed a few tests with Mixtral-8x7B-GPTQ, a quantized model that is small enough to run on Google Colab, and which delivers satisfactory results.
To maintain data diversity and effectively monitor the generated pairs of questions and Cypher statements, I adopted a two-step approach:
- Step 1: provide the full schema to the LLM and request it to generate 10–15 different categories of potential questions related to the graph, along with their descriptions,
- Step 2: provide schema information and instruct the LLM to create a specific number N of training pairs for each identified category.
Extract the categories of samples:
For this step I used the ChatGPT Pro version, although I did iterate through the prompt several times, and combined and enhanced the outputs.
- Extract the schema of the graph as a string (more about this in the next section).
- Construct a prompt to generate the categories:
chatgpt_categories_prompt = f"""
You are an experienced and useful Python and Neo4j/Cypher developer. I have a knowledge graph for which I would like to generate
interesting questions which span 12 categories (or types) about the graph.
They should cover single node questions,
two or three more nodes, relationships and paths. Please suggest 12
categories together with their short descriptions.
Here is the graph schema:
{schema}
"""
- Ask the LLM to generate the categories.
- Review, make corrections, and enhance the categories as needed. Here’s a sample:
'''Authorship and Collaboration: Questions about co-authorship and collaboration patterns.
For example, "Which authors have co-authored articles the most?"''',
'''Article-Author Connections: Questions about the relationships between articles and authors,
such as finding articles written by a specific author or authors of a particular article.
For example, "Find all the authors of the article with title 'Explorations of manifolds'"''',
'''Pathfinding and Connectivity: Questions that involve paths between multiple nodes,
such as tracing the connection path from an article to a topic through keywords,
or from an author to a journal through their articles.
For example, "How is the author 'John Doe' connected to the journal 'Nature'?"'''
💡Tips💡
- If the graph schema is very large, split it into overlapping subgraphs (this depends on the graph topology also) and repeat the above process for each subgraph.
- When working with open source models, choose the best model you can fit on your computational resources. TheBloke has posted an extensive list of quantized models, Neo4j GenAI provides tools to work on your own hardware, and LightningAI Studio is a recently released platform which gives you access to a multitude of LLMs.
Generate the training pairs:
This step was performed with the OpenAI API, working with GPT-4-turbo, which also has the option to output JSON format. Again the schema of the graph is supplied with the prompt:
def create_prompt(schema, category):
    """Build and format the prompt."""
    formatted_prompt = [
        {"role": "system",
         "content": "You are an experienced Cypher developer and a helpful assistant designed to output JSON!"},
        {"role": "user",
         "content": f"""Generate 40 questions and their corresponding Cypher statements about the Neo4j graph database with the following schema:
{schema}
The questions should cover {category} and should be phrased in a natural conversational manner. Make the questions diverse and interesting.
Make sure to use the latest Cypher version and that all the queries are working Cypher queries for the provided graph.
You may add values for the node attributes as needed.
Do not add any comments, do not label or number the questions.
"""}]
    return formatted_prompt
Build the function which will prompt the model and retrieve the output:
def prompt_model(messages):
    """Function to produce and extract the model's generation."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # work with gpt-4-turbo
        response_format={"type": "json_object"},
        messages=messages)
    return response.choices[0].message.content
Loop through the categories and collect the outputs in a list:
def build_synthetic_data(schema, categories):
    """Function to loop through the categories and generate data."""
    # List to collect all outputs
    full_output = []
    for category in categories:
        # Prompt the model and retrieve the generated answer
        output = [prompt_model(create_prompt(schema, category))]
        # Store all the outputs in a list
        full_output += output
    return full_output

# Generate 40 pairs for each of the categories
full_output = build_synthetic_data(schema, categories)
# Save the outputs to a file
write_json(full_output, data_path + synthetic_data_file)
At this point in the project I had collected almost 500 pairs of questions and Cypher statements. Here’s a sample:
{"Question": "What articles have been written by 'John Doe'?",
 "Cypher": "MATCH (a:Author {first_name:'John', last_name:'Doe'})-
[:WRITTEN_BY]-(article:Article) RETURN article.title, article.article_id;"}
The data requires significant cleaning and wrangling. While not overly complex, the process is both time-consuming and tedious. Here are several of the challenges I encountered:
- non-JSON entries due to incomplete Cypher statements;
- the expected format is {’question’: ‘some question’, ‘cypher’: ’some cypher’}, but deviations are frequent and must be standardized;
- instances where the questions and the Cypher statements are clustered together, necessitating their separation and organization.
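Because the raw API responses arrive as JSON strings with inconsistent shapes, a small normalization pass helps. Below is a minimal sketch of such a cleaning step; the key names and the wrapping structure it handles are assumptions based on the deviations listed above, not the exact code from the project:

```python
import json

def clean_output(raw_outputs):
    """Parse raw LLM outputs and standardize them into
    {'question': ..., 'cypher': ...} pairs, skipping invalid entries."""
    pairs = []
    for raw in raw_outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON entries (e.g. truncated generations)
        # The model may wrap the pair list in a top-level key such as "questions"
        records = data if isinstance(data, list) else next(iter(data.values()))
        for rec in records:
            # Standardize key casing: 'Question'/'question', 'Cypher'/'cypher'
            norm = {k.lower(): v for k, v in rec.items()}
            if "question" in norm and "cypher" in norm:
                pairs.append({"question": norm["question"],
                              "cypher": norm["cypher"]})
    return pairs
```

Entries where questions and statements are clustered together still need manual or regex-based separation; this sketch only covers the mechanical part.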
💡Tip💡
It is better to iterate through variations of the prompt than to try to find the best prompt format from the start. In my experience, even with diligent adjustments, generating a large volume of data like this inevitably results in some deviations.
Now regarding the content: GPT-4-turbo is quite capable of generating good questions about the graph, however not all the Cypher statements are valid (working Cypher) and correct (extracting the intended information). When fine-tuning in a production environment, I’d either rectify or eliminate these erroneous statements.
I created a function execute_cypher_queries() that sends the queries to the Neo4j graph database. It either records a message in case of an error or retrieves the output from the database. This function is available in this Google Colab notebook.
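The notebook has the full implementation; what follows is only a minimal sketch of such a validator. To keep it easy to test, it takes an already opened neo4j session (or any object exposing a compatible run() method) instead of creating the driver itself; the result keys and error handling are my assumptions:

```python
def execute_cypher_queries(session, pairs):
    """Run each generated Cypher statement against the graph and record
    either its output or the error message it raises.

    `session` is a neo4j session (or any object with a compatible
    .run() method returning a result with .data())."""
    results = []
    for pair in pairs:
        try:
            records = session.run(pair["cypher"]).data()
            results.append({**pair, "output": records, "error": None})
        except Exception as exc:  # syntax errors, unknown labels, ...
            results.append({**pair, "output": None, "error": str(exc)})
    return results
```

With the official driver this would be used as `with GraphDatabase.driver(URI, auth=(USER, PWD)).session() as session: execute_cypher_queries(session, pairs)`; statements with a non-empty `error` field can then be rectified or dropped.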
From the prompt, you may notice that I instructed the LLM to generate mock data to populate the attribute values. While this approach is simpler, it results in quite a few empty outputs from the graph. And it demands extra effort to identify those statements involving hallucinations, such as made-up attributes:
MATCH (author:Author)-[:WRITTEN_BY]-(article:Article)-[:UPDATED]-
(updateDate:UpdateDate)
WHERE article.creation_date = updateDate.update_date
RETURN DISTINCT author.first_name, author.last_name;
The Article node has no creation_date attribute in the ArXiv graph!
💡Tip💡
To minimize the empty outputs, we could instead extract instances directly from the graph. These instances can then be incorporated into the prompt, instructing the LLM to use this information to enrich the Cypher statements.
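As a sketch of this tip, the prompt builder could accept sampled node instances and require the model to use only values that actually occur in the graph. The function name and the prompt wording below are mine, not the ones used in the project:

```python
def enrich_prompt(schema, category, node_instances):
    """Build a prompt that, besides the schema, includes real node
    instances sampled from the graph, so the LLM grounds attribute
    values in data that actually exists."""
    samples = "\n".join(str(inst) for inst in node_instances)
    return f"""Generate questions and their corresponding Cypher statements
about the Neo4j graph database with the following schema:
{schema}
The questions should cover {category}.
Use ONLY attribute values taken from these node instances sampled from the graph:
{samples}
"""
```

This trades a slightly longer prompt for far fewer queries that return nothing from the database.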
This method allows us to create anywhere from hundreds to hundreds of thousands of correct Cypher queries, depending on the graph’s size and complexity. However, it’s crucial to strike a balance between the quantity and the diversity of these queries. Despite being correct and applicable to any graph, the queries can occasionally appear formulaic or rigid.
Extract Information About the Graph Structure
For this process we need to start with some data extraction and preparation. I use the Cypher queries and some of the code from the neo4j_graph.py module in Langchain.
- Connect to an existing Neo4j graph database.
- Extract the schema in JSON format.
- Extract several node and relationship instances from the graph, i.e. data from the graph to use as samples to populate the queries.
I created a Python class that performs these steps; it is available at utils/neo4j_schema.py in the Github repository. With all these in place, extracting the relevant data about the graph requires only a few lines of code:
# Initialize the Neo4j connector
graph = Neo4jGraph(url=URI, username=USER, password=PWD)
# Initialize the schema extractor module
gutils = Neo4jSchema(url=URI, username=USER, password=PWD)

# Build the schema as a JSON object
jschema = gutils.get_structured_schema
# Retrieve the list of nodes in the graph
nodes = get_nodes_list(jschema)
# Read the nodes with their properties and their datatypes
node_props_types = jschema['node_props']

# Check the output
print(f"The properties of the node Report are:\n{node_props_types['Report']}")
>>>The properties of the node Report are:
[{'property': 'report_id', 'datatype': 'STRING'}, {'property': 'report_no', 'datatype': 'STRING'}]
# Extract the list of relationships
relationships = jschema['relationships']
# Check the output
relationships[:1]
>>>[{'start': 'Article', 'type': 'HAS_KEY', 'end': 'Keyword'},
{'start': 'Article', 'type': 'HAS_DOI', 'end': 'DOI'}]
Extract Data From the Graph
This data will provide authentic values to populate our Cypher queries with.
- First, we extract several node instances; this will retrieve all the data for nodes in the graph, including labels, attributes and their values:
# Extract node samples from the graph - 4 sets of node samples
node_instances = gutils.extract_node_instances(
    nodes,  # list of nodes to extract labels
    4)  # how many instances to extract for each node
- Next, extract relationship instances; this includes all the data on the start node, the relationship with its type and properties, and the end node information:
# Extract relationship instances
rels_instances = gutils.extract_multiple_relationships_instances(
    relationships,  # list of relationships to extract instances for
    8)  # how many instances to extract for each relationship
💡Tips💡
- Both of the above methods work for the full lists of nodes and relationships, or for sublists of them.
- If the graph contains instances that lack records for some attributes, it’s advisable to collect more instances to ensure all possible scenarios are covered.
The next step is to serialize the data, replacing the Neo4j.time values with strings, and save it to files.
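A possible serializer for this step might look as follows. The iso_format() fallback targets the neo4j.time types, which expose such a method in recent driver versions; verify against the driver version you use:

```python
import json
from datetime import date, datetime

def serialize_value(value):
    """Recursively convert values that json cannot handle
    (dates, neo4j.time types) into ISO strings."""
    if isinstance(value, dict):
        return {k: serialize_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [serialize_value(v) for v in value]
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    # neo4j.time.Date / DateTime / Time expose iso_format()
    if hasattr(value, "iso_format"):
        return value.iso_format()
    return value

def save_instances(instances, path):
    """Serialize the extracted instances and save them to a JSON file."""
    with open(path, "w") as f:
        json.dump(serialize_value(instances), f, indent=2)
```

After this step, the node and relationship instances can be reloaded with plain json.load() in later notebooks.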
Parse the Extracted Data
I refer to this phase as Python gymnastics. Here, we handle the data obtained in the previous step, which consists of the graph schema, node instances, and relationship instances. We reformat this data to make it easily accessible for the functions we’re developing.
- We first identify all the datatypes in the graph with:
dtypes = retrieve_datatypes(jschema)
dtypes
>>> {'DATE', 'INTEGER', 'STRING'}
- For each datatype we extract the attributes (and the corresponding nodes) that have that datatype.
- We parse instances of each datatype.
- We also process and filter the relationships so that the start and the end nodes have attributes of specified datatypes.
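As an illustration of the second step above, grouping attributes by datatype can be done by inverting the node_props map shown earlier; the helper name is mine, not the one in the repository:

```python
def group_attributes_by_datatype(structured_schema):
    """Invert the node -> properties map: for each datatype, list the
    (node label, property) pairs carrying that datatype."""
    by_dtype = {}
    for label, props in structured_schema["node_props"].items():
        for p in props:
            by_dtype.setdefault(p["datatype"], []).append((label, p["property"]))
    return by_dtype
```

The resulting map makes it trivial to pick, say, all STRING attributes when instantiating CONTAINS-style query templates.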
All the code is available in the Github repository. The reasons for doing all this will become transparent in the next section.
How to Build One or One Thousand Cypher Statements
Being a mathematician, I often perceive statements in terms of the underlying functions. Let’s consider the following example:
q = "Find the Topic whose description contains 'Jordan normal form'!"
cq = "MATCH (n:Topic) WHERE n.description CONTAINS 'Jordan normal form' RETURN n"
The above can be regarded as functions of several variables f(x, y, z) and g(x, y, z), where
f(x, y, z) = f"Find the {x} whose {y} contains {z}!"
q = f('Topic', 'description', 'Jordan normal form')

g(x, y, z) = f"MATCH (n:{x}) WHERE n.{y} CONTAINS {z} RETURN n"
cq = g('Topic', 'description', 'Jordan normal form')
How many queries of this type can we build? To simplify the argument, let’s assume that there are N node labels, each having on average n properties of STRING datatype. So at least N×n queries are available for us to build, not taking into account the options for the string choices z.
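In Python, the f and g templates above translate directly into a generator that walks the node_props map and emits one pair per STRING attribute. This is a sketch of the idea, not the repository code:

```python
def build_contains_queries(node_props_types, value="Jordan normal form"):
    """Instantiate the (f, g) template pair for every STRING attribute
    of every node label, yielding question/Cypher pairs."""
    f = lambda x, y, z: f"Find the {x} whose {y} contains '{z}'!"
    g = lambda x, y, z: f"MATCH (n:{x}) WHERE n.{y} CONTAINS '{z}' RETURN n"
    pairs = []
    for label, props in node_props_types.items():
        for p in props:
            if p["datatype"] == "STRING":
                pairs.append({"question": f(label, p["property"], value),
                              "cypher": g(label, p["property"], value)})
    return pairs
```

Varying `value` over strings sampled from the graph multiplies the N×n count accordingly.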
💡Tip💡
Just because we’re able to build all these queries using a single line of code doesn’t mean that we should incorporate the entire set of examples into our fine-tuning dataset.
Develop a Process and a Template
The main challenge lies in creating a sufficiently varied list of queries that covers a wide range of aspects related to the graph. With both proprietary and open-source LLMs able to generate basic Cypher syntax, our focus can shift to generating queries about the nodes and relationships within the graph, while omitting syntax-specific queries. To gather query examples for conversion into functional form, one could refer to any Cypher language book or explore the Neo4j Cypher documentation site.
In the GitHub repository, there are about 60 types of these queries, which are then applied to the ArXiv knowledge graph. They’re versatile and applicable to any graph schema.
Below is the complete Python function for creating one set of similar queries and incorporating it into the fine-tuning dataset:
def find_nodes_connected_to_node_via_relation():
    def prompter(label_1, prop_1, rel_1, label_2):
        subschema = get_subgraph_schema(jschema, [label_1, label_2], 2, True)
        message = {"Prompt": "Convert the following question into a Cypher query using the provided graph schema!",
                   "Question": f"""For each {label_1}, find the number of {label_2} linked via {rel_1} and retrieve the {prop_1} of the {label_1} and the {label_2} counts in ascending order!""",
                   "Schema": f"Graph schema: {subschema}",
                   "Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m RETURN n.{prop_1} AS {prop_1}, count(m) AS {label_2.lower()}_count ORDER BY {label_2.lower()}_count"
                   }
        return message

    sampler = []
    for e in all_rels:
        for k, v in e[1].items():
            temp_dict = prompter(e[0], k, e[2], e[3])
            sampler.append(temp_dict)
    return sampler
- the function find_nodes_connected_to_node_via_relation() takes the generating prompter and evaluates it for all the elements in all_rels, which is the collection of extracted and processed relationship instances, whose entries are of the form:
['Keyword',
{'name': 'logarithms', 'key_id': '720452e14ca2e4e07b76fa5a9bc0b5f6'},
'HAS_TOPIC',
'Topic',
{'cluster': 0}]
- the prompter inputs are two node labels denoted label_1 and label_2, the property prop_1 for label_1, and the relationship rel_1,
- the message contains the components of the prompt for the corresponding entry in the fine-tuning dataset,
- the subschema extracts the first neighbors of the two nodes denoted label_1 and label_2; this means: the two nodes listed, all the nodes related to them (distance one in the graph), the relationships and all the corresponding attributes.
💡Tip💡
Including the subschema in the fine-tuning dataset is not essential, although the more closely the prompt aligns with the fine-tuning data, the better the generated output tends to be. From my perspective, incorporating the subschema in the fine-tuning data still offers benefits.
To summarize, this post has explored various methods for building a fine-tuning dataset for generating Cypher queries from text. Here’s a breakdown of these techniques, along with their benefits and drawbacks:
LLM generated question and Cypher statement pairs:
- The method may appear straightforward in terms of data collection, yet it often demands excessive data cleaning.
- While certain proprietary LLMs yield good outcomes, many open source LLMs still lack the proficiency to generate a wide range of accurate Cypher statements.
- This technique becomes burdensome when the graph schema is complex.
Functional approach or parametric query generation:
- This method is adaptable across various graph schemas and allows for straightforward scaling of the sample size. However, it is important to ensure that the data doesn’t become overly repetitive and maintains diversity.
- It requires a significant amount of Python programming. The queries generated can often seem mechanical and may lack a conversational tone.
To expand beyond these approaches:
- The graph schema can be seamlessly incorporated into the framework for creating the functional queries. Consider the following question, Cypher statement pair:
Question: Which articles were written by the author whose last name is Doe?
Cypher: "MATCH (a:Article) -[:WRITTEN_BY]-> (:Author {last_name: 'Doe'}) RETURN a"
Instead of using a direct parametrization, we could incorporate basic parsing (such as replacing WRITTEN_BY with written by), enhancing the naturalness of the generated question.
This highlights the importance of the graph schema’s design and the labelling of the graph’s entities in the construction of the fine-tuning pairs. Adhering to general norms like using nouns for node labels and suggestive verbs for the relationships proves helpful and can create a more organically conversational link between the elements.
- Finally, it’s crucial not to overlook the value of collecting actual user generated queries from graph interactions. When available, parametrizing these queries or enhancing them through other methods can be very useful. Ultimately, the effectiveness of this method relies on the specific objectives for which the graph has been designed.
To this end, it is important to mention that my focus was on simpler Cypher queries. I didn’t address creating or modifying data within the graph, or the graph schema, nor did I include APOC queries.
Are there another methods or ideas you would possibly suggest for generating such fine-tuning query and Cypher statement pairs?
Code
Github Repository: Knowledge_Graphs_Collection — for building the ArXiv knowledge graph
Github Repository: Cypher_Generator — for all the code related to this blog post
Data
• Repository of scholarly articles: arXiv Dataset, which has a CC0: Public Domain license.