Dynamic Prompting with LangChain Expression Language
On prompting strategies for a Neo4j RAG application
I recently ran an experiment to build a RAG application that lets an LLM chat with a graph database such as Neo4j. LangChain provides a framework for connecting to Neo4j, so I chose it. However, navigating the huge number of articles about LangChain can get confusing at many stages, and prompt templates are one of them. Although LangChain is a powerful framework for building such applications, it often becomes troublesome, especially when the code or process breaks and you have no idea where to start debugging. One of the tools I was interested in exploring was the LangChain Expression Language (LCEL). In this tutorial, I would like to give you some tips on creating dynamic prompts through the use of chains and semantic similarity with LCEL.
At a very high level, this experiment was geared to perform the following task:
User Prompt → Vector Search → Generate Template → Graph Query
As always, a good prompting strategy is key to good retrieval. My goal in this experiment was to use vector search and other custom tools to build a context around which to generate a prompt template. Below is a simple chain for my workflow.
chain = vector_search_tool | CYPHER_GENERATION_PROMPT | Graph_QA_Chain
The function ‘vector_search_tool’ invokes a Neo4jVector retriever chain and outputs a list of nodes. These nodes become a context around which the graph query is performed. In this blog, we will limit the chain operation up to the CYPHER_GENERATION_PROMPT only.
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided nodes, relationship and properties from the schema provided below:
{schema}
Do not use any other relationship types or properties that are not provided.
You may use context from the vector search to generate cypher statements and perform the query against the graph.
Here are the contexts:
{context}
Examples of Cypher Statement for graph query:
# What are the articles connected to the given titles?
MATCH (a:Article)-[:HAS_TITLE]->(t:Title)
WHERE a.article_id = 'W2036181141'
RETURN a.article_id, a.title
# Find articles for the given title and list their authors and institutions?
MATCH (a:Article)-[:HAS_TITLE]->(t:Title)
WHERE a.article_id = 'W2036181141'
MATCH (a)-[:WRITTEN_BY]->(au:Author)-[:AFFILLIATED_TO]->(inst:Institution)
RETURN a.title, au.author_names, inst.institution_name
The question is:
{question}
"""
As you can see in the template above, there are multiple variables to be supplied at invoke time: ‘schema’, ‘context’, and ‘question’. Among these, the schema is a dynamic variable that is extracted from a graph database object. In a graph RAG application, it is beneficial to ground the query in the graph schema to avoid hallucinations.
An example of a graph schema extracted dynamically from a Neo4jGraph object is shown below. I will not go into the details of Neo4j in this blog (I will post a much more detailed RAG workflow for chatting with Neo4j using LangChain, including agents). In case you are curious what a graph schema looks like:
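A quick way to double-check which variables a template expects is to list its placeholders. The sketch below uses only Python's standard library on a shortened stand-in for the template above; `placeholder_names` is a hypothetical helper for illustration, not a LangChain function.

```python
from string import Formatter

# Shortened stand-in for the template above; only the placeholders matter here.
template = (
    "Schema:\n{schema}\n"
    "Here are the contexts:\n{context}\n"
    "The question is:\n{question}\n"
)

def placeholder_names(text):
    """Return the distinct {placeholder} names in order of first appearance."""
    names = []
    for _literal, field, _spec, _conversion in Formatter().parse(text):
        if field and field not in names:
            names.append(field)
    return names

print(placeholder_names(template))
```

Every name in that list has to be produced by some step of the chain before the prompt can be rendered.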
schema = graph.get_schema  # graph is a Neo4jGraph instance
Node properties:
Article {journal_id: STRING, article_id: STRING, publication_year: INTEGER, ...}
Journal {journal_id: STRING, journal_name: STRING, ......... }
Year {publication_year: INTEGER, year_id: STRING,...}
....
The relationships:
(:Article)-[:YEAR_PUBLISHED]->(:Year)
(:Article)-[:PUBLISHED_IN]->(:Journal)
(:Article)-[:HAS_TITLE]->(:Title)
....
prompt_chain = (
    {
        "context": vector_search_tool,
        "question": RunnablePassthrough(),
        "schema": ????,
    }
    | CYPHER_GENERATION_PROMPT
)
The context variable is created by the vector similarity search after the chain is invoked with a user question. However, supplying the schema variable, which is extracted dynamically from a Neo4jGraph object, can be tricky. If you are a newbie like me, you might be overwhelmed by the examples in the LangChain documentation without finding a clear explanation of how to pass arguments at invoke time. Most of the examples available on the internet show a chain being invoked in the format below:
chain.invoke('who is the author of an article?')
This left me puzzled: how or where do I supply the schema? One could be tempted to supply the schema like this:
prompt_chain = (
    {
        "context": vector_search_tool,
        "question": RunnablePassthrough(),
        "schema": graph.get_schema,  # graph is a Neo4jGraph instance; this is a plain value, not a runnable
    }
    | CYPHER_GENERATION_PROMPT
)
However, LCEL does not accept this, because every value in the dictionary must be a runnable (or something LCEL can coerce into one, such as a function). In this case, I specifically wanted the schema to be read at run time, because the graph schema may change over time and it is important to use the updated schema. After embarrassingly spending several hours, if not days, I figured out that one can pass a dictionary of arguments during the invoke step.
args = {
    "question": "find articles about oxidative stress",
    "graph": graph
}
prompt_chain.invoke(args)
This seems trivial, but it can easily take you down a rabbit hole, especially when the internet is flooded with examples that are too simple to apply to real-world problems. Anyway, now we can design our prompt templates as needed. Note that when arguments are passed as a dictionary, the functions within the chain must be adjusted so that each one receives the correct arguments. We can define custom functions that parse the arguments into the correct format so that the messages pass smoothly through the chain.
For example:
from langchain_community.graphs import Neo4jGraph
from langchain_core.runnables import RunnablePassthrough

def vector_search_tool(args):
    # Run the vector retriever chain with the user's question
    chain_result = vector_graph_chain.invoke({"query": args["question"]})
    # Post-process chain_result here as needed before returning it
    return chain_result

def arg_parser(args):
    # Pull the question out of the arguments dictionary
    return args["question"]

def get_graph_schema(args):
    # Read the schema at invoke time so that it is always up to date
    return args["graph"].get_schema

prompt_chain = (
    {
        "context": vector_search_tool,
        "question": arg_parser | RunnablePassthrough(),
        "schema": get_graph_schema,
    }
    | CYPHER_GENERATION_PROMPT
)

graph = Neo4jGraph(
    url=AURA_CONNECTION_URI,
    username="neo4j",
    password="*************"
)

args = {
    "question": "find articles about oxidative stress",
    "graph": graph
}

prompt_result = prompt_chain.invoke(args)
print(prompt_result.text)
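To see why wrapping the schema lookup in a function gives a fresh schema on every call, here is a plain-Python analogy of the dictionary step in the chain. It does not use LangChain at all: `run_parallel` and `FakeGraph` are hypothetical stand-ins, and the sketch assumes that every entry of the dictionary receives the same input that was passed to invoke.

```python
def run_parallel(steps, args):
    """Call every step with the same input and collect the results by key."""
    return {key: step(args) for key, step in steps.items()}

class FakeGraph:
    """Hypothetical stand-in for a Neo4jGraph instance."""
    def __init__(self, schema):
        self.get_schema = schema  # a plain attribute here, for simplicity

steps = {
    "context": lambda args: ["W2036181141"],          # stand-in for the vector search
    "question": lambda args: args["question"],        # same role as arg_parser
    "schema": lambda args: args["graph"].get_schema,  # looked up at call time
}

graph = FakeGraph("(:Article)-[:HAS_TITLE]->(:Title)")
args = {"question": "find articles about oxidative stress", "graph": graph}

first = run_parallel(steps, args)
graph.get_schema = "(:Article)-[:PUBLISHED_IN]->(:Journal)"  # the schema changes
second = run_parallel(steps, args)

print(first["schema"])
print(second["schema"])
```

Because the lookup happens inside the function, the second call sees the updated schema without the chain being rebuilt.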
An example of the generated prompt:
Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided nodes, relationship and properties from the schema provided below:
Node properties:
Article {journal_id: STRING, article_id: STRING, publication_year: INTEGER, journal_name: STRING, abstract: STRING, title: STRING, doi: STRING, topics: STRING, is_retracted: BOOLEAN, citation_count: INTEGER, twitter: FLOAT, in_citations: LIST, reddit: FLOAT, funders: LIST, title_vector: LIST, embedding_vector: LIST, embedding_vectors: LIST, title_vectors: LIST}
Journal {journal_id: STRING, journal_name: STRING, h_index: FLOAT, issn: LIST, sjr_best_quartile: STRING, sjr_score: FLOAT}
Year {publication_year: INTEGER, year_id: STRING}
Author {author_id: STRING, institution_name: STRING, author_names: STRING, institution_id: STRING}
Institution {institution_name: STRING, institution_id: STRING, country: STRING, institution_type: STRING, cited_by_count: FLOAT, city: STRING, institution_country_code: STRING, homepage_url: STRING, latitude: FLOAT, works_count: FLOAT, associated_institution: LIST, longitude: FLOAT}
Funder {h_index: INTEGER, funder_name: STRING, funder_id: STRING, homepage: STRING, country_code: STRING, grants_count: INTEGER, i10_index: INTEGER}
Country {country_name: STRING, country_id: STRING}
Abstract {text: STRING, abstract_id: STRING, abstract_vector: LIST}
Title {article_id: STRING, title_vector: LIST, title_id: STRING, text: STRING}
Topic {text: STRING, topic_id: STRING}
Relationship properties:
The relationships:
(:Article)-[:YEAR_PUBLISHED]->(:Year)
(:Article)-[:PUBLISHED_IN]->(:Journal)
(:Article)-[:HAS_TITLE]->(:Title)
(:Article)-[:HAS_ABSTRACT]->(:Abstract)
(:Article)-[:HAS_TOPIC]->(:Topic)
(:Article)-[:FUNDED_BY]->(:Funder)
(:Author)-[:WRITTEN_BY]->(:Article)
(:Author)-[:AFFILLIATED_TO]->(:Institution)
(:Institution)-[:IS_FROM]->(:Country)
(:Funder)-[:LOCATED_IN]->(:Country)
Do not use any other relationship types or properties that are not provided.
Do not return all node properties at once until specifically asked, only return the properties that are
relevant to the query.
If no properties specified, return node label, title and node id.
You may use context from the vector search to generate cypher statements and perform the query against the graph.
Some examples of contexts:
article_id: 'W2036181141', 'W3119010472'
author_id: 'A5021715023', 'A5006375773' etc..
Here are the contexts:
['W2036181141', 'W3119010472', 'W1997945482', 'W2117825019', 'W2127697129']
Examples of Cypher Statement for graph query:
# What are the articles connected to the given titles?
MATCH (a:Article)-[:HAS_TITLE]->(t:Title)
WHERE a.article_id = 'W2036181141'
RETURN a.article_id, a.title
# Who wrote the articles with the given titles?
MATCH (a:Article)-[:HAS_TITLE]->(t:Title)
WHERE a.article_id = 'W2036181141'
MATCH (a)-[:WRITTEN_BY]->(au:Author)
RETURN au.author_names
# Find articles for the given title and list their authors and institutions?
MATCH (a:Article)-[:HAS_TITLE]->(t:Title)
WHERE a.article_id = 'W2036181141'
MATCH (a)-[:WRITTEN_BY]->(au:Author)-[:AFFILLIATED_TO]->(inst:Institution)
RETURN a.title, au.author_names, inst.institution_name
The question is:
find articles about oxidative stress
Dynamic Prompting
When querying a graph database, the same result can be obtained from different Cypher statements, so we need to supply as many example Cypher statements as possible in our prompt library. This means that, in production, one may store hundreds if not thousands of prompt examples as the nature of the queries varies over time. However, feeding all of the examples into the prompt will exceed the model's token limit, and many of the examples given to the LLM for reasoning will be irrelevant. This may result in the LLM responding with answers that are not relevant. Instead, we can use semantic similarity to dynamically select, at invoke time, the few examples that are most relevant to the user's question. In this way, we supply only the examples that are semantically closest to the question. LangChain has some useful classes to achieve this:
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.example_selectors import SemanticSimilarityExampleSelector, MaxMarginalRelevanceExampleSelector
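To get a feel for what selecting by semantic similarity means before using the LangChain classes, here is a toy sketch with no dependencies. It assumes a crude word-count "embedding" in place of a real embedding model, and `embed`, `cosine`, and `pick_examples` are hypothetical helpers; the example questions and queries are taken from this post.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_examples(question, library, k=1):
    """Return the k library entries whose questions are closest to the input."""
    q_vec = embed(question)
    ranked = sorted(
        library,
        key=lambda ex: cosine(q_vec, embed(ex["question"])),
        reverse=True,
    )
    return ranked[:k]

library = [
    {"question": "What are the articles connected to the given titles?",
     "query": "MATCH (a:Article)-[:HAS_TITLE]->(t:Title) WHERE a.article_id = 'W2036181141' RETURN a.article_id, a.title"},
    {"question": "What are the top cited articles published after a specific year, e.g 2010?",
     "query": "MATCH (a:Article) WHERE a.publication_year > 2010 WITH a ORDER BY a.citation_count DESC RETURN a.title, a.citation_count LIMIT 10"},
    {"question": "Which articles were funded by a specific funder, e.g National Science Foundation?",
     "query": "MATCH (f:Funder)-[:FUNDED_BY]-(a:Article) WHERE f.funder_name = 'National Science Foundation' RETURN a.title"},
]

picked = pick_examples("Find top cited articles after 2015", library, k=1)
print(picked[0]["question"])
```

A real embedding model captures meaning rather than shared words, but the selection logic (embed, score, keep the top k) follows the same outline.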
Some example question/query pairs to select from:
examples = [
    {
        "question": "What are the articles connected to the given titles?",
        "query": "MATCH (a:Article)-[:HAS_TITLE]->(t:Title) WHERE a.article_id = 'W2036181141' RETURN a.article_id, a.title",
    },
    {
        "question": "Find author of the articles that are connected to the given titles?",
        "query": "MATCH (a:Article)-[:HAS_TITLE]->(t:Title) WHERE a.article_id = 'W2036181141' MATCH (a)-[:WRITTEN_BY]->(au:Author) RETURN au.author_names",
    },
    {
        "question": "Find articles for the given title and list their authors and institutions?",
        "query": "MATCH (a:Article)-[:HAS_TITLE]->(t:Title) WHERE a.article_id = 'W2036181141' MATCH (a)-[:WRITTEN_BY]->(au:Author)-[:AFFILLIATED_TO]-(inst:Institution) RETURN a.title, au.author_names, inst.institution_name",
    },
    # ... more examples ...
]
We can use semantic similarity search or max marginal relevance (MMR) search to select relevant examples. Since our prompt library is tiny compared to a typical document collection, we can use a vector database such as Chroma to vectorize the examples on the fly.
from langchain_community.vectorstores import Chroma

example_selector = MaxMarginalRelevanceExampleSelector.from_examples(
    examples=examples,
    embeddings=EMBEDDING_MODEL,  # any LangChain embeddings object
    vectorstore_cls=Chroma,
    k=3,
)
example_selector.select_examples({"question": "Find articles with a title"})
Result:
[{'query': "MATCH (a:Article)-[:HAS_TITLE]->(t:Title) WHERE a.article_id = 'W2036181141' RETURN a.article_id, a.title",
'question': 'What are the articles connected to the given titles?'},
{'query': "MATCH (a:Article) WHERE a.publication_year > 2010 WITH a ORDER BY a.citation_count DESC RETURN a.title, a.citation_count LIMIT 10",
'question': 'What are the top cited articles published after a specific year, e.g 2010?'},
{'query': "MATCH (i:Institution)-[:IS_FROM]-(c:Country) WHERE c.country_name='Japan' WITH i, c MATCH (i)-[:AFFILLIATED_TO]-(au:Author)-[:WRITTEN_BY]-(a:Article) WITH a, au, c ORDER BY a.citation_count DESC RETURN DISTINCT(a.title) AS Title, a.citation_count AS Citations, COLLECT(au.author_names) LIMIT 10",
'question': 'List the top cited articles with authors from institution from a specific country, e.g Japan'}
]
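Max marginal relevance differs from plain similarity search in that it also penalizes candidates that are too similar to ones already picked, which helps explain why the selected examples can be quite varied. Below is a minimal sketch of the idea, assuming made-up 2-D vectors in place of real embeddings and a hypothetical `mmr` helper; LangChain's own implementation may differ in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy max marginal relevance over (label, vector) pairs."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [label for label, _ in selected]

# Made-up vectors purely for illustration: the first two are near-duplicates.
query = (1.0, 0.0)
candidates = [
    ("What are the top 10 cited articles by a particular journal", (1.0, 0.10)),
    ("What are the top 10 cited articles authored by authors affiliated with an institution?", (1.0, 0.12)),
    ("Which articles were funded by a specific funder, e.g National Science Foundation?", (0.6, -0.8)),
]

diverse = mmr(query, candidates, k=2, lambda_mult=0.5)
plain = mmr(query, candidates, k=2, lambda_mult=1.0)  # no redundancy penalty
print(diverse)
print(plain)
```

With the penalty switched off (`lambda_mult=1.0`) the two near-duplicates win; with it on, the second pick is the less similar but more different candidate.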
We then supply this example_selector instance as an argument into the langchain FewShotPromptTemplate class.
# Configure a formatter
example_prompt = PromptTemplate(
    input_variables=["question", "query"],
    template="Question: {question}\nCypher query: {query}"
)

prefix = """
You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
Here is the schema information:
Schema
Below are the few examples of questions and their corresponding Cypher queries:
"""

prompts = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix=prefix,  # or a shorter prefix such as "Examples:"
    suffix="Question: {question}",
    input_variables=["question"],
)

# Check the results
print(prompts.format(question="Find top 10 most cited articles"))
Result:
You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
Here is the schema information:
Schema
Below are the few examples of questions and their corresponding Cypher queries:
Question: What are the top 10 cited articles by a particular journal
Cypher query: MATCH (j:Journal)-[:PUBLISHED_IN]-(a:Article) WHERE j.journal_name='The Plant Journal' WITH a ORDER BY a.citation_count DESC RETURN a.title LIMIT 10
Question: What are the top cited articles published after a specific year, e.g 2010?
Cypher query: MATCH (a:Article) WHERE a.publication_year > 2010 WITH a ORDER BY a.citation_count DESC RETURN a.title, a.citation_count LIMIT 10
Question: What are the top 10 cited articles authored by authors affiliated with an institution?
Cypher query: MATCH (i:Institution)-[:AFFILLIATED_TO]-(au:Author)-[:WRITTEN_BY]-(a:Article) WHERE i.institution_name='University of California, Santa Barbara' WITH a ORDER BY a.citation_count DESC RETURN a.title, a.citation_count LIMIT 10
Question: Find top 10 most cited articles
Important Note:
When creating examples of Cypher queries, take care with the format. I encountered a KeyError when using the following format in some of the Cypher examples:
MATCH (j:Journal {journal_name: 'foo'}) RETURN j
KeyError: journal_name not found...
My guess is that FewShotPromptTemplate does not accept curly braces { } inside the example strings, because they are used to mark input variables to be supplied in the chain.
It was very hard to debug this error! (This type of debugging is difficult with LangChain.)
To overcome this error, the example Cypher statement above has to be rewritten as:
MATCH (j:Journal) WHERE j.journal_name ='foo' RETURN j
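The guess about the braces can be reproduced with plain Python string formatting. The sketch below is not LangChain's code: it assumes the few-shot prompt is assembled in two steps (render each example, then format the assembled text once more), and `build_prompt` is a hypothetical helper. It also shows that doubling the braces is another way around the error, at least with `str.format`.

```python
example_query = "MATCH (j:Journal {journal_name: 'foo'}) RETURN j"

def build_prompt(query, question):
    """Render the example first, then format the assembled template again."""
    rendered_example = "Cypher query: {query}".format(query=query)
    template = "Examples:\n" + rendered_example + "\nQuestion: {question}"
    return template.format(question=question)

# Single braces: the second formatting pass treats {journal_name: ...} as a variable.
try:
    build_prompt(example_query, "Find a journal")
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
print("Missing key:", missing_key)

# Doubled braces survive the second pass and come out as literal braces.
escaped_query = example_query.replace("{", "{{").replace("}", "}}")
final_prompt = build_prompt(escaped_query, "Find a journal")
print(final_prompt)
```

Rewriting the query with a WHERE clause, as above, avoids the problem entirely and keeps the stored examples readable.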
Finally, we want to include the FewShotPromptTemplate step in our chain as follows:
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided nodes, relationship and properties from the schema provided below:
{schema}
Do not use any other relationship types or properties that are not provided.
You may use context from the vector search to generate cypher statements and perform the query against the graph.
Here are the contexts:
{context}
Examples of Cypher Statements are given below:
{example_selector}
The question is:
{question}
"""
CYPHER_GENERATION_PROMPT = PromptTemplate(
input_variables=["context", "question","schema", "example_selector"], template=CYPHER_GENERATION_TEMPLATE
)
# Define a function that invokes the FewShotPromptTemplate.
# It must exist before the chain is built, and it gets its own name so that it
# does not shadow the example_selector instance used by `prompts`.
def select_examples(args):
    prompt_results = prompts.invoke({"question": args["question"]})
    return prompt_results.text

prompt_chain = (
    {
        "context": vector_search_tool,
        "question": arg_parser | RunnablePassthrough(),
        "schema": get_graph_schema,
        "example_selector": select_examples,
    }
    | CYPHER_GENERATION_PROMPT
)

args = {
    "question": "find articles about oxidative stress",
    "graph": graph
}

# Execute the chain
final_prompt = prompt_chain.invoke(args)

# Check the results:
print(final_prompt.text)
Final Results:
Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided nodes, relationship and properties from the schema provided below:
Node properties:
Article {journal_id: STRING, article_id: STRING, publication_year: INTEGER, journal_name: STRING, abstract: STRING, title: STRING, doi: STRING, topics: STRING, is_retracted: BOOLEAN, citation_count: INTEGER, twitter: FLOAT, in_citations: LIST, reddit: FLOAT, funders: LIST, title_vector: LIST, embedding_vector: LIST, embedding_vectors: LIST, title_vectors: LIST}
Journal {journal_id: STRING, journal_name: STRING, h_index: FLOAT, issn: LIST, sjr_best_quartile: STRING, sjr_score: FLOAT}
Year {publication_year: INTEGER, year_id: STRING}
Author {author_id: STRING, institution_name: STRING, author_names: STRING, institution_id: STRING}
Institution {institution_name: STRING, institution_id: STRING, country: STRING, institution_type: STRING, cited_by_count: FLOAT, city: STRING, institution_country_code: STRING, homepage_url: STRING, latitude: FLOAT, works_count: FLOAT, associated_institution: LIST, longitude: FLOAT}
Funder {h_index: INTEGER, funder_name: STRING, funder_id: STRING, homepage: STRING, country_code: STRING, grants_count: INTEGER, i10_index: INTEGER}
Country {country_name: STRING, country_id: STRING}
Abstract {text: STRING, abstract_id: STRING, abstract_vector: LIST}
Title {article_id: STRING, title_vector: LIST, title_id: STRING, text: STRING}
Chunk {id: STRING, embedding: LIST, text: STRING, question: STRING, query: STRING}
Topic {text: STRING, topic_id: STRING}
Relationship properties:
The relationships:
(:Article)-[:YEAR_PUBLISHED]->(:Year)
(:Article)-[:PUBLISHED_IN]->(:Journal)
(:Article)-[:HAS_TITLE]->(:Title)
(:Article)-[:HAS_ABSTRACT]->(:Abstract)
(:Article)-[:HAS_TOPIC]->(:Topic)
(:Article)-[:FUNDED_BY]->(:Funder)
(:Author)-[:WRITTEN_BY]->(:Article)
(:Author)-[:AFFILLIATED_TO]->(:Institution)
(:Institution)-[:IS_FROM]->(:Country)
(:Funder)-[:LOCATED_IN]->(:Country)
Do not use any other relationship types or properties that are not provided.
You may use context from the vector search to generate cypher statements and perform the query against the graph.
Here are the contexts:
['W2036181141', 'W3119010472', 'W1997945482', 'W2117825019', 'W2127697129']
Examples of Cypher Statements are given below:
Examples:
Question: What are the top 10 cited articles by a particular journal
Cypher query: MATCH (j:Journal)-[:PUBLISHED_IN]-(a:Article) WHERE j.journal_name='The Plant Journal' WITH a ORDER BY a.citation_count DESC RETURN a.title LIMIT 10
Question: Which articles were funded by a specific funder, e.g National Science Foundation?
Cypher query: MATCH (f:Funder)-[:FUNDED_BY]-(a:Article) WHERE f.funder_name = 'National Science Foundation' RETURN a.title
Question: What are the articles authored by authors affiliated with a University of Sheffield?
Cypher query: MATCH (i:Institution)<-[:AFFILLIATED_TO]-(au:Author)<-[:WRITTEN_BY]->(a:Article) WHERE i.institution_name = 'University of Sheffield' RETURN a.title
Question: find articles about oxidative stress
The question is:
find articles about oxidative stress
That's it. Now you can design your prompt by passing custom arguments dynamically at run time, using functions or other methods.
I hope this tutorial helps if you are struggling to create a custom prompt for your RAG application.
Thanks for reading!
References:
https://github.com/Coding-Crashkurse/LCEL-Deepdive/blob/main/lcel.ipynb
https://medium.com/@larry_nguyen/langchain-101-lesson-2-example-selectors-37b891ca9268