Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Graph RAG: Retrieval-Augmented Generation with Graph Databases

University of Kansas School of Business

Some data is inherently about relationships. Who trades with whom? Which countries are connected through supply chains? What is the shortest path between two trading partners? Relational databases can answer these questions, but the SQL gets ugly fast — you need self-joins, recursive CTEs, and complex subqueries. Graph databases are purpose-built for this kind of data: they store entities as nodes and connections as edges, and they answer relationship queries in a fraction of the code.

In this notebook we build a Graph RAG pipeline using Kuzu, a lightweight embedded graph database. The pattern mirrors what we did with SQL RAG — extract the schema, let the LLM write a query, execute it, ground the answer on results — but the query language is Cypher instead of SQL, and the data model is a graph instead of tables. The result is a system that excels at questions like “Which African countries export to both the USA and China?” — queries that would require multiple joins in SQL but are a single, readable pattern match in Cypher.

How it works:

  1. Build a graph database from international trade data (countries as nodes, trade flows as edges).

  2. Extract the graph schema (node labels, relationship types, properties) as LLM context.

  3. The LLM generates a Cypher query from a natural-language question.

  4. Execute the Cypher query and feed results back to the LLM for a natural-language answer.

  5. Visualize the retrieved subgraph to see the data structure.

Learning goals:

  • Understand why graph databases are better than SQL for relationship-heavy data

  • Learn the basics of Cypher, the graph query language

  • See how graph schema serves the same role as SQL DDL in DBMS RAG

  • Experience the limitations of LLM-generated Cypher (and how to debug it)

Provider setup: Uses llm_cascade for automatic LLM provider selection.
Store your LLM API keys in Colab Secrets (or a local .env file). Supported keys: GEMINI_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, HF_TOKEN, COHERE_API_KEY, XAI_API_KEY, OLLAMA_API_KEY, OPENROUTER_API_KEY.

Keep the vector-RAG notebooks from Chapters 8–9 and the SQL RAG demo in dbms_rag_sqlite.ipynb in mind as you read — Graph RAG is the third point in the same triangle, swapping vector search or SQL execution for Cypher pattern matching.

!pip install -q -U kuzu networkx matplotlib pandas git+https://github.com/KarAnalytics/llm_cascade.git sentence-transformers

1) Imports and Provider Helpers

We use Kuzu as our embedded graph database — it runs in-process (no server to install) and supports the Cypher query language, the same language used by Neo4j and other production graph databases. For visualization we use NetworkX and Matplotlib to render subgraphs as node-and-edge diagrams. The llm_cascade library handles LLM provider selection automatically, just as it did in the SQL RAG notebook.

import os
import re
from pathlib import Path

import kuzu
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

from llm_cascade import get_cascade

llm = get_cascade()


def generate_text(prompt, system_prompt=None, **kwargs):
    response = llm.generate(prompt, system_prompt=system_prompt)
    return response.text, response.provider


def has_llm_provider():
    return True  # llm_cascade auto-detects available providers


print("Provider ready (llm_cascade)")

2) Build the Graph Database

International trade data is a natural fit for a graph. Each country is a node with properties like name, continent, and population. Each trade flow is a directed edge from exporter to importer, carrying properties like the year and dollar value of exports. This structure makes it trivial to ask questions like “Who does Germany export to?” (follow outgoing edges) or “Which countries are trade hubs connecting two continents?” (find nodes with many incoming and outgoing edges).

We load the data from two CSV files hosted on GitHub:

FileRoleColumns
country_names_simple.csvNode DataID, Country, Continent, Population
edges_trade_2014to17_simple.csvRelationship DataYear, Source, Target, Type, export_val

The code below downloads both files, creates the Kuzu schema (one node table and one relationship table), and loads the data. Notice the parallel to how we built the SQLite database in the previous notebook — the steps are conceptually identical (define schema, load data), just expressed in graph terms instead of relational terms.

import shutil
import os
from pathlib import Path
import pandas as pd # Ensure pandas is imported

DB_FOLDER = "trade_db"
COUNTRY_CSV_URL = "https://raw.githubusercontent.com/KarAnalytics/InternationalTrade2017/main/country_names_simple.csv"
TRADE_CSV_URL = "https://raw.githubusercontent.com/KarAnalytics/InternationalTrade2017/main/edges_trade_2014to17_simple.csv"

LOCAL_COUNTRY_CSV = "country_names_simple.csv"
LOCAL_TRADE_CSV = "edges_trade_2014to17_simple.csv"

# 1. Clean up existing DB for a fresh run
db_path = Path(DB_FOLDER)

# Close any existing connections before attempting to remove files
# This helps release file locks from previous runs
if 'conn' in globals() and conn is not None: # Use globals() to check for existence across cell executions
    try:
        conn.close()
        print("Closed existing Kùzu connection.")
    except Exception as e:
        print(f"Error closing connection: {e}")
if 'db' in globals() and db is not None:
    try:
        db.close()
        print("Closed existing Kùzu database.")
    except Exception as e:
        print(f"Error closing database: {e}")

if db_path.exists():
    if db_path.is_dir():
        print(f"Removing existing database directory: {db_path}")
        shutil.rmtree(db_path)
    elif db_path.is_file():
        print(f"Removing existing database file: {db_path}")
        os.remove(db_path)

# Also explicitly clean up the .wal file if it exists, as Kùzu sometimes leaves it outside the main folder.
wal_file_path = Path(str(db_path) + ".wal")
if wal_file_path.is_file():
    print(f"Removing existing WAL file: {wal_file_path}")
    os.remove(wal_file_path)

# 2. Initialize Kùzu
db = kuzu.Database(DB_FOLDER)
conn = kuzu.Connection(db)

# 3. Create Schema
print("Defining schema...")
conn.execute("CREATE NODE TABLE Country(id STRING, name STRING, continent STRING, population INT64, PRIMARY KEY (id))")
conn.execute("CREATE REL TABLE TRADE(FROM Country TO Country, year INT64, type STRING, export_val DOUBLE)")

# 4. Load Node Data
print(f"Downloading and loading nodes from {COUNTRY_CSV_URL}...")
# Download CSV locally first as Kùzu COPY does not support URLs directly
df_country = pd.read_csv(COUNTRY_CSV_URL)
df_country.to_csv(LOCAL_COUNTRY_CSV, index=False)
conn.execute(f'COPY Country FROM "{LOCAL_COUNTRY_CSV}" (HEADER=TRUE)')
Path(LOCAL_COUNTRY_CSV).unlink() # Clean up local file

# 5. Load Relationship Data
# Note: Our CSV column order is (Year, Source, Target, Type, export_val).
# Kùzu's COPY REL expects the FROM/TO columns to match the relationship definition.
# We'll use pandas to rearrange the columns to (Source, Target, Year, Type, export_val)
# so Kùzu can match 'Source' to 'FROM' and 'Target' to 'TO'.
print(f"Downloading and loading relationships from {TRADE_CSV_URL}...")
df_trade = pd.read_csv(TRADE_CSV_URL)
cols = ['Source', 'Target', 'Year', 'Type', 'export_val']
df_trade[cols].to_csv(LOCAL_TRADE_CSV, index=False)
conn.execute(f'COPY TRADE FROM "{LOCAL_TRADE_CSV}" (HEADER=TRUE)')
Path(LOCAL_TRADE_CSV).unlink() # Clean up local file

print("\nGraph Database ready.")

3) Graph Schema Extraction — the “Retrieval” Step

Just as SQL RAG retrieves CREATE TABLE statements, Graph RAG retrieves the graph schema: what node labels exist, what properties they have, and what relationship types connect them. This schema plays the exact same role as the DDL in DBMS RAG — it is a compact description of the data’s structure that the LLM needs in order to write valid queries.

The key difference is representational. A SQL schema says “there is a table called SHIPMENTS with columns SID, PID, and Quantity.” A graph schema says “there is a relationship type TRADE from Country to Country with properties year, type, and export_val.” Both convey the same kind of information — what entities exist and how they connect — but the graph schema makes the connectivity explicit. The LLM reads this schema and learns: “Ah, countries are connected by TRADE edges, each edge has a year and a dollar value, and I can traverse these edges with Cypher patterns.”

def get_graph_schema(conn):
    """Extract graph schema for LLM context."""
    schema_parts = ["GRAPH SCHEMA:"]

    # Nodes
    schema_parts.append("Nodes:")
    nodes = conn.execute("CALL SHOW_TABLES() RETURN *").get_as_df()
    node_tables = nodes[nodes['type'] == 'NODE']['name'].tolist()
    for nt in node_tables:
        props = conn.execute(f"CALL TABLE_INFO('{nt}') RETURN *").get_as_df()
        prop_desc = ", ".join([f"{r['name']} ({r['type']})" for _, r in props.iterrows()])
        schema_parts.append(f"  - Label: {nt} | Properties: {prop_desc}")

    # Relationships
    schema_parts.append("Relationships:")
    rels = nodes[nodes['type'] == 'REL']['name'].tolist()
    for rt in rels:
        props = conn.execute(f"CALL TABLE_INFO('{rt}') RETURN *").get_as_df()
        prop_desc = ", ".join([f"{r['name']} ({r['type']})" for _, r in props.iterrows()])
        # Get connectivity
        # Kùzu SHOW_TABLES output for REL tables doesn't directly show FROM/TO in some versions,
        # but we know it's Country to Country here. For a dynamic demo, we'd query it.
        schema_parts.append(f"  - Type: {rt} | Properties: {prop_desc} | Connections: (Country)-[:{rt}]->(Country)")

    return "\n".join(schema_parts)

schema_context = get_graph_schema(conn)
print(schema_context)

4) Text-to-Cypher RAG Pipeline

Now we build the heart of Graph RAG: translating natural-language questions into Cypher queries. Cypher is to graph databases what SQL is to relational databases — a declarative query language — but its syntax is designed around pattern matching. Instead of writing SELECT ... FROM ... JOIN ... ON ..., you write patterns like (a:Country)-[:TRADE]->(b:Country) that visually resemble the graph structure you are searching for. This makes Cypher particularly intuitive for relationship queries: “find all countries that Germany exports to” becomes MATCH (deu:Country {id:'deu'})-[:TRADE]->(partner:Country) RETURN partner.name.

The pipeline follows the same two-call pattern as SQL RAG: the first LLM call generates the Cypher query, we execute it against Kuzu, and the second LLM call turns the raw results into a natural-language answer. The extract_cypher helper strips any markdown formatting the LLM might add, and we cap results at 50 rows to keep the final prompt manageable.

CYPHER_SYSTEM_PROMPT = (
    "You are a Cypher query expert. Given the graph schema below, write a read-only "
    "Cypher query to answer the user's question.\n"
    "- Output ONLY the raw Cypher query code.\n"
    "- Do not include markdown fences, triple backticks, or any explanation.\n"
    "- Return a maximum of 50 relevant nodes or relationships to keep context manageable.\n"
    "- If the question cannot be answered, return: MATCH (n) RETURN 'NOT_ANSWERABLE' AS result LIMIT 1;"
)

def extract_cypher(raw_text):
    # Strip markdown fences if any
    match = re.search(r"```(?:cypher)?\s*\n?(.*?)```", raw_text, re.DOTALL | re.IGNORECASE)
    if match: return match.group(1).strip()
    return raw_text.strip().rstrip(";")

def execute_cypher(conn, query):
    try:
        # Execute and get results
        res = conn.execute(query)
        cols = res.get_column_names()
        rows = []
        while res.has_next():
            rows.append(res.get_next())
        return cols, rows, None
    except Exception as e:
        return [], [], str(e)

def answer_with_graph_rag(question, conn):
    schema = get_graph_schema(conn)
    prompt = f"SCHEMA:\n{schema}\n\nQUESTION:\n{question}"

    # 1. Generate Cypher
    raw_cypher, prov_c = generate_text(prompt, system_prompt=CYPHER_SYSTEM_PROMPT)
    query = extract_cypher(raw_cypher)

    print(f"  [Cypher Generated]: {query}")

    # 2. Execute
    cols, rows, error = execute_cypher(conn, query)
    if error: return f"Query Error: {error}", query

    # 3. Grounded Answer
    results_text = f"Columns: {cols}\nRows: {rows[:10]} (truncated)"
    ans_prompt = f"QUESTION: {question}\n\nCYPHER QUERY: {query}\n\nRESULTS: {results_text}"
    answer, prov_a = generate_text(ans_prompt, system_prompt="Answer concisely based ONLY on the graph results.")
    return answer, query

5) Graph Visualization

One of the most compelling advantages of graph databases is that their data has a natural visual representation. Nodes are circles, edges are arrows, and you can literally see the structure of your data. The function below queries the immediate trade partners of a given country and renders the result as a directed graph using NetworkX. Edge labels show export values in dollars, making it easy to spot which trade relationships are the most significant. This kind of visualization is much harder to produce from relational data, where the connections are implicit in foreign keys rather than explicit in the data model.

def visualize_subgraph(conn, target_country_id, depth=1):
    """Visualize immediate trade partners of a country."""
    query = f"""
    MATCH (s:Country {{id: '{target_country_id}'}})-[r:TRADE]->(t:Country)
    RETURN s.name, t.name, r.export_val, r.year
    LIMIT 20
    """
    _, rows, _ = execute_cypher(conn, query)

    G = nx.DiGraph()
    for row in rows:
        s_name, t_name, val, year = row
        G.add_edge(s_name, t_name, weight=val, label=f"{year}")

    plt.figure(figsize=(10, 6))
    pos = nx.spring_layout(G, k=0.5)
    nx.draw(G, pos, with_labels=True, node_color='skyblue', node_size=2000, font_size=10, arrowsize=20)

    # Edge labels showing a bit of info
    edge_labels = {(u, v): f'${d["weight"]:,.0f}' for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=8)

    plt.title(f"Trade Connections for {target_country_id} (Sample)")
    plt.show()

6) Run Examples

Let us put the pipeline to work. We ask the LLM a question that requires traversing the graph — finding Germany’s top trade partners by export value in 2014. The pipeline will generate a Cypher query with ORDER BY and LIMIT clauses, execute it, and return a grounded answer. We also visualize Germany’s trade subgraph so you can see the network of connections firsthand.

if not has_llm_provider():
    print("Configuration Error: Please set API keys.")
else:
    q1 = "Who were the top 3 trade partners of Germany (deu) in 2014 by export value?"
    print(f"\nQuestion: {q1}")
    ans, cypher = answer_with_graph_rag(q1, conn)
    print(f"Answer: {ans}")

    print("\nVisualizing subgraph for Germany...")
    visualize_subgraph(conn, "deu")

When Graph RAG Gets It Wrong

Graph RAG is not foolproof — and this section demonstrates exactly where it breaks down. The LLM must generate syntactically valid Cypher and use the correct property values. If the LLM writes {name: 'USA'} when the database stores 'United States', the query returns zero results even though the data is there. This is a fundamental challenge of text-to-query RAG: the LLM is guessing at data values it has never seen. Including sample data in the schema context (as we did in SQL RAG) can help, but it does not guarantee correctness. The next two queries illustrate this problem.

q2 = "Find countries in Africa that exported to both the USA (usa) and China (chn) in 2017."
print(f"\nQuestion: {q2}")
ans, cypher = answer_with_graph_rag(q2, conn)
print(f"Answer: {ans}")

The query above returned zero results, claiming no African countries export to both the USA and China. That is clearly wrong — we know from common sense that many African nations trade with both superpowers. The problem is that the LLM used {name: 'USA'} and {name: 'China'} in the Cypher query, but the database stores these countries as 'United States' and 'China' (or uses ISO codes). A simple name mismatch caused the entire query to silently return nothing. This is a critical lesson: zero results does not mean the answer is zero — it may mean the query is wrong.

q3 = "Which countries in Africa imported from USA (usa) in 2014."
print(f"\nQuestion: {q3}")
ans, cypher = answer_with_graph_rag(q3, conn)
print(f"Answer: {ans}")

Debugging: Writing Cypher by Hand

The results above looked suspicious, so let us do what any good analyst would do: write the queries ourselves and see what the data actually says. The cells below use the correct country name ('United States' instead of 'USA') and confirm that the data is there — 49 African countries imported from the US in 2014, and 48 African countries exported to both the US and China in 2017. The LLM’s Cypher was structurally correct but used the wrong property values, a common and instructive failure mode.

### Query: Which countries in Africa imported from USA (usa) in 2014.

# The Cypher query adjusted for your database schema:
# - Node labels are 'Country'
# - The property for the country name is 'name'
# - The relationship is 'TRADE'
query = """
MATCH (usa:Country)-[t:TRADE]->(af:Country)
WHERE usa.name = 'United States'
  AND af.continent = 'Africa'
  AND t.year = 2014
  AND t.export_val > 0
RETURN af.name AS countryName ;
"""

# Execute the query and return the results as a Pandas DataFrame
# This will display as a formatted table in the notebook output
results_df = conn.execute(query).get_as_df()
print(results_df.head())
print(results_df.shape)

### USA exports to almost all African countries each year, this is something I knew in advance.
### QUERY: Find countries in Africa that exported to both the USA (usa) and China (chn) in 2017.

# The Cypher query adjusted for your database schema:
# - Node labels are 'Country'
# - The property for the country name is 'name'
# - The relationship is 'TRADE'
query = """
MATCH (usa:Country {name: 'United States'}), (chn:Country {name: 'China'})
MATCH (af:Country {continent: 'Africa'})-[t1:TRADE {year: 2017}]->(usa)
MATCH (af)-[t2:TRADE {year: 2017}]->(chn)
WHERE t1.export_val > 0 AND t2.export_val > 0
RETURN af.name AS countryName
"""

# Execute the query and return the results as a Pandas DataFrame
# This will display as a formatted table in the notebook output
results_df = conn.execute(query).get_as_df()
print(results_df.head())
print(results_df.shape)

### Almost every African country exports to USA as well as China each year, this is something I knew in advance.

7) Reflection and Key Takeaways

Why graphs? Graph databases shine when your questions are about connections — who trades with whom, what paths exist between entities, which nodes are central to a network. The Cypher query language makes these patterns readable and concise. In SQL, a two-hop traversal requires nested joins; in Cypher, it is a single line: (a)-[:TRADE]->(b)-[:TRADE]->(c).

Graph schema = SQL DDL. Just as DBMS RAG retrieves CREATE TABLE statements, Graph RAG retrieves node labels and relationship types. Both serve the same purpose: giving the LLM enough structural knowledge to write valid queries. The schema is compact and fixed-size regardless of how much data the graph contains.

The value mismatch problem. As we saw, the LLM may generate syntactically correct Cypher that uses wrong property values (e.g., 'USA' vs. 'United States'). This is an active area of research. Solutions include: adding sample node data to the schema context, using fuzzy matching on property values, or adding a validation step that checks query results before presenting them to the user.

Exercises to try:

  1. Ask the LLM about multi-hop trade paths (e.g., “Which countries connect Germany to Brazil through a single intermediary?”).

  2. Modify the schema context to include sample node values and see if the LLM produces better queries.

  3. Compare the same question answered by SQL RAG (if you had the trade data in a relational table) vs. Graph RAG. Which query is more readable?

  4. Discuss: for what kinds of business data would you choose a graph database over a relational one?

Key takeaways

  • Graph databases model relationships natively -- Cypher patterns like (a)-[:TRADE]->(b) replace multi-join SQL for connection-heavy questions.

  • Graph schema retrieval (node labels, relationship types, properties) plays the same role for Graph RAG as DDL plays for SQL RAG -- compact structural context for query generation.

  • Text-to-Cypher follows the same two-call pattern as text-to-SQL: generate query, execute, then ground the natural-language answer on real results.

  • Value mismatches (LLM writes 'USA' when data stores 'United States') silently return zero rows -- zero results does NOT always mean zero; always sanity-check.

  • Visualization of retrieved subgraphs with NetworkX turns abstract query output into an interpretable picture of the underlying data.


Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~5 minutes (graph construction + Cypher generation)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/GRAPH_RAG_Trade.ipynb