RAG with a Local DBMS: Text-to-SQL using SQLite

What happens when the knowledge you need is not buried in documents, but neatly organized in database tables? In previous chapters we retrieved text chunks using vector similarity. Here we take a fundamentally different approach: instead of searching for similar text, we ask the LLM to write a SQL query that fetches the exact data we need. The database does the heavy lifting — joins, filters, aggregations — and the LLM simply translates English into SQL, then translates the results back into English.

How it works:

Build a SQLite database from the classic Supplier-Parts schema (DDL from a .sql file, data from an .xlsx file).
Extract the database schema as structured context — this is the “retrieval” step.
Given a natural-language question, the LLM generates a SQL query.
Execute the SQL and feed the results back to the LLM for a natural-language answer.

Learning goals:

Understand how RAG applies beyond vector databases
See how schema metadata serves as retrieval context
Practice grounding LLM output on real query results
Compare answers with and without database context

Provider setup: Uses llm_cascade for automatic LLM provider selection.
Store your LLM API keys in Colab Secrets (or a local .env file). Supported keys: GEMINI_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, HF_TOKEN, COHERE_API_KEY, XAI_API_KEY, OLLAMA_API_KEY, OPENROUTER_API_KEY.

If the text-RAG pipeline from Chapter 8’s rag_first_principles.ipynb is still fresh, keep it in mind as a contrast — there, retrieval meant fuzzy vector search over chunks; here retrieval is a precise SQL query, and the “chunks” are just the schema the LLM needs to write that query.

%pip install -q -U openpyxl git+https://github.com/KarAnalytics/llm_cascade.git sentence-transformers

1) Imports and Provider Helpers¶

SQLite is built into Python — no extra database driver needed.
We use openpyxl to load data from the Excel file.

The provider helpers below are identical to the pattern used in the RAG countries notebook: Gemini first, Ollama Cloud fallback, keys from Colab Secrets or .env.

import os
import sqlite3
import re
from pathlib import Path

import openpyxl

from llm_cascade import get_cascade

llm = get_cascade()


def generate_text(prompt, system_prompt=None, **kwargs):
    response = llm.generate(prompt, system_prompt=system_prompt)
    return response.text, response.provider


def has_llm_provider():
    return True  # llm_cascade auto-detects available providers


print("Provider ready (llm_cascade)")

2) Build the SQLite Database from External Files¶

We load the Supplier-Parts database from two companion files hosted in a public GitHub repo:

File	Source	Purpose
`SUPPLIER_PARTS_DDL.sql`	GitHub	CREATE TABLE statements (schema definition)
`SUPPLIER_PARTS.xlsx`	GitHub	Data for each table (one sheet per table)

The next cell downloads both files automatically. If you want to try your own database, uncomment Option B in the cell to upload local .sql and .xlsx files instead.

The demo database has three tables:

Table	Purpose
`SUPPLIERS`	Supplier ID, name, status, and city
`PARTS`	Part ID, name, color, weight, and city
`SHIPMENTS`	Which supplier ships which part, with quantity

This is the classic Supplier-Parts schema used in database textbooks — small enough to fit in a single prompt, but rich enough to demonstrate joins, aggregations, and foreign-key relationships.

import urllib.request

DB_PATH = 'supplier_parts.db'
DDL_FILE = 'SUPPLIER_PARTS_DDL.sql'
XLSX_FILE = 'SUPPLIER_PARTS.xlsx'

# =============================================================================
# Option A (default): download the demo files from GitHub
# =============================================================================
# This is the default so the notebook runs on a fresh Colab without needing
# any file uploads. Students can swap in their own DDL and data files by
# using Option B below instead.

DDL_URL = 'https://raw.githubusercontent.com/KarAnalytics/datasets/master/SUPPLIER_PARTS_DDL.sql'
XLSX_URL = 'https://github.com/KarAnalytics/datasets/raw/master/SUPPLIER_PARTS.xlsx'

print(f'Downloading {DDL_FILE}...')
urllib.request.urlretrieve(DDL_URL, DDL_FILE)
print(f'Downloading {XLSX_FILE}...')
urllib.request.urlretrieve(XLSX_URL, XLSX_FILE)
print('Downloaded both files.')
print()

# =============================================================================
# Option B: upload your own files instead
# =============================================================================
# Uncomment the lines below if you want to try a different database. Upload
# your .sql (DDL) file and .xlsx (data) file to the Colab workspace first,
# then update DDL_FILE / XLSX_FILE above to match your filenames.
#
# from google.colab import files
# uploaded = files.upload()  # a file picker will appear
# DDL_FILE = list(uploaded.keys())[0]       # e.g., 'my_schema.sql'
# XLSX_FILE = list(uploaded.keys())[1]      # e.g., 'my_data.xlsx'


# --- Build the SQLite database from the files ---
if Path(DB_PATH).exists():
    Path(DB_PATH).unlink()  # remove stale DB from previous runs

conn = sqlite3.connect(DB_PATH)
cur = conn.cursor()

# 1) Execute DDL from the .sql file
ddl_text = Path(DDL_FILE).read_text(encoding='utf-8')
cur.executescript(ddl_text)
print(f'Schema loaded from {DDL_FILE}')

# 2) Load data from Excel (one sheet per table)
wb = openpyxl.load_workbook(XLSX_FILE, read_only=True, data_only=True)

# Insertion order matters because of foreign-key constraints:
# SUPPLIERS and PARTS first, then SHIPMENTS.
table_order = ['SUPPLIERS', 'PARTS', 'SHIPMENTS']

for table_name in table_order:
    ws = wb[table_name]
    rows = list(ws.iter_rows(values_only=True))
    headers = rows[0]       # first row is column names
    data_rows = rows[1:]    # remaining rows are data
    placeholders = ', '.join('?' * len(headers))
    cur.executemany(
        f'INSERT INTO {table_name} VALUES ({placeholders})',
        data_rows,
    )
    print(f'  {table_name}: {len(data_rows)} rows inserted')

wb.close()
conn.commit()
print(f'\nDatabase "{DB_PATH}" ready.')

3) Schema Extraction — the “Retrieval” in DBMS RAG¶

This is the conceptual heart of the notebook, so pause and think about it carefully. In document RAG, the retrieval step searches a vector store for text chunks similar to the question. In DBMS RAG, we do something very different: we retrieve the database schema itself — the CREATE TABLE statements that describe what tables exist, what columns they have, and how they relate to each other.

Why the schema and not the data? Because the schema is small and fixed-size regardless of how many rows the database has. A database with 10 rows and a database with 10 billion rows have the same schema. By giving the LLM the schema (plus a few sample rows so it understands data formats), we equip it to write valid SQL queries that will fetch the exact data needed — without ever stuffing millions of rows into a prompt.

Think of it this way: in document RAG, retrieval is the answer (the chunks contain the evidence). In DBMS RAG, retrieval gives the LLM a map of the data, and the LLM uses that map to write a SQL query that goes and gets the actual evidence.

def get_schema_context(db_path, sample_rows=3):
    """Extract CREATE TABLE statements and sample rows as a text block."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()

    # Get all CREATE TABLE DDL
    cur.execute("SELECT name, sql FROM sqlite_master WHERE type='table' ORDER BY name")
    tables = cur.fetchall()

    parts = []
    for table_name, ddl in tables:
        parts.append(f"-- Table: {table_name}")
        parts.append(ddl + ";")

        # Sample rows
        cur.execute(f"SELECT * FROM [{table_name}] LIMIT {int(sample_rows)}")
        rows = cur.fetchall()
        col_names = [desc[0] for desc in cur.description]
        parts.append(f"-- Sample rows ({table_name}): columns = {col_names}")
        for row in rows:
            parts.append(f"--   {row}")
        parts.append("")

    con.close()
    return "\n".join(parts)


schema_context = get_schema_context(DB_PATH)
print(schema_context)

4) Text-to-SQL Generation¶

This is the core RAG step: the LLM receives the schema context and the user’s natural-language question, then produces a SQL query. Notice the two-phase design: the first LLM call generates SQL (it does not answer the question directly), and we will execute that SQL against the real database before asking the LLM to compose a final answer. This separation is key — the LLM never sees the raw data during SQL generation, only the schema. It is essentially writing a recipe for how to find the answer.

Safety note: We only allow SELECT statements. The extract_sql function strips markdown fences and validates that the query starts with SELECT before execution, preventing any accidental DROP TABLE or UPDATE commands. In a production system you would also run queries with a read-only database connection.

SQL_SYSTEM_PROMPT = (
    "You are an expert SQL analyst. Given the database schema below, write a "
    "single SQLite-compatible SELECT query that answers the user's question.\n"
    "Return ONLY the SQL query — no explanation, no markdown fences, no comments.\n"
    "If the question cannot be answered from the schema, reply with: SELECT 'NOT_ANSWERABLE' AS result;"
)


def extract_sql(raw_text):
    """Strip markdown fences or extra prose, returning just the SQL statement."""
    match = re.search(r"```(?:sql)?\s*\n?(.*?)```", raw_text, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return raw_text.strip().rstrip(";")


def generate_sql(question, schema, **kwargs):
    """Ask the LLM to produce a SELECT query for the given question."""
    prompt = f"DATABASE SCHEMA:\n{schema}\n\nQUESTION:\n{question}"
    raw_response, provider = generate_text(
        prompt=prompt,
        system_prompt=SQL_SYSTEM_PROMPT,
    )
    sql = extract_sql(raw_response)
    return sql, provider


def execute_sql(db_path, sql):
    """Run a read-only SQL query and return column names + rows."""
    normalized = sql.strip().upper()
    if not normalized.startswith("SELECT"):
        return [], [], "Blocked: only SELECT queries are allowed."

    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(sql)
        columns = [desc[0] for desc in cur.description]
        rows = cur.fetchall()
        return columns, rows, None
    except Exception as exc:
        return [], [], str(exc)
    finally:
        con.close()


print("Text-to-SQL functions ready.")

5) Answer Generation — Grounding on Query Results¶

After executing the SQL, we feed the actual query results back to the LLM and ask it to compose a natural-language answer. This is the second LLM call, and it follows the exact same pattern as document RAG: retrieved evidence goes into the prompt, and the model must base its answer on that evidence alone. The difference is that in document RAG the evidence was text chunks found by vector search; here the evidence is database rows returned by a SQL query.

This two-call architecture is what makes DBMS RAG so powerful for numerical questions. When you ask “What is the total quantity of all shipments?”, the LLM does not have to add numbers in its head (which it is bad at). Instead, it writes SELECT SUM(Quantity) FROM SHIPMENTS, the database computes the exact answer, and the LLM simply wraps that number in a sentence. The math is always right because SQL does the math.

ANSWER_SYSTEM_PROMPT = (
    "You are a helpful business analyst. Use ONLY the provided SQL results to answer "
    "the user's question. If the results are empty or the query returned NOT_ANSWERABLE, "
    "say you cannot determine the answer from the available data."
)


def answer_with_dbms_rag(question, db_path=DB_PATH, **kwargs):
    """Full DBMS-RAG pipeline: schema retrieval -> SQL generation -> execution -> answer."""
    schema = get_schema_context(db_path)

    print(f"  [Step 1] Schema retrieved ({len(schema)} chars)")

    sql, sql_provider = generate_sql(question, schema)
    print(f"  [Step 2] SQL generated (provider: {sql_provider})")
    print(f"           {sql}")

    columns, rows, error = execute_sql(db_path, sql)
    if error:
        print(f"  [Step 3] SQL execution error: {error}")
        return f"SQL failed: {error}", sql, sql_provider

    print(f"  [Step 3] SQL executed — {len(rows)} row(s) returned")

    result_text = f"Columns: {columns}\n"
    for row in rows:
        result_text += f"  {row}\n"

    prompt = (
        f"QUESTION:\n{question}\n\n"
        f"SQL QUERY USED:\n{sql}\n\n"
        f"QUERY RESULTS:\n{result_text}"
    )
    answer, ans_provider = generate_text(
        prompt=prompt,
        system_prompt=ANSWER_SYSTEM_PROMPT,
    )
    print(f"  [Step 4] Answer generated (provider: {ans_provider})")

    return answer, sql, sql_provider


def answer_without_rag(question, **kwargs):
    """Direct LLM answer with no database context."""
    answer, provider = generate_text(prompt=question)
    return answer, provider


print("RAG answer pipeline ready.")

6) Run End-to-End Examples¶

Time to see the full pipeline in action. We test several questions and compare two approaches side by side:

With DBMS RAG: schema retrieval, SQL generation, execution, and a grounded answer based on real query results.
Without RAG: the LLM answers the question directly with no database access.

Pay close attention to the “without RAG” answers. The model has no way to know that total shipment quantity is 5,300 — it has never seen this particular database. So it either asks for data it does not have (the honest response) or invents a plausible-sounding but completely wrong number (hallucination). The RAG version, by contrast, always gives the exact answer because the SQL engine computed it.

questions = [
    "What is the total quantity of all shipments?",
    #"Which supplier has shipped the most parts? Show their name and total quantity.",
    #"List all parts supplied by supplier S1, with quantities.",
    #"How many distinct parts are shipped from London?",
    #"What is the average shipment quantity per supplier located in Paris?",
]


def preview(text, max_len=800):
    text = text or ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


if not has_llm_provider():
    print("Error: No LLM API key configured. Set at least one API key in Colab Secrets.")
else:
    for i, q in enumerate(questions, start=1):
        print("\n" + "=" * 80)
        print(f"Q{i}. {q}\n")

        try:
            answer_rag, sql_used, prov = answer_with_dbms_rag(q)
            print("\n  [WITH DBMS RAG]")
            print(preview(answer_rag))
        except Exception as e:
            print(f"  RAG error: {e}")

        try:
            answer_direct, prov_direct = answer_without_rag(q)
            print(f"\n  [WITHOUT RAG] (provider: {prov_direct})")
            print(preview(answer_direct))
        except Exception as e:
            print(f"  Direct error: {e}")

Checkpoint: Reflection Questions¶

Compare the RAG and non-RAG answers. Which ones show hallucinated numbers?
What happens if you ask a question about data that is not in the database (e.g., supplier revenue or profit)?
What are the risks of letting an LLM generate SQL against a production database?
How does this DBMS-RAG pattern compare to the vector-store RAG we used with ChromaDB?

7) How DBMS RAG Works — and How It Compares to Document RAG¶

This notebook implements RAG using a SQL database as the knowledge source instead of chunked documents. The pipeline has two retrieval phases, both appended to prompts before going to the LLM — so it’s structurally similar to document RAG, but with a twist.

The pipeline (per question)¶

Question
  |
  +-> Step 1: Retrieve SCHEMA (not data)
  |     get_schema_context() pulls CREATE TABLE statements
  |     + 3 sample rows per table
  |
  +-> Step 2: LLM generates SQL
  |     Prompt = schema + question -> LLM -> SELECT ...
  |
  +-> Step 3: Execute the SQL on the database
  |     Runs against SQLite, returns columns + rows
  |
  +-> Step 4: LLM writes the final answer
        Prompt = question + SQL + query results -> LLM -> natural language answer

Two LLM calls per question:

Text-to-SQL (generate_sql) — schema goes in, SQL query comes out
Answer generation (inside answer_with_dbms_rag) — query results go in, natural-language answer comes out

Is it different from document RAG?¶

Same core principle, different retrieval mechanism. Both patterns follow: retrieve -> stuff into prompt -> generate answer. The differences:

	Document RAG	DBMS RAG (this notebook)
What’s retrieved first?	Text chunks similar to the question	The database schema (DDL + sample rows)
How is retrieval done?	Vector similarity (cosine) on embeddings	Just dump all CREATE TABLE statements
Is question content matched against stored data?	Yes, directly (embedding search)	No — retrieval is schema, not data
When does the actual data enter the prompt?	Immediately (chunks = data)	In a second LLM call, after SQL executes
Scales with data size?	Gets slower (more chunks to search)	No — schema is fixed size even if the DB has billions of rows
Handles aggregations?	Poorly (LLM must do math in its head)	Perfectly (SQL `SUM`, `COUNT`, `AVG` are exact)

The key insight¶

In document RAG, retrieval and grounding happen in the same step — the chunks you retrieve ARE the evidence. Whatever the vector search brings back is what the LLM sees.

In DBMS RAG, retrieval happens twice:

First retrieval = schema only (what tables exist, what columns, what types)
Second “retrieval” = the SQL query results (the actual evidence, fetched dynamically)

This is why it’s so much better for structured data:

Document RAG has to hope the right chunks got retrieved. If your question is “total sales in Q3,” it retrieves chunks that mention Q3 sales — but the LLM has to add them up, and may miss rows or hallucinate.
DBMS RAG doesn’t hope. It asks the LLM to write SELECT SUM(amount) WHERE quarter='Q3', runs it, and shows the exact answer.

What does `get_schema_context()` actually retrieve?¶

Two things:

CREATE TABLE statements from SQLite’s sqlite_master metadata table. This is cheap — it returns the same few thousand characters regardless of whether the tables have 100 rows or 100 million.
3 sample rows per table. This gives the LLM a feel for the data format (e.g., “oh, dates are stored as strings like 2025-Q1, supplier IDs are integers, colors are lowercase”). These sample rows are the closest thing to “chunks” in DBMS RAG.

The second LLM call — where “actual retrieval” happens¶

After execute_sql() runs and returns rows, the notebook stuffs those rows INTO the prompt for the second LLM call:

prompt = (
    f'QUESTION:\n{question}\n\n'
    f'SQL QUERY USED:\n{sql}\n\n'
    f'QUERY RESULTS:\n{result_text}'
)

This is exactly the same pattern as document RAG — retrieved evidence + question → LLM → answer. The only difference is that the “retrieval” was done via a SQL query instead of a vector search.

TL;DR¶

Is DBMS RAG different from document RAG? Yes in how retrieval is done, no in the overall pattern.

Document RAG: vector_search(question) -> chunks -> prompt -> LLM -> answer
DBMS RAG: read_schema() -> prompt -> LLM -> SQL -> execute SQL -> results -> prompt -> LLM -> answer

Both pipelines ground the LLM’s final answer on retrieved evidence. DBMS RAG just retrieves via SQL execution instead of embedding similarity — which makes it perfect for structured/numeric data where aggregations and joins matter, and completely impractical for unstructured text where there’s no schema to query.

Where this notebook sits in the RAG family¶

Each type of RAG uses a different retrieval mechanism but the same pattern (retrieve evidence → stuff into prompt → ground the answer):

Type	Best for	Retrieval mechanism	Example notebook
Document RAG	Unstructured text (articles, PDFs)	Vector similarity	`LlamaIndex_RAG`, `LangChain_demo`
DBMS RAG (this one)	Structured tabular data	Text-to-SQL + execution	`DBMS_RAG_SQLite`
Graph RAG	Relational / network data	Text-to-Cypher + execution	`GRAPH_RAG_Trade`
Image / Video RAG	Visual content	CLIP embedding similarity	`Image_RAG`, `Video_RAG`

Key takeaways¶

RAG is not limited to vector databases — any structured data source can serve as retrieved context.
In DBMS RAG, the schema is the retrieval context and SQL execution is the grounding mechanism.
Grounding the final answer on actual query results prevents hallucination of specific numbers.
Loading schema from .sql files and data from .xlsx files mirrors real-world enterprise workflows.

Exercises¶

Add a new table (e.g., PROJECTS with a project-supplier-part relationship) and ask questions that require joins across it.
Try asking ambiguous questions and observe how the LLM interprets them.
Modify SQL_SYSTEM_PROMPT to request query explanations alongside the SQL.
Compare the quality of SQL generated by different models in llm_cascade (Gemini vs. Groq vs. Ollama).
Swap in a completely different database by uncommenting Option B in cell 5 and uploading your own .sql and .xlsx files.
Discuss: when would you prefer DBMS RAG over vector-store RAG for a business application?

Run the code¶

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/DBMS_RAG_SQLite.ipynb