RAG Example (North America): ChromaDB + LLM Cascade

Every large language model has a knowledge cutoff -- a date beyond which it simply has not seen the world. Ask it about last quarter’s GDP figures or your company’s internal policies, and it will either confess ignorance or, worse, confidently hallucinate an answer. Retrieval-Augmented Generation (RAG) solves this by giving the model a research assistant: before answering, the system retrieves relevant documents from a knowledge base and injects them into the prompt as context. The model then generates its response grounded in real evidence rather than memory alone.

In this notebook we build a compact RAG pipeline over the CIA World Factbook -- a public-domain dataset that contains structured intelligence about every country on earth, from population statistics to trade balances to energy production. We scope the demo to North America only so that indexing runs in under a minute on free-tier hardware, but the architecture generalizes to any document corpus.

What you will learn:

How to ingest structured JSON and flatten it into retrieval-ready text documents
How recursive chunking with overlap preserves context continuity at chunk boundaries
How ChromaDB handles embedding and vector storage automatically, with no external API key required
How grounding an LLM on retrieved context changes its answers -- and where it still falls short

Why ChromaDB? Unlike cloud vector databases that require API keys and billing, ChromaDB runs entirely in-process. It pairs with a local sentence-transformer model (all-MiniLM-L6-v2) to embed your text, so embedding and retrieval run on your own machine with no credentials or billing. Only the initial data download and the final generation step need the network, and generation needs one working LLM provider key. That makes it ideal for classroom demos where you want students focused on the concepts, not on credential management.

This notebook is the framework-assisted version of the pipeline you built by hand in Chapter 8’s rag_first_principles.ipynb — the seven manual steps are still there, but ChromaDB and LangChain hide most of the plumbing.

%pip install -q -U chromadb langchain-community langchain-text-splitters pandas requests git+https://github.com/KarAnalytics/llm_cascade.git sentence-transformers

1) Imports and Environment Setup

Before we build anything, we need our toolkit. The libraries below each play a distinct role in the RAG pipeline: requests for fetching raw data from GitHub, chromadb for vector storage and retrieval, and RecursiveCharacterTextSplitter from LangChain for intelligent chunking. pandas is available for ad hoc tabular inspection, though the cells below do not call it. Notice that we do not import any embedding library here -- ChromaDB will load the embedding function when we configure the collection later.

API keys are resolved automatically. In Google Colab, you can store them in Secrets (e.g., GEMINI_API_KEY, OPENAI_API_KEY, GROQ_API_KEY). Outside Colab, the code falls back to environment variables. The llm_cascade library tries each configured provider in order, so you only need one working key to proceed.

import os
import pandas as pd
import requests
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

from llm_cascade import get_cascade

llm = get_cascade()


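def generate_text(prompt, system_prompt=None, **kwargs):
    """Send one prompt through the cascade and return (text, provider_name)."""
    # Extra keyword arguments are accepted for call-site flexibility but are not forwarded.
    response = llm.generate(prompt, system_prompt=system_prompt)
    return response.text, response.provider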


def has_llm_provider():
    return True  # llm_cascade auto-detects available providers
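The key lookup described above (Colab Secrets first, environment variables second) can be sketched in a few lines. This is an illustration of the fallback pattern only, not llm_cascade's actual implementation; `resolve_api_key` is a hypothetical helper, and the sketch assumes Colab's `google.colab.userdata` module is the source of Secrets.

```python
import os


def resolve_api_key(name):
    """Return the secret `name` from Colab Secrets, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret missing or notebook access not granted: fall back to os.environ.
            pass
    return os.environ.get(name)


# Hypothetical check of which providers have a key available.
key_names = ("GEMINI_API_KEY", "OPENAI_API_KEY", "GROQ_API_KEY")
configured = [k for k in key_names if resolve_api_key(k)]
print("Providers with a key:", configured or "none")
```

A check along these lines could stand behind has_llm_provider(), which as written always returns True and leaves provider detection to the cascade.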

2) Build a Minimal RAG Pipeline

A RAG pipeline has four distinct phases, each solving a different problem. First, data ingestion fetches raw documents and converts them into plain text that an embedding model can understand. Second, chunking splits long documents into smaller, overlapping pieces so that retrieval can zero in on the most relevant passage rather than returning an entire country profile. Third, vector storage embeds each chunk into a high-dimensional vector and indexes it for fast similarity search. Fourth, grounded generation retrieves the top-K most relevant chunks for a user question and feeds them to the LLM as context, constraining its answer to what the evidence actually says.

We will build each phase as a standalone function, then wire them together at the end. This modular design makes it easy to swap components -- for example, replacing ChromaDB with Pinecone, or changing the chunking strategy -- without rewriting the rest of the pipeline.

Function: `_collect_text_lines`

The CIA World Factbook stores each country as deeply nested JSON -- dictionaries inside dictionaries inside lists. A vector database, however, needs flat text strings to embed. This recursive helper walks the entire JSON tree, collects every leaf value, and pairs it with its full key path (e.g., Economy.GDP.real growth rate.text: 2.1%). The result is a list of human-readable lines that preserve the hierarchical context of each fact.

Why does this matter? If we simply dumped the raw JSON into ChromaDB, the embedding model would waste capacity on braces, brackets, and structural tokens that carry no semantic meaning. By flattening first, every token in the embedded text contributes to the meaning of the passage, which directly improves retrieval quality when a student later asks “What is Mexico’s GDP growth rate?”

# --- PHASE 1: DATA INGESTION (requests + GitHub) ---
def _collect_text_lines(obj, parent_key=""):
    """Recursively flatten nested JSON into readable key-value text lines."""
    lines = []

    if isinstance(obj, dict):
        for k, v in obj.items():
            next_key = f"{parent_key}.{k}" if parent_key else str(k)
            lines.extend(_collect_text_lines(v, next_key))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            next_key = f"{parent_key}[{i}]" if parent_key else f"[{i}]"
            lines.extend(_collect_text_lines(item, next_key))
    else:
        if obj is None:
            return []
        text = str(obj).strip()
        if text:
            lines.append(f"{parent_key}: {text}")
    return lines
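To see what the flattener produces, here is a minimal sketch on a tiny record. The function is a condensed copy of the helper above so the block runs on its own; the sample dictionary and its values are invented for illustration and are not Factbook data.

```python
def flatten(obj, parent_key=""):
    """Condensed copy of the recursive flattener: one 'key.path: value' line per leaf."""
    if isinstance(obj, dict):
        return [line for k, v in obj.items()
                for line in flatten(v, f"{parent_key}.{k}" if parent_key else str(k))]
    if isinstance(obj, list):
        return [line for i, item in enumerate(obj)
                for line in flatten(item, f"{parent_key}[{i}]")]
    text = "" if obj is None else str(obj).strip()
    return [f"{parent_key}: {text}"] if text else []


# Hypothetical record shaped like a Factbook entry; the values are made up.
sample = {
    "Economy": {
        "Real GDP growth rate": {"text": "2.1%"},
        "Exports - partners": ["US", "China"],
        "Notes": None,  # empty leaves are dropped
    }
}

flat_lines = flatten(sample)
for line in flat_lines:
    print(line)
# Economy.Real GDP growth rate.text: 2.1%
# Economy.Exports - partners[0]: US
# Economy.Exports - partners[1]: China
```

Each output line carries its full key path, so a chunk that contains only that line still says which section of the record the value came from.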

Function: `_get_country_json_paths`

The full CIA World Factbook covers over 250 countries and territories, which would take several minutes to download and index on free-tier hardware. For a classroom demo, speed matters more than completeness, so this function filters the GitHub repository tree to keep only files under north-america/. That gives us seven countries -- enough to demonstrate every RAG concept while keeping the total pipeline runtime under 60 seconds.

The function uses the GitHub Trees API to list all files in the repository without downloading them, then applies the region filter. The sorted, deduplicated output ensures that every student in the room gets the same document order, which makes it easier to compare results during class discussion.

def _get_country_json_paths():
    """Discover country JSON files, restricted to North America."""
    tree_url = "https://api.github.com/repos/factbook/factbook.json/git/trees/master?recursive=1"
    r = requests.get(tree_url, timeout=30)
    r.raise_for_status()
    tree = r.json().get("tree", [])

    allowed_prefixes = ("north-america/",)
    country_paths = []
    for item in tree:
        path = item.get("path", "")
        if item.get("type") == "blob" and path.endswith(".json") and "/" in path:
            if path.lower().startswith(allowed_prefixes):
                country_paths.append(path)

    # Deduplicate and sort for stable classroom demos.
    return sorted(set(country_paths))
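The region filter can be exercised without touching the network. The sketch below applies the same rules to a hand-written list that mimics the "tree" field of a Git Trees response; `filter_country_paths` and the `fake_tree` entries are illustrative names and data, not part of the notebook.

```python
def filter_country_paths(tree, allowed_prefixes=("north-america/",)):
    """Keep JSON blobs under the allowed region folders, sorted and deduplicated."""
    keep = set()
    for item in tree:
        path = item.get("path", "")
        if item.get("type") == "blob" and path.endswith(".json"):
            # str.startswith accepts a tuple, so more regions can be added later.
            if path.lower().startswith(allowed_prefixes):
                keep.add(path)
    return sorted(keep)


# Hand-written stand-in for the "tree" list; not a real API response.
fake_tree = [
    {"path": "north-america", "type": "tree"},          # a folder, not a file
    {"path": "north-america/us.json", "type": "blob"},
    {"path": "north-america/ca.json", "type": "blob"},
    {"path": "europe/fr.json", "type": "blob"},         # wrong region
    {"path": "README.md", "type": "blob"},              # not JSON
    {"path": "north-america/us.json", "type": "blob"},  # duplicate entry
]

selected = filter_country_paths(fake_tree)
print(selected)  # ['north-america/ca.json', 'north-america/us.json']
```

Widening the demo to another region would then be a matter of adding its folder prefix to the tuple, at the cost of a longer indexing run.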

Function: `get_world_data`

This is the orchestrator for the entire ingestion phase. It discovers country files, downloads each one, flattens the nested JSON into readable text using _collect_text_lines, and packages the result as a list of document dictionaries -- each containing a content string and a metadata dict with the country code and source path. The metadata travels alongside the content through chunking and indexing, so that when we retrieve a chunk later, we always know which country it came from.

Think of this function as the “ETL” step of the RAG pipeline: it extracts raw data from an external source, transforms it into a format suitable for embedding, and loads it into memory ready for the next phase.

def get_world_data():
    """Fetch all country records and convert each into one retrieval-ready full-text document."""
    print("Step 1: Discovering country files from GitHub...")
    base_url = "https://raw.githubusercontent.com/factbook/factbook.json/master"
    country_paths = _get_country_json_paths()
    print(f"Discovered {len(country_paths)} country files.")

    docs = []
    for path in country_paths:
        try:
            r = requests.get(f"{base_url}/{path}", timeout=20)
            r.raise_for_status()
            payload = r.json()
        except Exception as exc:
            print(f"Skipping {path}: {exc}")
            continue

        lines = _collect_text_lines(payload)
        if not lines:
            continue

        country_code = path.split("/")[-1].replace(".json", "").upper()
        combined_text = (
            f"COUNTRY_CODE: {country_code}\n"
            f"SOURCE_PATH: {path}\n\n"
            + "\n".join(lines)
        )

        docs.append(
            {
                "content": combined_text,
                "metadata": {
                    "country": country_code,
                    "path": path,
                },
            }
        )

    print(f"Loaded {len(docs)} country documents.")
    return docs
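The shape of one packaged document is easier to see in isolation. This sketch repeats only the packaging step, taking already-flattened lines as input; `build_document` is a hypothetical helper name and the sample line is invented.

```python
def build_document(path, lines):
    """Package flattened lines into the {content, metadata} shape used by the pipeline."""
    # The country code is the file name without its extension, upper-cased.
    country_code = path.split("/")[-1].replace(".json", "").upper()
    content = (
        f"COUNTRY_CODE: {country_code}\n"
        f"SOURCE_PATH: {path}\n\n"
        + "\n".join(lines)
    )
    return {"content": content, "metadata": {"country": country_code, "path": path}}


# Illustrative input; the line below is not copied from the Factbook.
doc = build_document(
    "north-america/mx.json",
    ["Government.Capital.name.text: Mexico City"],
)
print(doc["metadata"])
print(doc["content"])
```

Because the country code and source path are written into the content as well as the metadata, the first chunk of each document also embeds them as text.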

Function: `chunk_documents`

A single country document can run to thousands of lines -- far too long for an embedding model to capture in one vector. Chunking solves this by splitting each document into smaller, digestible pieces. We use LangChain’s RecursiveCharacterTextSplitter, which tries to break text at the coarsest natural boundary that fits (by default paragraph breaks, then line breaks, then spaces) before falling back to a hard character cut.

The two key parameters are chunk_size (800 characters by default) and chunk_overlap (100 characters). The overlap matters: without it, a fact that straddles a chunk boundary would be split in half, and neither chunk would contain the complete information. With it, the tail of one chunk is repeated at the head of the next, so such a fact is more likely to appear whole in at least one of them. These two values are among the most impactful tuning knobs in any RAG system.

# --- PHASE 2: RECURSIVE CHUNKING ---
def chunk_documents(docs, chunk_size=800, chunk_overlap=100):
    """Chunk long records so semantic retrieval can match smaller relevant spans."""
    print("Step 2: Chunking with overlap...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    chunks, metadatas, ids = [], [], []
    for doc in docs:
        split_texts = splitter.split_text(doc["content"])
        for i, text in enumerate(split_texts):
            chunks.append(text)
            metadatas.append(doc["metadata"])
            ids.append(f"{doc['metadata']['country']}_{i}")
    return chunks, metadatas, ids


# Small inspection check for students before moving on.
docs_preview = get_world_data()
print(f"Loaded {len(docs_preview)} source documents.")
print("Example metadata:", docs_preview[0]["metadata"] if docs_preview else "No documents loaded")
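The effect of overlap is easiest to see on a string short enough to read. The sketch below uses a plain fixed-width sliding window, a deliberately simplified stand-in for RecursiveCharacterTextSplitter (it ignores separators entirely); `sliding_chunks` and the sample sentence are illustrative.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width chunks; each repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


# Illustrative sample text, sized so that one fact straddles the 35-character boundary.
text = "Capital: Ottawa. Population: 38 million. Currency: Canadian dollar."
fact = "38 million"

no_overlap = sliding_chunks(text, chunk_size=35, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_size=35, chunk_overlap=10)

print(any(fact in c for c in no_overlap))    # False: the boundary cuts the fact in two
print(any(fact in c for c in with_overlap))  # True: the second chunk holds it whole
print(len(no_overlap), len(with_overlap))    # 2 3
```

The overlap buys completeness at the price of an extra chunk to embed and store, which is the trade-off behind the chunk_size and chunk_overlap defaults.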

Checkpoint 1: Data and Chunking Reflection

Pause here and consider these questions before moving on. They highlight the design decisions that most affect RAG quality downstream.

Why do we transform structured JSON fields into one text document per country before retrieval? What would go wrong if we indexed the raw JSON directly?
How might changing chunk_size from 800 to 300 affect precision and recall? Would the answer to “What is Canada’s population?” improve or degrade?
What is the purpose of overlap, and what failure mode does it reduce?

Function: `build_vector_store`

This function wires together the embedding model selection, collection management, and batch upsert logic. We explicitly specify all-MiniLM-L6-v2 as the embedding model -- a compact, fast sentence-transformer that runs locally without any API key. It produces 384-dimensional vectors and handles English text well enough for our factbook data.

The function deletes any existing collection before creating a fresh one. This is a classroom convenience: it prevents the confusing situation where a student runs the notebook twice and ends up with duplicate chunks that skew retrieval results. In production, you would instead use versioned collections or incremental upserts with deduplication logic.

def build_vector_store(docs, db_path="./factbook_db", collection_name="world_facts", batch_size=5000):
    """Create (or reset) a Chroma collection and upsert chunked records in batches."""
    chunks, metadatas, ids = chunk_documents(docs)

    chroma_client = chromadb.PersistentClient(path=db_path)

    # Explicitly specify the embedding model that converts text chunks into vectors
    from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
    embedding_fn = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
    print("  Embedding model: all-MiniLM-L6-v2 (local, no API key needed)")

    # Delete any existing collection to avoid embedding function conflicts, then create fresh
    try:
        chroma_client.delete_collection(name=collection_name)
    except Exception:
        pass
    collection = chroma_client.create_collection(
        name=collection_name,
        embedding_function=embedding_fn,
    )

    print("Step 3: Storing vectors in ChromaDB...")
    for start in range(0, len(ids), batch_size):
        stop = start + batch_size
        collection.upsert(
            documents=chunks[start:stop],
            metadatas=metadatas[start:stop],
            ids=ids[start:stop],
        )

    print(f"Stored {len(ids)} chunks in collection '{collection_name}'.")
    return collection

collection = build_vector_store(docs_preview)
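One possible shape for the "incremental upserts with deduplication" alternative mentioned above is to derive each chunk ID from the chunk's content, so that upserting the same text twice targets the same ID. This is a sketch of that idea under stated assumptions, not a production recipe; `content_id` and `dedupe_chunks` are hypothetical helpers, and stale chunks left behind by a change in chunking parameters would still need separate cleanup.

```python
import hashlib


def content_id(country, chunk_text):
    """Deterministic ID: country code plus a short hash of the chunk text."""
    digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()[:16]
    return f"{country}_{digest}"


def dedupe_chunks(country, chunk_texts):
    """Return (ids, texts) with repeated chunk texts collapsed to one entry, order preserved."""
    unique = {}
    for text in chunk_texts:
        unique.setdefault(content_id(country, text), text)
    return list(unique.keys()), list(unique.values())


# Illustrative chunks; the first and third are identical.
ids, texts = dedupe_chunks("CA", ["Capital: Ottawa", "Currency: CAD", "Capital: Ottawa"])
print(len(ids), texts)
```

The resulting ids and texts could be passed to collection.upsert in place of the positional IDs used above, in which case the collection would no longer need to be deleted on every run.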

2B) Retrieve Context and Generate a Grounded Answer

This is where the “retrieval” meets the “generation” in RAG. When a user asks a question, we first embed the question using the same sentence-transformer model and find the top-K nearest chunks in the vector store. Those chunks become the context paragraph in the LLM prompt, and a carefully worded system instruction tells the model to answer only from the provided context -- if the answer is not there, it should say so rather than guessing.

This retrieve-then-generate pattern is the beating heart of every RAG system. The retrieval step acts as an evidence filter, dramatically narrowing the LLM’s attention from its entire training corpus down to a handful of relevant passages. The prompting constraint -- “use ONLY the provided context” -- further reduces hallucination by giving the model explicit permission to say “I don’t know.” Together, these two mechanisms produce answers that are both relevant and verifiable: a human can inspect the retrieved chunks to confirm whether the model’s claims are actually supported by the evidence.
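What "top-K nearest chunks" means can be shown with a toy index. The vectors below are made-up three-dimensional stand-ins (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and cosine similarity is used for illustration; the metric a given Chroma collection actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the IDs of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical index: the first axis loosely means "economy", the second "geography".
toy_index = {
    "MX_economy": [0.9, 0.1, 0.0],
    "MX_geography": [0.1, 0.9, 0.1],
    "CA_economy": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # an "economy-like" question

nearest = top_k(query, toy_index, k=2)
print(nearest)  # ['MX_economy', 'CA_economy']
```

Note that the second hit comes from a different country: similarity search ranks by meaning, not by metadata, which is one reason the retrieved context is worth inspecting.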

Functions: `answer_with_rag` and `answer_without_rag`

These two functions are the core of our teaching comparison. answer_with_rag performs the full retrieve-then-generate pipeline: it queries ChromaDB for the top-K chunks, assembles them into a context block, and prompts the LLM to answer only from that evidence. answer_without_rag skips retrieval entirely and sends the bare question to the LLM, relying solely on whatever the model memorized during pre-training.

Running the same question through both functions reveals exactly what RAG adds -- and what it costs. The RAG answer will be grounded in specific, verifiable data from the Factbook (exchange rates, population figures, trade balances), while the non-RAG answer will draw on the model’s general knowledge, which may be outdated, vague, or simply wrong. Comparing them side by side is the single most effective way to build intuition for when RAG is worth the engineering overhead and when a bare LLM call suffices.

def answer_with_rag(
    question,
    collection,
    n_results=4,
    **kwargs,
):
    """Retrieve nearest chunks, then answer using the LLM cascade."""
    print(f"Step 4: Retrieving context for: '{question}'")
    results = collection.query(query_texts=[question], n_results=n_results)
    context = "\n\n".join(results["documents"][0])

    print("Step 5: Generating grounded answer...")
    system_prompt = "You are a careful analyst. Use ONLY the provided context to answer. If the answer is not in the context, say you don't know."
    user_prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
    answer_text, provider = generate_text(
        prompt=user_prompt,
        system_prompt=system_prompt,
    )
    print(f"Provider used: {provider}")
    return answer_text, context


def answer_without_rag(question, **kwargs):
    """Direct LLM answer without retrieval."""
    answer_text, provider = generate_text(prompt=question)
    print(f"Provider used: {provider}")
    return answer_text
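The only difference between the two pathways, from the model's point of view, is the text it receives. The sketch below reproduces the prompt assembly from answer_with_rag on its own, with no LLM call; `build_rag_prompt` is a hypothetical helper and the two chunks are invented.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Join retrieved chunks into a context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"


# Illustrative chunks; the values are made up, not Factbook data.
chunks = [
    "COUNTRY_CODE: MX\nEconomy.Real GDP growth rate.text: 3.2%",
    "COUNTRY_CODE: MX\nEconomy.Exports - partners.text: US 80%",
]
question = "What is Mexico's GDP growth rate?"

prompt = build_rag_prompt(question, chunks)
print(prompt)
```

Printing the assembled prompt before sending it is a cheap way to confirm that the evidence the model is told to rely on is actually present.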

2C) Run an End-to-End Query

Now we put it all together. The cell below runs one or more comparison questions through both pathways -- with and without RAG -- and prints the results side by side. Pay close attention to the [RETRIEVED CONTEXT PREVIEW] section: it shows you exactly which chunks the vector store returned. This is your audit trail. If the RAG answer is wrong, the first thing to check is whether the retrieved context actually contained the relevant information. A bad answer from good context is a generation problem (fix the prompt); a bad answer from irrelevant context is a retrieval problem (fix the chunking or the query).

Tip: try uncommenting the other comparison questions to see how the system handles hallucination tests (asking about fictional “Atlantis”), cross-country comparisons, and questions that blend factbook data with general world knowledge.
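The retrieval-versus-generation diagnosis described above can be roughed out as a substring check when you already know the fact you expected. This is a crude sketch: `triage` is a hypothetical helper, and exact substring matching will miss paraphrases or reformatted numbers.

```python
def triage(expected_fact, retrieved_context, answer):
    """Classify a RAG result by where the expected fact shows up."""
    fact = expected_fact.lower()
    in_context = fact in retrieved_context.lower()
    in_answer = fact in answer.lower()
    if in_context and in_answer:
        return "ok"
    if in_context:
        return "generation problem: evidence was retrieved but not used"
    if in_answer:
        return "ungrounded: answer states the fact without retrieved support"
    return "retrieval problem: evidence never reached the prompt"


# Illustrative strings; none of these come from a real run.
good_context = "Economy.Real GDP growth rate.text: 3.2%"
bad_context = "Geography.Area.total.text: 1,964,375 sq km"

print(triage("3.2%", good_context, "Growth was 3.2%."))
print(triage("3.2%", good_context, "I don't know."))
print(triage("3.2%", bad_context, "I don't know."))
```

The third case is the one the "use ONLY the provided context" instruction is designed to produce: the model declines because retrieval failed, so the fix belongs in chunking or the query, not in the prompt.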

Function: `preview`

This small helper trims long model outputs and retrieved context so the classroom output stays readable. Without it, a single country’s flattened text can fill several screens, burying the pedagogical signal in noise. The max_len parameter defaults to 900 characters, which is what the answers use; the comparison loop passes max_len=1200 for the retrieved context. Both limits are enough to see the key facts without scrolling endlessly.

comparison_questions = [
    #{
    #    "type": "Hallucination test (out-of-scope)",
    #    "question": "What is the current prime minister of Atlantis, and what is Atlantis GDP growth?",
    #},
    {
        "type": "Local info extraction",
        "question": "Summarize Mexico's economy overview in 3 bullet points.",
    },
    #{
    #    "type": "Local info extraction",
    #    "question": "What is the population and GDP of Canada?",
    #},
    #{
    #    "type": "Local + public knowledge blend",
    #    "question": "Explain how the United States trade policies may influence its North American neighbors.",
    #},
]


def preview(text, max_len=900):
    text = text or ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


if "collection" not in globals():
    print("Error: Build the Chroma vector store first before running this comparison.")
elif not has_llm_provider():
    print("Error: No LLM API key configured. Set at least one API key in Colab Secrets.")
else:
    try:
        for i, item in enumerate(comparison_questions, start=1):
            q_type = item["type"]
            q = item["question"]
            print("=" * 90)
            print(f"Q{i}. {q_type}")
            print(f"QUESTION: {q}")
            print()

            direct_answer = answer_without_rag(q)
            print("[WITHOUT RAG]")
            print(preview(direct_answer))

            rag_answer, rag_context = answer_with_rag(q, collection, n_results=4)
            print("[WITH RAG]")
            print(preview(rag_answer))

            print("[RETRIEVED CONTEXT PREVIEW]")
            print(preview(rag_context, max_len=1200))
            print()
    except Exception as e:
        print(f"Error: {e}")
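For reference, the three behaviors of the trimming helper (short text passes through, long text is cut with an ellipsis, None becomes an empty string) can be checked in isolation. The function body is copied from the cell above so the block runs on its own; the inputs are arbitrary.

```python
def preview(text, max_len=900):
    """Copy of the notebook helper: truncate to max_len and mark the cut with '...'."""
    text = text or ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


short = preview("Mexico City", max_len=20)  # shorter than the limit: unchanged
long_ = preview("x" * 50, max_len=20)       # longer: 20 characters plus "..."
empty = preview(None)                       # None is treated as empty text

print(repr(short), len(long_), repr(empty))  # 'Mexico City' 23 ''
```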

Checkpoint 4: Evaluation Reflection

Now that you have seen the with-RAG and without-RAG answers side by side, reflect on what the comparison reveals about each approach’s strengths and weaknesses.

Compare the RAG answer against the retrieved context preview. Which specific claims in the answer are directly supported by the retrieved chunks, and which (if any) appear to be the model filling in gaps?
What additional evidence would improve confidence in the answer? Would retrieving more chunks (n_results=8 instead of 4) help, or would it introduce noise?
Design one out-of-scope query (about a country not in our index, or a topic the Factbook does not cover) and predict how each approach -- with RAG and without RAG -- should respond.

Key takeaways

RAG pipelines have four phases -- ingestion, chunking, vector storage, and grounded generation -- and each can be swapped independently without rewriting the others.
Flattening nested JSON into key-path text lines gives embeddings semantic tokens to work with instead of wasting capacity on structural punctuation.
Recursive chunking with overlap makes it less likely that a fact is split across a chunk boundary, and chunk_size and chunk_overlap are among the most impactful retrieval-quality knobs.
ChromaDB plus all-MiniLM-L6-v2 embeds and retrieves locally with no API keys or billing; only the data download and the final LLM call need network access, and the LLM call needs one provider key.
Comparing with-RAG and without-RAG answers side by side exposes hallucination, and the retrieved-context preview is your audit trail for diagnosing wrong answers.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~5 minutes (data download + embedding + LLM)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/RAG_countries_NA.ipynb
