RAG over Many Documents with ChromaDB

Learning goals:

Understand data ingestion and document construction
Learn recursive chunking for retrieval quality
Build a vector index with ChromaDB
Ground LLM generation using retrieved context

This is the full-scale sibling of rag_countries_chromadb.ipynb (North America only) and of Chapter 8’s from-scratch RAG. It follows the same seven-step pipeline, scaled to every country on earth so that the embedding and retrieval costs actually start to bite.

# Install modern package names used by current LangChain ecosystem.
# Note: text splitters are now in a dedicated package: langchain-text-splitters
%pip install -q -U google-genai chromadb langchain-community langchain-text-splitters pandas requests git+https://github.com/KarAnalytics/llm_cascade.git sentence-transformers

1) Imports and Environment Setup

This section imports core libraries and handles API key lookup.

In Google Colab, you can store your API keys in Secrets (e.g., GEMINI_API_KEY, OPENAI_API_KEY, GROQ_API_KEY). Outside Colab, the code falls back to environment variables.

import os
import pandas as pd
import requests
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

from llm_cascade import get_cascade

llm = get_cascade()

2) Build a Minimal RAG Pipeline

Pipeline phases:

Data ingestion from GitHub JSON documents
Chunking with overlap to preserve context continuity
Vector storage and semantic retrieval in ChromaDB
Grounded generation with the LLM returned by get_cascade()

Function: `_collect_text_lines`

This recursive helper converts nested JSON into flat text lines that can be embedded and retrieved.

Why it matters: vector databases work on text, so structured JSON must be turned into readable, searchable content first.

# --- PHASE 1: DATA INGESTION (Pandas + GitHub) ---
def _collect_text_lines(obj, parent_key=""):
    """Recursively flatten nested JSON into readable key-value text lines."""
    lines = []

    if isinstance(obj, dict):
        for k, v in obj.items():
            next_key = f"{parent_key}.{k}" if parent_key else str(k)
            lines.extend(_collect_text_lines(v, next_key))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            next_key = f"{parent_key}[{i}]" if parent_key else f"[{i}]"
            lines.extend(_collect_text_lines(item, next_key))
    else:
        if obj is None:
            return []
        text = str(obj).strip()
        if text:
            lines.append(f"{parent_key}: {text}")
    return lines
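To make the flattening concrete, here is a small self-contained sketch. It uses a condensed restatement of the helper under the hypothetical name `flatten_json`, and the sample record is made up for illustration (it is not real factbook data).

```python
def flatten_json(obj, parent_key=""):
    """Condensed restatement of _collect_text_lines, for illustration only."""
    if isinstance(obj, dict):
        return [
            line
            for k, v in obj.items()
            for line in flatten_json(v, f"{parent_key}.{k}" if parent_key else str(k))
        ]
    if isinstance(obj, list):
        return [
            line
            for i, item in enumerate(obj)
            for line in flatten_json(item, f"{parent_key}[{i}]")
        ]
    # Leaf value: skip None and empty strings, otherwise emit "path: value".
    text = "" if obj is None else str(obj).strip()
    return [f"{parent_key}: {text}"] if text else []


# Made-up sample record (assumption, not taken from the factbook repo).
sample = {
    "Government": {
        "Capital": {"name": " Ottawa "},
        "Languages": ["English", "French"],
    },
    "Notes": None,
}

sample_lines = flatten_json(sample)
for line in sample_lines:
    print(line)
# Government.Capital.name: Ottawa
# Government.Languages[0]: English
# Government.Languages[1]: French
```

Each line carries its full key path, so a chunk that lands in the middle of a document still says which field a value belongs to.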

Function: `_get_country_json_paths`

This function discovers which country JSON files exist in the GitHub repository.

Why it matters: the demo stays flexible because it does not rely on a hard-coded country list.

def _get_country_json_paths():
    """Discover all country JSON files in the factbook repo (no hard-coded country list)."""
    tree_url = "https://api.github.com/repos/factbook/factbook.json/git/trees/master?recursive=1"
    r = requests.get(tree_url, timeout=30)
    r.raise_for_status()
    tree = r.json().get("tree", [])

    country_paths = []
    for item in tree:
        path = item.get("path", "")
        if item.get("type") == "blob" and path.endswith(".json") and "/" in path:
            if not path.lower().startswith(("source/", "tools/", "scripts/")):
                country_paths.append(path)

    # Deduplicate and sort for stable classroom demos.
    return sorted(set(country_paths))
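The filtering rule can be tried offline before hitting the GitHub API. The sketch below applies the same conditions to a hand-written tree listing. The function name `filter_country_paths` and the entries in `mock_tree` are assumptions made for this example, not the actual contents of the repository.

```python
def filter_country_paths(tree, excluded=("source/", "tools/", "scripts/")):
    """Keep .json blobs that live inside a folder, skipping excluded folders."""
    paths = set()
    for item in tree:
        path = item.get("path", "")
        is_nested_json = (
            item.get("type") == "blob" and path.endswith(".json") and "/" in path
        )
        # str.startswith accepts a tuple, so one call checks every excluded prefix.
        if is_nested_json and not path.lower().startswith(excluded):
            paths.add(path)
    return sorted(paths)


# Hand-written stand-in for the "tree" list returned by the Trees API (assumption).
mock_tree = [
    {"path": "README.md", "type": "blob"},
    {"path": "north-america", "type": "tree"},
    {"path": "north-america/us.json", "type": "blob"},
    {"path": "north-america/ca.json", "type": "blob"},
    {"path": "tools/config.json", "type": "blob"},
    {"path": "package.json", "type": "blob"},
]

country_paths = filter_country_paths(mock_tree)
print(country_paths)
# ['north-america/ca.json', 'north-america/us.json']
```

Top-level files such as `package.json` drop out because they contain no `/`, and folder entries drop out because their type is `tree` rather than `blob`.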

Function: `get_world_data`

This function fetches each country record, flattens it into text, and packages it with metadata.

Why it matters: it creates the retrieval-ready documents that the rest of the RAG pipeline depends on.

def get_world_data():
    """Fetch all country records and convert each into one retrieval-ready full-text document."""
    print("Step 1: Discovering country files from GitHub...")
    base_url = "https://raw.githubusercontent.com/factbook/factbook.json/master"
    country_paths = _get_country_json_paths()
    print(f"Discovered {len(country_paths)} country files.")

    docs = []
    for path in country_paths:
        try:
            r = requests.get(f"{base_url}/{path}", timeout=20)
            r.raise_for_status()
            payload = r.json()
        except Exception as exc:
            print(f"Skipping {path}: {exc}")
            continue

        lines = _collect_text_lines(payload)
        if not lines:
            continue

        country_code = path.split("/")[-1].replace(".json", "").upper()
        combined_text = (
            f"COUNTRY_CODE: {country_code}\n"
            f"SOURCE_PATH: {path}\n\n"
            + "\n".join(lines)
        )

        docs.append(
            {
                "content": combined_text,
                "metadata": {
                    "country": country_code,
                    "path": path,
                },
            }
        )

    print(f"Loaded {len(docs)} country documents.")
    return docs
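For reference, the shape of one record can be reproduced without any network access. This is a minimal sketch: `build_document` is a hypothetical helper, the path and the single text line are made-up inputs, and it uses `PurePosixPath.stem` where the function above uses `split` and `replace` (the two agree for paths like the one shown).

```python
from pathlib import PurePosixPath


def build_document(path, lines):
    """Package flattened lines and metadata in the same shape as get_world_data."""
    country_code = PurePosixPath(path).stem.upper()  # "north-america/ca.json" -> "CA"
    content = (
        f"COUNTRY_CODE: {country_code}\n"
        f"SOURCE_PATH: {path}\n\n"
        + "\n".join(lines)
    )
    return {"content": content, "metadata": {"country": country_code, "path": path}}


# Made-up inputs for illustration (assumption).
doc = build_document("north-america/ca.json", ["Government.Capital.name: Ottawa"])
print(doc["metadata"])
# {'country': 'CA', 'path': 'north-america/ca.json'}
print(doc["content"])
```

Repeating the country code at the top of the text, and again in the metadata, gives the retriever a textual hook and leaves room for metadata filtering later.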

Function: `chunk_documents`

This function splits long country documents into overlapping chunks before indexing.

Why it matters: smaller chunks improve retrieval precision, while overlap reduces the chance of losing context at chunk boundaries.

# --- PHASE 2: RECURSIVE CHUNKING ---
def chunk_documents(docs, chunk_size=800, chunk_overlap=100):
    """Chunk long records so semantic retrieval can match smaller relevant spans."""
    print("Step 2: Chunking with overlap...")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    chunks, metadatas, ids = [], [], []
    for doc in docs:
        split_texts = splitter.split_text(doc["content"])
        for i, text in enumerate(split_texts):
            chunks.append(text)
            metadatas.append(doc["metadata"])
            ids.append(f"{doc['metadata']['country']}_{i}")
    return chunks, metadatas, ids


# Small inspection check for students before moving on.
docs_preview = get_world_data()
print(f"Loaded {len(docs_preview)} source documents.")
print("Example metadata:", docs_preview[0]["metadata"] if docs_preview else "No documents loaded")
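To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain fixed-size sliding window. This is a deliberately simpler technique than `RecursiveCharacterTextSplitter`, which tries to break on separators such as newlines before falling back to smaller units, so real chunk counts will differ. The function name `sliding_chunks` and the synthetic 2,000-character text are assumptions for illustration.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=100):
    """Fixed-size windows; each window starts (chunk_size - chunk_overlap) later."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Synthetic 2,000-character document (assumption).
text = "".join(str(i % 10) for i in range(2000))

coarse = sliding_chunks(text, chunk_size=800, chunk_overlap=100)
fine = sliding_chunks(text, chunk_size=300, chunk_overlap=100)

print(len(coarse), len(fine))
# 3 10
print(coarse[0][-100:] == coarse[1][:100])
# True
```

Smaller chunks mean more records to embed and store, and the shared 100 characters between neighbours are what keep a sentence that straddles a boundary retrievable from at least one side.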

Checkpoint 1: Data and Chunking Reflection

Why do we transform structured JSON fields into one text document per country before retrieval?
How might changing chunk_size from 800 to 300 affect precision and recall?
What is the purpose of overlap, and what failure mode does it reduce?

2A) Build and Populate the Vector Store

In this phase, we transform chunks into retrievable vector records inside ChromaDB.

Teaching note: for reproducible classroom runs, we clear the collection before re-inserting documents.

Function: `build_vector_store`

This function creates the Chroma collection, clears older demo records, and writes new chunk batches into the vector store.

Why it matters: it is the bridge between text preprocessing and semantic retrieval.

**This code ran for about 30 minutes for me, so be warned. Try a GPU/TPU runtime to speed it up.**

def build_vector_store(docs, db_path="./factbook_db", collection_name="world_facts", batch_size=5000):
    """Create (or reset) a Chroma collection and upsert chunked records in batches."""
    chunks, metadatas, ids = chunk_documents(docs)

    chroma_client = chromadb.PersistentClient(path=db_path)
    collection = chroma_client.get_or_create_collection(name=collection_name)

    # Clear older records so each class run starts cleanly.
    existing = collection.get(include=[])
    if existing.get("ids"):
        collection.delete(ids=existing["ids"])

    print("Step 3: Storing vectors in ChromaDB...")
    for start in range(0, len(ids), batch_size):
        stop = start + batch_size
        collection.upsert(
            documents=chunks[start:stop],
            metadatas=metadatas[start:stop],
            ids=ids[start:stop],
        )

    print(f"Stored {len(ids)} chunks in collection '{collection_name}'.")
    return collection

collection = build_vector_store(docs_preview)
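The batching loop relies on Python clipping slices that run past the end of a list, which is why no explicit `min()` is needed for the final batch. A minimal sketch of the index arithmetic, with a hypothetical helper `batch_ranges` and a made-up total of 12,000 chunks:

```python
def batch_ranges(total, batch_size):
    """Return the (start, stop) index pairs that the upsert loop walks through."""
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]


# Made-up chunk count for illustration (assumption).
ranges = batch_ranges(total=12000, batch_size=5000)
print(ranges)
# [(0, 5000), (5000, 10000), (10000, 12000)]

# Slicing past the end is safe: the last batch is simply shorter.
items = list(range(12))
print(items[10:15])
# [10, 11]
```

Keeping `documents`, `metadatas`, and `ids` sliced with the same `start:stop` pair is what keeps the three lists aligned within each batch.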

Checkpoint 2: Vector Store Reflection

Why do we clear old records before upserting during a classroom demo?
What reproducibility issue appears if we repeatedly insert without cleanup?
If this were production, when would you keep historical vectors instead?

2B) Retrieve Context and Generate a Grounded Answer

This phase demonstrates the core RAG idea: retrieve relevant evidence first, then condition the LLM on that evidence.

Prompting rule used here: answer only from retrieved context; otherwise say you do not know.
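The "retrieve first" step can be sketched without a vector database. The toy below ranks chunks by cosine similarity over word counts. This is only a stand-in: ChromaDB embeds text with a neural model and searches dense vectors, so the names `toy_embed`, `cosine`, and `top_k`, along with the three sample chunks, are all assumptions made for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in 'embedding': lowercase word counts instead of a dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks for illustration (assumption).
toy_chunks = [
    "japan economy exports cars and electronics",
    "brazil inflation consumer prices",
    "canada inflation consumer prices",
]

hits = top_k("japan economy", toy_chunks, k=1)
print(hits)
# ['japan economy exports cars and electronics']
```

Raising `k` pulls in lower-scoring chunks as well, which is the same trade-off `n_results` controls in the real pipeline.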

Functions: `answer_with_rag` and `answer_without_rag`

These two functions create the core teaching comparison in the notebook.

answer_with_rag retrieves evidence before generation. answer_without_rag sends the question directly to the LLM with no retrieved context.

def answer_with_rag(question, collection, n_results=4):
    """Retrieve nearest chunks, then ask LLM to answer using only retrieved context."""
    print(f"Step 4: Retrieving context for: '{question}'")
    results = collection.query(query_texts=[question], n_results=n_results)
    # results["documents"] holds one list of chunks per query text; we sent one query.
    context = "\n\n".join(results["documents"][0])

    print("Step 5: Generating grounded answer with LLM...")
    system_prompt = (
        "You are a careful analyst. Use ONLY the provided context to answer. "
        "If the answer is not in the context, say you don't know."
    )
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
    response = llm.generate(prompt, system_prompt=system_prompt)
    return response.text, context


def answer_without_rag(question):
    '''Direct LLM answer without retrieval for comparison against RAG.'''
    response = llm.generate(question)
    return response.text
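The `[0]` index in the retrieval step is easy to miss: a Chroma query returns one inner list per query text, so a single question yields a list containing one list of chunks. The sketch below reproduces the prompt assembly against a hand-written stand-in for that result. `build_prompt` is a hypothetical helper, and the ids and chunk texts are placeholders rather than real query output.

```python
def build_prompt(results, question):
    """Join the chunks retrieved for the first (only) query into one prompt."""
    context = "\n\n".join(results["documents"][0])
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"
    return prompt, context


# Hand-written stand-in for a query result with one query text (assumption).
mock_results = {
    "ids": [["CA_0", "CA_1"]],
    "documents": [["first chunk text", "second chunk text"]],
}

prompt, context = build_prompt(mock_results, "What is the capital?")
print(prompt)
# CONTEXT:
# first chunk text
#
# second chunk text
#
# QUESTION:
# What is the capital?
```

Printing the assembled prompt in class is a quick way to check exactly what evidence the model was given before judging its answer.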

Checkpoint 3: Retrieval and Prompting Reflection

Why do we retrieve first and generate second in a RAG system?
What is the effect of changing n_results from 2 to 5?
How does the prompt reduce hallucinations, and where can it still fail?

2C) Run an End-to-End Query

You can now test questions and inspect both the answer and retrieved evidence.

Tip: print retrieved context in class to discuss why the model answered the way it did.

Function: `preview`

This small helper trims long model outputs and retrieved context so the classroom output stays readable.

Why it matters: it keeps the demo focused on interpretation instead of overwhelming students with raw text.

comparison_questions = [
    {
        "type": "Hallucination test (out-of-scope)",
        "question": "What is the current prime minister of Atlantis, and what is Atlantis GDP growth?",
    },
    {
        "type": "Local info extraction",
        "question": "Summarize Japan's economy overview in 3 bullet points.",
    },
    {
        "type": "Local info extraction",
        "question": "Compare inflation-related details for Brazil and Canada.",
    },
    {
        "type": "Local + public knowledge blend",
        "question": "Explain how Japan's economic profile may influence its role in global supply chains today.",
    },
]


def preview(text, max_len=900):
    text = text or ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


if "collection" not in globals():
    print("Error: Build the Chroma vector store in Cell 9 before running this comparison.")
else:
    try:
        for i, item in enumerate(comparison_questions, start=1):
            q_type = item["type"]
            q = item["question"]
            print("\n" + "=" * 90)
            print(f"Q{i}. {q_type}")
            print(f"QUESTION: {q}\n")

            direct_answer = answer_without_rag(q)
            print("[WITHOUT RAG]")
            print(preview(direct_answer))

            rag_answer, rag_context = answer_with_rag(q, collection, n_results=4)
            print("\n[WITH RAG]")
            print(preview(rag_answer))

            print("\n[RETRIEVED CONTEXT PREVIEW]")
            print(preview(rag_context, max_len=1200))
    except Exception as e:
        print(f"Error: {e}")

Checkpoint 4: Evaluation Reflection

Compare the final answer against the retrieved context. Which statements are directly supported?
What additional evidence would improve confidence in the answer?
Design one out-of-scope query and explain how the system should respond.

3) Teaching Notes and Suggested Exercises

Exercises for class discussion:

Change chunk_size and chunk_overlap; evaluate answer quality
Increase n_results and observe grounding changes
Test comparative questions that span several countries
Intentionally ask out-of-scope questions to inspect hallucination control

Key takeaways

Scaling from regional to global (200+ countries) shows how embedding cost and runtime grow linearly with the corpus -- plan for GPU/TPU when indexing at scale.
Dynamic discovery via the GitHub Trees API keeps the pipeline flexible, so adding or removing countries does not require editing a hard-coded list.
Clearing the collection before each upsert keeps classroom runs reproducible, but in production you would version collections or deduplicate incrementally instead.
Out-of-scope questions (like “Atlantis”) reveal whether the “answer only from context” prompt actually prevents hallucination or just softens it.
Tuning n_results and chunk size trades precision for recall -- more chunks add evidence but also noise that can dilute the grounded answer.

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~10 minutes (200+ countries, embedding heavy)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/RAG_allcountries_ChromaDB.ipynb