LlamaIndex RAG: A Simpler Single-Company Demo

This notebook is the beginner-friendly version of our LlamaIndex RAG pipeline. Rather than working with multiple data sources or complex configurations, we focus on a single fictional company (Acme Analytics) with just five short documents. This makes it easy to see exactly what LlamaIndex does at each step without getting lost in data complexity.

If you are new to RAG frameworks, start here. The notebook walks through the complete LlamaIndex workflow -- loading documents, building a vector index, querying with retrieval-augmented generation, and comparing RAG answers against a direct LLM call. Once you are comfortable with these basics, move on to the full llamaindex_rag.ipynb sibling, which handles multiple companies, custom chunking, and more advanced configuration knobs.

LlamaIndex RAG Demo: Document Q&A with a Framework

This notebook demonstrates how LlamaIndex simplifies building RAG pipelines. Instead of manually chunking documents, computing embeddings, and querying vector stores, LlamaIndex handles the plumbing so you can focus on the application logic.

How it works:

Create a small set of synthetic company documents (no external data needed).
Use LlamaIndex to ingest, chunk, embed, and index the documents in one step.
Given a natural-language question, LlamaIndex retrieves relevant chunks and sends them to an LLM.
Compare answers with RAG (LlamaIndex pipeline) vs. without RAG (direct LLM call).

Learning goals:

Understand what a RAG framework (LlamaIndex) does under the hood
See how LlamaIndex abstracts document loading, chunking, embedding, and retrieval
Compare the framework approach to the manual RAG we built in earlier notebooks
Observe how RAG grounding reduces hallucination on private/synthetic data

Provider setup: This notebook uses the llm_cascade package, which auto-detects your API keys and falls back to the next provider if one is unavailable. Supported providers: OpenAI, Gemini, Ollama, Grok (xAI), Groq, HuggingFace, Cohere, OpenRouter.

Store any of these API keys in Colab Secrets (or a local .env file): OPENAI_API_KEY, GEMINI_API_KEY, OLLAMA_API_KEY, XAI_API_KEY, GROQ_API_KEY, HF_TOKEN, COHERE_API_KEY, OPENROUTER_API_KEY

!pip install -q -U llama-index llama-index-llms-openai-like llama-index-llms-gemini llama-index-embeddings-huggingface google-genai openai git+https://github.com/KarAnalytics/llm_cascade.git

1) Imports and Provider Helpers (8-Vendor Cascade)

We configure LlamaIndex to use whichever LLM provider you have available. The cascade tries each provider in order and falls through on quota errors.

For embeddings, we use a local HuggingFace model (all-MiniLM-L6-v2) so that embedding always works regardless of which LLM API key you have.

from pathlib import Path
from llm_cascade.providers import PROVIDERS, _load_env, _get_key, _is_retriable_error

_load_env()

def get_available_providers():
    return [p for p in PROVIDERS if _get_key(p['key_env'])]

def has_llm_provider():
    return len(get_available_providers()) > 0

# Print status
available = get_available_providers()
if available:
    print('Providers configured (in fallback order):')
    for p in available:
        print(f"  + {p['name']:<16} model = {p['default_model']}")
else:
    print('WARNING: No API keys found.')
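The fallback behaviour described at the start of this section can be pictured with a few lines of plain Python. This is only a sketch of the general pattern, not the llm_cascade implementation: `QuotaError`, `generate_with_fallback`, and the two fake providers are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


class QuotaError(Exception):
    """Hypothetical stand-in for a provider's quota or rate-limit failure."""


def generate_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (text, provider_name)."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt), name
        except QuotaError as exc:
            # Retriable failure: record it and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Two fake providers: the first is out of quota, the second answers.
def exhausted(prompt: str) -> str:
    raise QuotaError("quota exceeded")


def echo(prompt: str) -> str:
    return f"echo: {prompt}"


text, used = generate_with_fallback("hello", [("first", exhausted), ("second", echo)])
print(used, "->", text)
```

Only the error type treated as retriable triggers the fall-through; any other exception propagates immediately, which keeps genuine bugs visible.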

2) Configure LlamaIndex LLM and Embeddings

LlamaIndex needs two things:

An LLM for generating answers (we pick the first available provider from our cascade)
An embedding model for converting text to vectors (we use a local HuggingFace model — no API key needed)

This cell configures LlamaIndex’s global Settings so all downstream components use our chosen models.

from llama_index.core import Settings
from llama_index.core.llms import CustomLLM, LLMMetadata, CompletionResponse, CompletionResponseGen
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llm_cascade import get_cascade
from typing import Any

# ---- Embedding model (local, no API key needed) -----------------------------
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Embedding model: sentence-transformers/all-MiniLM-L6-v2 (local)")


# ---- Custom LlamaIndex LLM that wraps llm_cascade for automatic fallback ----
class CascadeLLM(CustomLLM):
    """LlamaIndex LLM that delegates to llm_cascade (8-provider fallback)."""
    context_window: int = 8000
    num_output: int = 1024
    model_name: str = "llm_cascade"

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        cascade = get_cascade(verbose=False)
        response = cascade.generate(prompt)
        return CompletionResponse(text=response.text, additional_kwargs={"provider": response.provider, "model": response.model})

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Non-streaming fallback
        resp = self.complete(prompt, **kwargs)
        def gen():
            yield resp
        return gen()


Settings.llm = CascadeLLM()
print("LLM: CascadeLLM (auto-fallback across available providers)")

3) Create Synthetic Company Documents

We generate a small set of fictional company documents that the LLM has never seen in training. This ensures that correct answers can only come from RAG retrieval, not from parametric memory.

Document	Topic
Company Overview	Acme Analytics Inc. — founding, mission, HQ
Products & Services	Three product lines with pricing
Q4 2025 Financial Results	Revenue, profit, growth numbers
Employee Handbook (excerpt)	PTO policy, remote work rules
Customer Case Study	BigRetail Corp deployment results

DOCUMENTS = {
    "company_overview.txt": """ACME ANALYTICS INC. — COMPANY OVERVIEW

Acme Analytics Inc. was founded in 2019 by Dr. Sarah Chen and Marcus Rivera
in Lawrence, Kansas. The company specializes in AI-powered business intelligence
tools for mid-market companies (100–5,000 employees).

Headquarters: 1420 Jayhawk Boulevard, Lawrence, KS 66045
Employees: 287 (as of January 2026)
Annual Revenue (2025): $42.3 million
Funding: Series B ($18M raised in March 2023 from Midwest Ventures)

Mission: "To democratize data analytics so every business decision is informed
by evidence, not intuition."

The company operates three offices: Lawrence (HQ), Chicago (sales), and
Austin (engineering). CEO: Dr. Sarah Chen. CTO: Marcus Rivera. CFO: Linda Park.""",

    "products_and_services.txt": """ACME ANALYTICS — PRODUCTS AND SERVICES

1. InsightBoard Pro (Flagship Product)
   - Real-time dashboard platform with natural language query interface
   - Pricing: $49/user/month (Standard), $89/user/month (Enterprise)
   - Supports PostgreSQL, MySQL, Snowflake, BigQuery, and Redshift
   - 1,847 active enterprise customers as of Q4 2025

2. DataPipe ETL
   - Automated data pipeline builder with 200+ pre-built connectors
   - Pricing: starts at $500/month for up to 10 million records/day
   - Launched in September 2024

3. PredictIQ
   - ML-powered forecasting add-on for InsightBoard Pro
   - Pricing: $29/user/month (requires InsightBoard Pro subscription)
   - Uses proprietary time-series model trained on retail and logistics data
   - Beta launched March 2025, GA release planned for June 2026

All products include 24/7 email support. Enterprise plans include dedicated
account manager and 99.9% SLA.""",

    "q4_2025_financials.txt": """ACME ANALYTICS — Q4 2025 FINANCIAL RESULTS (CONFIDENTIAL)

Period: October 1 – December 31, 2025

Revenue:          $12.8 million (Q4) / $42.3 million (FY 2025)
Gross Margin:     78.2%
Operating Profit: $1.9 million (Q4) / $5.1 million (FY 2025)
Net Income:       $1.4 million (Q4)
Cash on Hand:     $14.7 million
Burn Rate:        Company is cash-flow positive since Q2 2025

Key Metrics:
- ARR (Annual Recurring Revenue): $48.6 million (up 34% YoY)
- Net Revenue Retention: 118%
- Customer Acquisition Cost (CAC): $8,200
- Customer Lifetime Value (LTV): $67,400
- LTV/CAC Ratio: 8.2x

Headcount grew from 241 to 287 employees during 2025.
R&D spending: 31% of revenue. Sales & Marketing: 28% of revenue.""",

    "employee_handbook_excerpt.txt": """ACME ANALYTICS — EMPLOYEE HANDBOOK (EXCERPT)

PAID TIME OFF (PTO):
- All full-time employees receive 22 days of PTO per year (accrued monthly).
- PTO increases to 27 days after 3 years of service.
- Unused PTO can be carried over (max 5 days) or paid out at year-end.
- Sick leave: 10 days per year (separate from PTO).

REMOTE WORK POLICY:
- Engineering and Data Science teams: fully remote eligible.
- Sales and Customer Success: hybrid (minimum 2 days/week in office).
- All employees may work remotely up to 4 weeks/year from any US location.
- International remote work requires VP approval and tax review.

PROFESSIONAL DEVELOPMENT:
- Annual learning budget: $2,500 per employee.
- Conference attendance: up to 2 conferences per year with manager approval.
- Tuition reimbursement: up to $5,250/year for degree programs.""",

    "customer_case_study.txt": """CUSTOMER CASE STUDY: BIGRETAIL CORP

Company: BigRetail Corp (1,200 retail stores across 38 states)
Challenge: Siloed data across POS, inventory, and CRM systems made it
impossible for regional managers to get timely insights.

Solution: Deployed InsightBoard Pro Enterprise + DataPipe ETL
- Connected 14 data sources in 3 weeks using DataPipe
- 340 regional managers now use InsightBoard daily
- Natural language queries replaced manual SQL report requests

Results (after 6 months):
- 62% reduction in time-to-insight (from 4 days to 1.5 days average)
- $3.2 million saved in inventory carrying costs
- 23% increase in regional manager satisfaction scores
- SQL report request backlog eliminated entirely

Quote from BigRetail CIO Janet Torres:
\"InsightBoard Pro transformed how our managers interact with data.
They went from waiting days for a report to asking questions in plain
English and getting answers in seconds.\"""",
}

# Write documents to disk
DOC_DIR = Path("acme_docs")
DOC_DIR.mkdir(exist_ok=True)

for filename, content in DOCUMENTS.items():
    (DOC_DIR / filename).write_text(content, encoding="utf-8")

print(f"Created {len(DOCUMENTS)} documents in '{DOC_DIR}/':")
for f in sorted(DOC_DIR.iterdir()):
    print(f"  {f.name} ({f.stat().st_size} bytes)")

4) Ingest Documents into LlamaIndex

This is where LlamaIndex shines — one line to load all documents, and one line to build a searchable index.

Under the hood, LlamaIndex:

Reads each .txt file
Splits text into chunks (by default 1024 tokens per chunk with a 200-token overlap, so each of our short documents most likely ends up as a single chunk)
Computes embeddings for each chunk using our HuggingFace model
Stores them in an in-memory vector index
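To make the splitting step concrete, here is a minimal sketch of overlapping chunking. It counts whitespace-separated words rather than tokens and is not LlamaIndex's splitter; `chunk_words` and the tiny sizes are assumptions chosen so the output is easy to read.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "All full-time employees receive 22 days of PTO per year accrued monthly"
for i, chunk in enumerate(chunk_words(sample, chunk_size=6, overlap=2), start=1):
    print(i, chunk)
```

The first window ends in the middle of the phrase "22 days of PTO", but because the second window starts two words earlier, the full phrase still appears intact in one chunk. That is the purpose of overlap.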

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Step 1: Load documents from the folder
documents = SimpleDirectoryReader(str(DOC_DIR)).load_data()
print(f"Loaded {len(documents)} documents")

# Step 2: Build vector index (chunks, embeds, and indexes every document)
index = VectorStoreIndex.from_documents(documents, show_progress=True)
print(f"Vector index built from {len(documents)} documents")

# Step 3: Create a query engine
query_engine = index.as_query_engine(similarity_top_k=3)
print("Query engine ready (top_k=3 retrieval)")

5) What Just Happened? Peeking Under the Hood

Let’s inspect what LlamaIndex retrieves before any LLM is involved: which chunks come back for a test query, which file each one came from, and its similarity score.

# Inspect the chunks LlamaIndex created
retriever = index.as_retriever(similarity_top_k=3)

# Test retrieval without LLM — just see what chunks come back
test_query = "How much revenue did Acme make in Q4 2025?"
retrieved_nodes = retriever.retrieve(test_query)

print(f"Query: '{test_query}'")
print(f"Retrieved {len(retrieved_nodes)} chunks:\n")
for i, node in enumerate(retrieved_nodes, 1):
    source = node.metadata.get('file_name', 'unknown')
    score = node.score
    text_preview = node.text[:200].replace('\n', ' ')
    print(f"  Chunk #{i} (score={score:.4f}, source={source})")
    print(f"    {text_preview}...\n")
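The score printed for each chunk is a vector similarity between the query and the chunk. The sketch below shows the idea with cosine similarity over toy word-count vectors; a real embedding model such as all-MiniLM-L6-v2 produces dense float vectors instead, and `embed`, `cosine`, and `top_k` are hypothetical helpers, not LlamaIndex functions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real model returns dense floats)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: dict[str, str], k: int) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k best."""
    q = embed(query)
    scored = [(name, cosine(q, embed(text))) for name, text in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


chunks = {
    "financials": "q4 2025 revenue was 12.8 million",
    "handbook": "employees receive 22 days of pto per year",
    "overview": "acme analytics was founded in 2019",
}
results = top_k("acme revenue in q4 2025", chunks, k=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

Word counts only reward exact word matches; learned embeddings are meant to place paraphrases close together as well, which is why the notebook uses a sentence-transformer model rather than anything like this toy.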

6) Query with RAG vs. Without RAG

Now we compare:

With RAG (LlamaIndex): The query engine retrieves relevant chunks, sends them + the question to the LLM, and returns a grounded answer.
Without RAG (Direct LLM): We ask the same question directly to the LLM with no context. Since the data is fictional, the LLM either refuses or hallucinates.

For the “without RAG” call we define generate_text_no_rag(), a thin wrapper around the cascade's generate() method that sends the question with no retrieved context.

from llm_cascade import get_cascade

_llm_raw = get_cascade(verbose=False)


def generate_text_no_rag(prompt, system_prompt=None):
    """Direct LLM call via llm_cascade (no RAG context)."""
    response = _llm_raw.generate(prompt, system_prompt=system_prompt)
    return response.text, response.provider


print('Direct LLM function ready (for without-RAG comparison).')
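The only difference between the two paths is what text reaches the LLM. The sketch below assembles a with-RAG prompt by placing retrieved chunks ahead of the question. The template wording and `build_rag_prompt` are assumptions for illustration; LlamaIndex ships its own prompt templates, which this does not reproduce.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into a context block placed ahead of the question."""
    context = "\n---\n".join(chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "How many days of PTO do new employees get at Acme?"
retrieved = ["All full-time employees receive 22 days of PTO per year (accrued monthly)."]

with_rag_prompt = build_rag_prompt(question, retrieved)
without_rag_prompt = question  # the direct call sends the bare question

print(with_rag_prompt)
```

With the context in the prompt, the model can copy the figure from the text. With the bare question, it has nothing to draw on except its training data.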

7) Run End-to-End Examples: With RAG vs. Without RAG

We test questions about Acme Analytics — a fictional company that no LLM has seen in training. This makes the with/without RAG comparison very clear:

With RAG: correct, specific answers grounded in our documents
Without RAG: refusals, hedging, or hallucinated numbers

questions = [
    #"What was Acme Analytics' revenue in Q4 2025?",
    "How many days of PTO do new employees get at Acme?",
    #"What products does Acme Analytics offer and what do they cost?",
    #"How much did BigRetail Corp save using Acme's products?",
    #"Who founded Acme Analytics and where is the company headquartered?",
]


def preview(text, max_len=800):
    text = str(text) if text else ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


if not has_llm_provider():
    print("Error: No LLM API key configured. Set at least one API key in Colab Secrets.")
else:
    for i, q in enumerate(questions, start=1):
        print("=" * 80)
        print(f"Q{i}. {q}")
        print("=" * 80)

        # --- WITH RAG (LlamaIndex) ---
        print("--- WITH RAG (LlamaIndex) ---")
        try:
            response = query_engine.query(q)
            print("  Answer:")
            print(preview(response.response))
            sources = set()
            for node in response.source_nodes:
                sources.add(node.metadata.get("file_name", "unknown"))
            print(f"  Sources: {', '.join(sorted(sources))}")  # sorted for stable output
        except Exception as e:
            print(f"  RAG error: {e}")

        # --- WITHOUT RAG ---
        print("--- WITHOUT RAG (LLM knowledge only) ---")
        try:
            answer_direct, prov_direct = generate_text_no_rag(q)
            print(f"  Answer (provider: {prov_direct}):")
            print(preview(answer_direct))
        except Exception as e:
            print(f"  Direct error: {e}")

        print()

Checkpoint: Reflection Questions

Hallucination check: Did the without-RAG answers invent specific revenue numbers or employee counts for a fictional company?
Source attribution: LlamaIndex tells us which document the answer came from. Why is this important for trust?
Chunking: How does the chunk size affect retrieval quality? What if a key fact spans two chunks?
Framework vs. manual: Compare this LlamaIndex approach to the manual RAG we built in the VectorDB and DBMS_RAG notebooks. What did the framework handle for us?

Key takeaways

Two-line RAG -- SimpleDirectoryReader plus VectorStoreIndex.from_documents is enough to turn a folder of .txt files into a working Q&A system.
Global Settings centralize your LLM and embedding choices so every downstream component inherits them automatically.
Local embeddings via HuggingFace’s all-MiniLM-L6-v2 keep the pipeline API-key-free for the vector step, isolating cost to the LLM call.
Source nodes returned alongside every answer show which chunk was retrieved, making hallucinations easy to spot.
Synthetic Acme data guarantees the LLM has zero prior knowledge, so any correct specifics must be coming from retrieval.
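One rough way to act on the source-node takeaway is to check whether every number in an answer also appears in the retrieved text. The sketch below is a string-matching heuristic under that assumption, not a LlamaIndex feature; `unsupported_numbers` is a hypothetical helper, and it would miss reformatted figures such as "12.8M" versus "12,800,000".

```python
import re

# Integers, thousands-separated integers, and decimals: 22, 1,847, 12.8
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def unsupported_numbers(answer: str, source_texts: list[str]) -> list[str]:
    """Return numbers that appear in the answer but in none of the source texts."""
    source_numbers = set()
    for text in source_texts:
        source_numbers.update(NUMBER.findall(text))
    return [n for n in NUMBER.findall(answer) if n not in source_numbers]


sources = ["Revenue: $12.8 million (Q4) / $42.3 million (FY 2025)"]
grounded = "Q4 revenue was $12.8 million."
invented = "Q4 revenue was $15.2 million."

print(unsupported_numbers(grounded, sources))
print(unsupported_numbers(invented, sources))
```

In the notebook, the source texts would come from the text of each entry in response.source_nodes, and a non-empty result would flag an answer worth reading more carefully.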

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~3 minutes

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/LlamaIndex_RAG_simple_single_company.ipynb