RAG with a Featherweight LLM

Can Retrieval-Augmented Generation work with a model so small it barely qualifies as a “large” language model? In this notebook we find out. We wire SmolLM2-135M-Instruct, a Hugging Face model with only 135 million parameters, into a full LangChain RAG pipeline backed by ChromaDB. The point is not to get GPT-4-quality answers; it is to show that the RAG pattern (retrieve relevant context, stuff it into a prompt, generate) still works at the smallest scale. If you understand this notebook, you understand the skeleton that production RAG systems are built on; the main difference is swapping in a bigger brain.

The retrieval half of this pipeline is the same cosine-similarity-over-embeddings dance we implemented by hand in Chapter 8’s rag_first_principles.ipynb; what we are stress-testing here is whether a tiny generator can still make use of it.

Install Dependencies

The cell below installs everything we need in one shot. LangChain provides the orchestration glue — it connects the language model, the vector store, and the retrieval logic into a single pipeline. ChromaDB (langchain-chroma) is an open-source vector database that will store our document embeddings. HuggingFace Transformers gives us access to the SmolLM2 model itself, and sentence-transformers provides the embedding model that converts text into vectors. Think of this as assembling a toolkit: LangChain is the workbench, ChromaDB is the filing cabinet, and the HuggingFace libraries are the engine.
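After the install cell has run, a quick check like the one below can confirm that the toolkit is importable before any model is loaded. It is a minimal sketch: the module names in REQUIRED_MODULES are assumptions inferred from the libraries named above, not a copy of the notebook's install line.

```python
from importlib.util import find_spec

# Import names are assumptions based on the libraries named above;
# adjust them to match the install cell in your copy of the notebook.
REQUIRED_MODULES = [
    "langchain",
    "langchain_chroma",
    "chromadb",
    "transformers",
    "sentence_transformers",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

An empty result means every package in the list is installed in the current runtime.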

1. Load the “Micro” Model

Here is where the experiment gets interesting. SmolLM2-135M-Instruct has only 135 million parameters: roughly 1,300 times fewer than GPT-3’s 175 billion and, if the widely reported but unconfirmed 1.8-trillion figure for GPT-4 is right, over ten thousand times fewer than GPT-4. We chose it deliberately: it is small enough to run on a free Colab GPU in seconds, yet it has been instruction-tuned, so it understands the format of a question-and-answer prompt. We load it in half-precision (float16) and force it onto the GPU with device_map="cuda" to keep inference fast.
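To see what half-precision buys, here is some back-of-the-envelope arithmetic for the weights alone. It uses only the 135-million figure from the text; activations, the attention cache, and framework overhead are ignored, so treat the numbers as a lower bound on real memory use.

```python
PARAMS = 135_000_000  # SmolLM2-135M parameter count, as stated above
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_megabytes(n_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


fp32_mb = weight_megabytes(PARAMS, "float32")
fp16_mb = weight_megabytes(PARAMS, "float16")
print(f"float32: {fp32_mb:.0f} MB, float16: {fp16_mb:.0f} MB")
# prints: float32: 540 MB, float16: 270 MB
```

At roughly 270 MB in float16, the weights occupy a small fraction of a T4's memory, which is why loading and inference stay quick.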

The max_new_tokens=50 cap is another deliberate choice. A 135M-parameter model tends to ramble or repeat itself if you let it generate too much text. Capping the output at 50 tokens forces it to be concise — and lets us see quickly whether it actually absorbed the retrieved context or just hallucinated. Finally, we wrap the HuggingFace pipeline in HuggingFacePipeline so that LangChain can call it like any other LLM.
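What the cap does can be sketched with a toy decoding loop. This is not the Transformers implementation; next_token is a hypothetical stand-in for the model, and the loop only shows the two ways generation ends: the model emits an end-of-sequence token, or the cap is reached.

```python
def generate(next_token, prompt_tokens, max_new_tokens=50, eos="<eos>"):
    """Toy decoding loop: stop at the end-of-sequence token or at the cap."""
    output = []
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == eos:
            break
        output.append(token)
        context.append(token)
    return output


# Hypothetical stand-in for a model stuck in a loop: it never emits <eos>.
def repeater(context):
    return "again"


tokens = generate(repeater, ["Q:", "why?"], max_new_tokens=50)
print(len(tokens))  # prints: 50
```

Without the cap, a model that never produces the end-of-sequence token would keep generating until some other limit stopped it.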

2. Upload Your Document

RAG needs a knowledge source. Here we ask you to upload a plain-text file — this could be a Wikipedia article, a company memo, or any prose document. The file is read into a single Python string that we will chunk and embed in the next step. In a production system this upload step would be replaced by a document ingestion pipeline, but for a classroom demo, drag-and-drop keeps things simple.
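As a rough sketch of the "file into a single string" step: Colab's google.colab.files.upload() returns a mapping from filename to raw bytes, so the text has to be decoded before it can be chunked. The helper name first_upload_as_text and the sample file are hypothetical, and a hand-built dictionary stands in for the real upload so the snippet runs anywhere.

```python
def first_upload_as_text(uploaded, encoding="utf-8"):
    """Decode the first uploaded file (a filename -> bytes mapping) to a string."""
    if not uploaded:
        raise ValueError("No file was uploaded.")
    filename, payload = next(iter(uploaded.items()))
    # errors="replace" keeps going even if the file contains a few bad bytes
    return filename, payload.decode(encoding, errors="replace")


# In Colab the mapping would come from google.colab.files.upload();
# here a hand-built dictionary stands in for it.
uploaded = {"memo.txt": "The launch begins on Monday.\n".encode("utf-8")}
name, raw_text = first_upload_as_text(uploaded)
print(name, len(raw_text))
```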

3. Chunk, Embed, and Build the RAG Pipeline

This single cell does three things that form the backbone of any RAG system.

Chunking. The CharacterTextSplitter splits the uploaded text on its default separator (a blank line between paragraphs) and merges the pieces into chunks of at most roughly 1,000 characters with no overlap; a single paragraph longer than that is kept whole as an oversized chunk. Chunking matters because embedding models and LLMs have limited context windows, so feeding them the entire document at once would either truncate it or dilute the signal. Smaller chunks mean each vector represents a focused topic, which improves retrieval precision.
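The idea behind separator-based chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: split_text is a hypothetical helper that splits on blank lines and greedily merges paragraphs until the next one would push the chunk past chunk_size.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Split on a separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either we are starting a chunk or the piece still fits
            current = candidate
        else:
            # The piece does not fit: close the current chunk, start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["a" * 400, "b" * 400, "c" * 400]
chunks = split_text("\n\n".join(paragraphs), chunk_size=1000)
print([len(c) for c in chunks])  # prints: [802, 400]
```

Note that a paragraph longer than chunk_size is never cut in this sketch, so chunk sizes depend on where the blank lines fall in the uploaded document.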

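Embedding and storing. Each chunk is converted into a 384-dimensional vector by all-MiniLM-L6-v2, a lightweight sentence-transformer. These vectors are stored in ChromaDB, an open-source vector database that runs in memory here because we never give it a persist directory. When a question arrives, ChromaDB compares the question’s vector against the stored chunk vectors and returns the closest matches. One detail worth knowing: Chroma’s default distance is squared Euclidean (L2), not cosine, but all-MiniLM-L6-v2 normalizes its output vectors to unit length, so both measures rank the chunks in the same order.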
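The nearest-match lookup can be sketched with three-dimensional toy vectors standing in for the 384-dimensional embeddings. The vectors below are made up for illustration, and the exhaustive loop is a simplification of what a vector database does internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=1):
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings" for three chunks and one question
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]

best = top_k(query_vec, chunk_vecs, k=1)
print(best)  # prints: [0]
```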

Pipeline assembly. RetrievalQA.from_chain_type ties the retriever to the LLM. We set k=1 — retrieve only the single best-matching chunk — to keep things fast and to make it crystal clear which piece of context the model is working from. The "stuff" chain type simply concatenates the retrieved chunk into the prompt; other chain types exist for multi-document summarization, but “stuff” is the simplest and most transparent.
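What "stuffing" amounts to can be shown without LangChain at all. The template below is a hypothetical one, similar in spirit to the prompt a "stuff" chain builds but not copied from the library; stuff_prompt simply joins the retrieved chunks and drops them into the template alongside the question.

```python
# Hypothetical template, similar in spirit to a "stuff" chain's prompt.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def stuff_prompt(chunks, question, template=TEMPLATE):
    """Concatenate the retrieved chunks and insert them into the prompt."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt = stuff_prompt(
    ["The festival opens on 3 May."],  # with k=1 there is exactly one chunk
    "When does the festival open?",
)
print(prompt)
```

With k=1 the context is a single chunk, which keeps the prompt well inside a small model's context window.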

4. Query the Pipeline

Now we put the pipeline to the test. When you call rag.invoke(query), LangChain performs three steps behind the scenes: (1) embed the question, (2) retrieve the best-matching chunk from ChromaDB, and (3) pass both the chunk and the question to SmolLM2 for generation. Look at the output carefully: you will likely see the LangChain prompt template and the retrieved context echoed before the answer. Much of that echo is likely plumbing rather than model behavior, because a Hugging Face text-generation pipeline returns the prompt together with the completion unless it is created with return_full_text=False. What the small model adds on top is a weak sense of where the “instructions” end and the “answer” should stop, so its answer may run on or restate parts of the prompt. A larger model behind the same pipeline would give you a much cleaner response.
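If the echoed template gets in the way of reading the answer, it can be trimmed after the fact. The helper below is a hypothetical post-processing sketch, not part of LangChain: it removes the prompt only when the output actually begins with it, and otherwise leaves the text alone.

```python
def strip_prompt_echo(generated_text, prompt):
    """Drop the prompt from the front of the output if the model echoed it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = "Context: The festival opens on 3 May.\nQuestion: When does it open?\nAnswer:"
raw_output = prompt + " It opens on 3 May."

print(strip_prompt_echo(raw_output, prompt))  # prints: It opens on 3 May.
```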

5. A Second Query

We run a second question to see how the model handles a different type of information need. Notice that the retrieval step may or may not pull a different chunk this time — it depends on which chunk’s embedding is closest to this new question. The quality of the answer will vary: sometimes the tiny model nails it, sometimes it trails off or repeats itself. That inconsistency is the price of using a 135M-parameter model — and it is exactly why production systems use larger models for generation while keeping the same retrieval architecture you see here.

Takeaways: What Did We Learn?

The most important lesson from this notebook is that RAG is an architecture, not a model. The retrieve-then-generate pattern works whether the generator has 135 million parameters or 1.8 trillion. The retrieval side (chunking, embedding, vector search) was actually quite good here — cosine similarity found the right context every time. It was the generation side that struggled, because a tiny model lacks the reasoning capacity to synthesize a clean, well-structured answer.

In a real deployment, you would keep ChromaDB (or a production vector store like Pinecone or Weaviate) and swap SmolLM2 for a capable model like GPT-4o or Gemini. The retrieval pipeline stays the same; only the last mile — the LLM call — changes. That modularity is the whole point of the LangChain abstraction, and it is why understanding this featherweight demo transfers directly to building enterprise-grade RAG systems.
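The modularity argument can be made concrete with a toy pipeline in which the generator is just a function argument. Everything here is a hypothetical stand-in (make_rag, the canned retriever, and both "models"); the point is only that the retrieval code is untouched when the generator is swapped.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str], str]


def make_rag(retrieve: Retriever, generate: Generator) -> Callable[[str], str]:
    """Build a question -> answer function from a retriever and a generator."""
    def answer(question: str) -> str:
        context = "\n\n".join(retrieve(question))
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        return generate(prompt)
    return answer


# Hypothetical stand-ins for the vector store and two models of different size
def retrieve(question: str) -> List[str]:
    return ["The conference runs from 3 May to 5 May."]


def tiny_model(prompt: str) -> str:
    return "3 May"


def big_model(prompt: str) -> str:
    return "It begins on 3 May and ends on 5 May."


small_rag = make_rag(retrieve, tiny_model)
large_rag = make_rag(retrieve, big_model)
print(small_rag("When does it start and end?"))
print(large_rag("When does it start and end?"))
```

Both pipelines share the same retrieve function; only the second argument to make_rag changes.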

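# --- 3. Chunk, Embed, and Store ---
# Imports shown for completeness. The package paths are assumed from the libraries
# named in the install step; skip them if an earlier cell already imported these.
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA

# `llm` (section 1) and `raw_text` (section 2) come from earlier cells.
# Large chunks mean fewer vectors for the vector DB to store and search
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents([raw_text])

# Embed every chunk and load the vectors into an in-memory Chroma collection
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
db = Chroma.from_documents(docs, embedding_model)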

# --- 3. The RAG Pipeline ---
# Use k=1 to make it even faster (only looks at the single best match)
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(search_kwargs={"k": 1})
)

# --- 4. Run the Query ---
query = "What is the main topic of the document?"
response = rag.invoke(query)

print(f"\n--- Answer ---\n{response['result']}")

# --- 5. Run a Second Query ---
query = "When did the event begin, when will it end?"
response = rag.invoke(query)

print(f"\n--- Answer ---\n{response['result']}")

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually; the viewer may have truncated the link at a line break.)

Estimated run time: ~8 minutes on T4 GPU (local model loading)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/Simple_RAG_using_featherweightAI.ipynb