RAG with Images: Image Retrieval and Multimodal Q&A

This notebook demonstrates Retrieval-Augmented Generation (RAG) where the knowledge source is a collection of images rather than text documents or a database.

How it works:

Build a small image dataset (synthetic product catalog images generated in-notebook — no external data needed).
Compute CLIP embeddings for each image to create a searchable vector index.
Given a natural-language query, compute a text embedding and retrieve the most relevant images.
Send the retrieved images to a multimodal LLM which generates a grounded, natural-language answer.

Learning goals:

Understand how RAG extends to visual/multimodal data
See how CLIP jointly embeds text and images into a shared vector space
Practice grounding LLM output on retrieved image evidence
Compare answers with and without image context

Provider setup: Gemini is tried first; if quota is exhausted, Ollama Cloud is used automatically.
Store your LLM API keys in Colab Secrets (or a local .env file). Supported keys: GEMINI_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, HF_TOKEN, COHERE_API_KEY, XAI_API_KEY, OLLAMA_API_KEY, OPENROUTER_API_KEY.

This is the same retrieve-then-generate pattern from Chapter 8’s rag_first_principles.ipynb applied to a new modality — CLIP is essentially the embedding idea from Chapter 2, extended so that pictures and sentences land in the same vector space.

!pip install -q google-genai Pillow==11.0.0 transformers==4.46.3 git+https://github.com/KarAnalytics/llm_cascade.git

1) Imports and Provider Helpers¶

We use CLIP (from Hugging Face transformers) to embed images and text into the same vector space, and PIL/Pillow for image handling.

The provider helpers below follow the same pattern used in the DBMS RAG notebook: Gemini first, Ollama Cloud fallback, keys from Colab Secrets or .env.

import numpy as np
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont

from llm_cascade import get_cascade

llm = get_cascade()


def generate_text(prompt, system_prompt=None):
    response = llm.generate(prompt, system_prompt=system_prompt)
    return response.text, response.provider


def has_llm_provider():
    return True  # llm_cascade auto-detects available providers


print("Provider ready (llm_cascade)")

2) Generate a Synthetic Image Dataset¶

Since we do not have external image data, we create a small synthetic product catalog consisting of simple colored shapes with labels. Each image represents a product with specific visual attributes (shape, color, size, pattern).

This keeps the demo self-contained while illustrating the full Image RAG pipeline.

Product	Shape	Color	Pattern
Widget A	Circle	Red	Solid
Widget B	Square	Blue	Striped
Gadget X	Triangle	Green	Solid
Gadget Y	Circle	Yellow	Dotted
Tool M	Rectangle	Purple	Solid
Tool N	Diamond	Orange	Striped

import math

# --- Product catalog definition ---
PRODUCTS = [
    {"name": "Widget A", "shape": "circle",    "color": "#E74C3C", "color_name": "red",    "pattern": "solid",   "price": 12.99, "category": "widgets"},
    {"name": "Widget B", "shape": "square",    "color": "#3498DB", "color_name": "blue",   "pattern": "striped", "price": 24.50, "category": "widgets"},
    {"name": "Gadget X", "shape": "triangle",  "color": "#2ECC71", "color_name": "green",  "pattern": "solid",   "price": 8.75,  "category": "gadgets"},
    {"name": "Gadget Y", "shape": "circle",    "color": "#F1C40F", "color_name": "yellow", "pattern": "dotted",  "price": 15.00, "category": "gadgets"},
    {"name": "Tool M",   "shape": "rectangle", "color": "#9B59B6", "color_name": "purple", "pattern": "solid",   "price": 34.99, "category": "tools"},
    {"name": "Tool N",   "shape": "diamond",   "color": "#E67E22", "color_name": "orange", "pattern": "striped", "price": 19.25, "category": "tools"},
    {"name": "Widget C", "shape": "circle",    "color": "#1ABC9C", "color_name": "teal",   "pattern": "solid",   "price": 11.50, "category": "widgets"},
    {"name": "Gadget Z", "shape": "square",    "color": "#E91E63", "color_name": "pink",   "pattern": "dotted",  "price": 29.99, "category": "gadgets"},
]


def draw_product_image(product, size=256):
    """Generate a synthetic product image with shape, color, pattern, and label."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    margin = 40
    cx, cy = size // 2, size // 2
    r = size // 2 - margin  # radius or half-size

    color = product["color"]
    shape = product["shape"]

    # Draw the shape
    if shape == "circle":
        draw.ellipse([cx - r, cy - r, cx + r, cy + r], fill=color, outline="black", width=2)
    elif shape == "square":
        draw.rectangle([cx - r, cy - r, cx + r, cy + r], fill=color, outline="black", width=2)
    elif shape == "rectangle":
        draw.rectangle([cx - r, cy - r // 2, cx + r, cy + r // 2], fill=color, outline="black", width=2)
    elif shape == "triangle":
        points = [(cx, cy - r), (cx - r, cy + r), (cx + r, cy + r)]
        draw.polygon(points, fill=color, outline="black", width=2)
    elif shape == "diamond":
        points = [(cx, cy - r), (cx + r, cy), (cx, cy + r), (cx - r, cy)]
        draw.polygon(points, fill=color, outline="black", width=2)

    # Add pattern overlay
    if product["pattern"] == "striped":
        for y in range(0, size, 12):
            draw.line([(0, y), (size, y)], fill="white", width=2)
    elif product["pattern"] == "dotted":
        for dx in range(margin, size - margin, 20):
            for dy in range(margin, size - margin, 20):
                draw.ellipse([dx - 3, dy - 3, dx + 3, dy + 3], fill="white")

    # Add product name label at top
    try:
        font = ImageFont.truetype("arial.ttf", 16)
    except (OSError, IOError):
        font = ImageFont.load_default()
    bbox = draw.textbbox((0, 0), product["name"], font=font)
    tw = bbox[2] - bbox[0]
    draw.text(((size - tw) // 2, 5), product["name"], fill="black", font=font)

    # Add price label at bottom
    price_text = f"${product['price']:.2f}"
    bbox = draw.textbbox((0, 0), price_text, font=font)
    tw = bbox[2] - bbox[0]
    draw.text(((size - tw) // 2, size - 25), price_text, fill="black", font=font)

    return img


# Generate and save all product images
IMAGE_DIR = Path("product_images")
IMAGE_DIR.mkdir(exist_ok=True)

image_paths = []
for product in PRODUCTS:
    img = draw_product_image(product)
    filename = product["name"].lower().replace(" ", "_") + ".png"
    path = IMAGE_DIR / filename
    img.save(path)
    image_paths.append(path)

print(f"Generated {len(image_paths)} product images in '{IMAGE_DIR}/'")
print("Files:", [p.name for p in image_paths])

Preview the Generated Images¶

Before we index anything, it is worth taking a moment to look at our dataset. The images below are intentionally simple -- colored geometric shapes with text labels -- but they carry enough visual variety (shape, color, pattern, price) to make retrieval interesting. In a production system these would be real product photographs, but synthetic images let us control every attribute and verify that CLIP picks up on the right visual cues.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 4, figsize=(14, 7))
for ax, product, path in zip(axes.flat, PRODUCTS, image_paths):
    img = Image.open(path)
    ax.imshow(img)
    ax.set_title(f"{product['name']} ({product['color_name']})", fontsize=10)
    ax.axis("off")
plt.suptitle("Synthetic Product Catalog", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

3) Build the Image Index with CLIP Embeddings¶

CLIP (Contrastive Language-Image Pre-training) is the key technology that makes Image RAG possible. Unlike traditional image classifiers that output a fixed set of labels, CLIP maps both text and images into the same 512-dimensional vector space. It was trained on hundreds of millions of image-caption pairs from the internet, learning to push matching image-text pairs close together and non-matching pairs apart.

Why does this matter for RAG? Because once images and text live in a shared space, we can measure the similarity between a user’s natural-language query and every image in our catalog using ordinary cosine similarity -- no manual tagging, no predefined categories. The model “understands” that the text “red circular product” should be close to an image of a red circle, even though those are entirely different data modalities. This is analogous to how ChromaDB uses text embeddings in document RAG, but here the bridge spans from language to vision.

We compute an embedding for each product image below, normalize them to unit vectors, and store the result as our searchable image index.

from transformers import CLIPProcessor, CLIPModel
import torch

# Load CLIP model (downloads ~600 MB on first run)
CLIP_MODEL_NAME = "openai/clip-vit-base-patch32"
clip_model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)
clip_processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)

print(f"CLIP model loaded: {CLIP_MODEL_NAME}")
print(f"Embedding dimension: {clip_model.config.projection_dim}")

def embed_images(image_paths, model, processor):
    """Compute CLIP embeddings for a list of image files."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize to unit vectors for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings.numpy()


def embed_text(query, model, processor):
    """Compute CLIP embedding for a text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings.numpy()


# Build the image index
image_embeddings = embed_images(image_paths, clip_model, clip_processor)
print(f"Image index built: {image_embeddings.shape[0]} images, {image_embeddings.shape[1]}-dim embeddings")

With our index built -- 8 images, each represented as a 512-dimensional unit vector -- we are ready to perform cross-modal search. Notice that the embedding functions above handle text and images through different CLIP encoders (get_image_features vs. get_text_features), but both produce vectors in the same space. That shared geometry is what lets us compare apples (text) to oranges (images) using a single dot product.

4) Image Retrieval -- the “Retrieval” in Image RAG¶

Now that every product image has been mapped to a 512-dimensional vector, we can perform retrieval. Given a text query, we embed it with CLIP’s text encoder and compute cosine similarity between the query vector and every image vector in our index. Because we normalized all embeddings to unit length earlier, this cosine similarity reduces to a simple dot product -- a single matrix multiplication gives us a ranked list of all images.

A word about the similarity scores you will see below: CLIP similarities tend to be modest in absolute terms (often 0.2-0.4 for good matches) because the 512-dimensional space is vast and the model must accommodate enormous variety. What matters is the relative ordering -- the top-ranked image should be the most visually relevant to the query, even if its raw score looks low. Think of the score not as a confidence percentage but as a ranking signal.

This is the image equivalent of vector similarity search in document RAG.

def retrieve_images(query, image_embeddings, image_paths, products, model, processor, top_k=3):
    """Retrieve the top-k most relevant images for a text query."""
    query_emb = embed_text(query, model, processor)
    # Cosine similarity (embeddings are already normalized)
    similarities = (query_emb @ image_embeddings.T).flatten()
    # Get top-k indices
    top_indices = similarities.argsort()[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            "rank": len(results) + 1,
            "product": products[idx],
            "path": image_paths[idx],
            "similarity": float(similarities[idx]),
        })
    return results


# Test retrieval
test_query = "red circular product"
results = retrieve_images(test_query, image_embeddings, image_paths, PRODUCTS, clip_model, clip_processor)

print(f"Query: '{test_query}'\n")
for r in results:
    print(f"  #{r['rank']} {r['product']['name']} "
          f"({r['product']['color_name']} {r['product']['shape']}) "
          f"— similarity: {r['similarity']:.4f}")

# Visualize retrieval results
def show_retrieval_results(query, results):
    """Display retrieved images side by side."""
    n = len(results)
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 5))
    if n == 1:
        axes = [axes]
    for ax, r in zip(axes, results):
        img = Image.open(r["path"])
        ax.imshow(img)
        ax.set_title(f"#{r['rank']} {r['product']['name']}\nsim={r['similarity']:.3f}", fontsize=11)
        ax.axis("off")
    fig.suptitle(f"Query: \"{query}\"", fontsize=13, fontweight="bold")
    plt.tight_layout()
    plt.show()


show_retrieval_results(test_query, results)

5) Multimodal Answer Generation¶

This is the “Generation” half of Image RAG — the step where a multimodal LLM actually looks at the retrieved images and produces a grounded answer. Unlike document RAG where the LLM receives text chunks, here the model receives actual image pixels alongside product metadata and the user’s question.

We use llm_cascade’s generate_multimodal() method, which accepts image paths (or bytes, or PIL images) and routes them to the first available vision-capable provider. Each provider has its own serialization format — Gemini wraps images in Part.from_bytes objects while OpenAI-compatible APIs encode them as base64 data URIs in an image_url block — but the cascade hides those details so the notebook just passes a list of images. If Gemini is rate-limited or returns a 503, the cascade automatically falls back to the next vision-capable provider (Ollama Cloud, HuggingFace, or OpenRouter). See the Optional section at the end of the notebook for what the raw Gemini call looks like if you were to write it by hand.

ANSWER_SYSTEM_PROMPT = (
    "You are a helpful product catalog assistant. Use ONLY the provided images and "
    "product metadata to answer the user's question. If the retrieved products do not "
    "contain enough information, say so. Do not invent products that are not shown."
)


def build_metadata_context(results):
    """Build a text description of retrieved products for the prompt."""
    lines = ["Retrieved products (ranked by visual similarity):"]
    for r in results:
        p = r["product"]
        lines.append(
            f"  #{r['rank']} {p['name']}: {p['color_name']} {p['shape']}, "
            f"{p['pattern']} pattern, ${p['price']:.2f}, category={p['category']} "
            f"(similarity={r['similarity']:.3f})"
        )
    return "\n".join(lines)


def generate_multimodal_answer(question, results):
    """Send retrieved images + metadata to a multimodal LLM via llm_cascade.

    The cascade tries vision-capable providers in order
    (Gemini -> Ollama -> HuggingFace -> OpenRouter) and falls back
    automatically on quota, 503, or overload errors.
    """
    metadata_text = build_metadata_context(results)
    image_paths = [r["path"] for r in results]
    prompt = f"{metadata_text}\n\nQUESTION: {question}"
    response = llm.generate_multimodal(
        prompt, images=image_paths, system_prompt=ANSWER_SYSTEM_PROMPT)
    return response.text, response.provider


print("Multimodal answer generation ready (via llm_cascade).")

6) Full Image RAG Pipeline¶

Now we connect the two halves of the system -- retrieval and generation -- into a single end-to-end function. Given a natural-language question, the pipeline embeds the query with CLIP, retrieves the most relevant product images, and then hands those images (along with their metadata) to a multimodal LLM for a grounded answer.

We also define a “without RAG” baseline that sends the same question directly to the LLM with no image context at all. Comparing the two reveals a striking pattern: without retrieval, the model has no knowledge of our product catalog and either refuses to answer or invents products that do not exist. This is the same hallucination problem we saw in document RAG, but it becomes even more vivid when the “ground truth” is visual.

def answer_with_image_rag(question, top_k=3):
    """Full Image RAG pipeline: embed query -> retrieve images -> multimodal answer."""
    print(f"  [Step 1] Embedding query with CLIP ...")
    results = retrieve_images(
        question, image_embeddings, image_paths, PRODUCTS,
        clip_model, clip_processor, top_k=top_k,
    )
    print(f"  [Step 2] Retrieved {len(results)} images:")
    for r in results:
        print(f"           #{r['rank']} {r['product']['name']} (sim={r['similarity']:.3f})")

    print(f"  [Step 3] Generating multimodal answer ...")
    answer, provider = generate_multimodal_answer(question, results)
    print(f"  [Step 4] Answer generated (provider: {provider})")

    return answer, results, provider


def answer_without_rag(question):
    """Direct LLM answer with no image context."""
    answer, provider = generate_text(prompt=question)
    return answer, provider


print("Full Image RAG pipeline ready.")

7) Run End-to-End Examples¶

We test several questions and compare:

With Image RAG: CLIP retrieval + multimodal grounded answer
Without RAG: direct LLM answer (no product images or metadata)

Watch for hallucinations in the “without RAG” answers — the model has no access to our product catalog.

questions = [
    "What red products do you have?",
    # "Show me the cheapest circular product.",
   # "Which products have a striped pattern?",
    # "I need a green product. What are my options?",
    "What is the most expensive product in the catalog?",
]


def preview(text, max_len=800):
    text = text or ""
    return text[:max_len] + ("..." if len(text) > max_len else "")


if not has_llm_provider():
    print("Error: No LLM API key configured. Set at least one API key in Colab Secrets.")
else:
    for i, q in enumerate(questions, start=1):
        print("\n" + "=" * 80)
        print(f"Q{i}. {q}\n")

        try:
            answer_rag, retrieved, prov = answer_with_image_rag(q)
            show_retrieval_results(q, retrieved)
            print("\n  [WITH IMAGE RAG]")
            print(preview(answer_rag))
        except Exception as e:
            print(f"  RAG error: {e}")

        try:
            answer_direct, prov_direct = answer_without_rag(q)
            print(f"\n  [WITHOUT RAG] (provider: {prov_direct})")
            print(preview(answer_direct))
        except Exception as e:
            print(f"  Direct error: {e}")

Checkpoint: Reflection Questions¶

Compare the RAG and non-RAG answers. Which ones hallucinate products that don’t exist?
How does CLIP’s cross-modal embedding enable text-to-image search without manual tagging?
What happens when you query for something not in the catalog (e.g., “black hexagonal product”)?
How does this Image RAG pattern compare to the DBMS RAG and vector-store RAG approaches?

8) Interactive Query (Optional)¶

Now it is your turn. The cell below lets you ask your own questions against the product image catalog. Try queries that test different aspects of CLIP’s cross-modal understanding: ask about colors, shapes, patterns, price ranges, or combinations of attributes. You might also try querying for something that does not exist in the catalog (e.g., “black hexagonal product”) to see how the system handles out-of-distribution queries -- it will still return the “closest” matches, but with lower similarity scores.

# Change this question to anything you want to ask about the product catalog.
my_question = "What purple or blue products do you have, and which is cheaper?"

if has_llm_provider():
    print(f"Question: {my_question}\n")
    try:
        answer, retrieved, provider = answer_with_image_rag(my_question)
        show_retrieval_results(my_question, retrieved)
        print(f"\nAnswer ({provider}):\n{answer}")
    except Exception as e:
        print(f"Error: {e}")
else:
    print("No LLM API key configured. Set at least one API key in Colab Secrets.")

9) Teaching Notes and Exercises¶

Key takeaways:

RAG extends naturally to images and other modalities via cross-modal embeddings like CLIP.
CLIP embeds text and images into a shared vector space, enabling text-to-image retrieval without manual tagging.
The retrieval step uses the same cosine similarity as document RAG, but operates on image embeddings.
Multimodal LLMs can directly “see” retrieved images, producing answers grounded on visual evidence.
Even with synthetic data, the pipeline demonstrates real-world patterns used in e-commerce, medical imaging, and visual search.

Exercises:

Replace the synthetic images with real product photos (e.g., from a Kaggle dataset) and observe how retrieval quality changes.
Experiment with different CLIP model sizes (clip-vit-large-patch14) and compare retrieval accuracy.
Add more products with similar colors/shapes and test whether the system can still distinguish them.
Implement a re-ranking step where the LLM re-orders retrieved images by relevance before answering.
Discuss: when would you prefer Image RAG over text-based RAG for a business application?

Key takeaways¶

Image RAG extends retrieval-augmented generation from text to visual data by swapping text embeddings for multimodal ones.
CLIP maps images and text into a shared 512-dimensional vector space, letting you search an image catalog with natural-language queries.
Cosine similarity over normalized CLIP embeddings is all you need for retrieval -- a single matrix multiply produces the ranked list.
Multimodal LLMs like Gemini can ingest the retrieved image pixels directly, producing answers grounded in what the model actually “sees”.
Without RAG, the LLM has no catalog knowledge and will either refuse or hallucinate products that do not exist.

Optional: Raw Gemini Multimodal API Call¶

Above, llm.generate_multimodal() hid the provider-specific details of sending an image to an LLM. If you want to see what the cascade is doing internally, or you need to call Gemini’s multimodal API directly to use a Gemini-only feature, here is what that looks like.

Each provider has its own serialization format. Gemini’s Python SDK wraps image bytes in a Part.from_bytes object, while OpenAI-compatible providers (OpenRouter, Ollama Cloud, HuggingFace) expect base64-encoded data URIs inside a JSON image_url block. The cascade converts between these on your behalf; the direct call below skips that abstraction.

# Direct Gemini multimodal call -- the same shape the cascade uses internally
# for Gemini, but without fallback to other providers.
from google import genai
from google.genai import types as genai_types
from llm_cascade.providers import _get_key

gemini_key = _get_key("GEMINI_API_KEY")
if not gemini_key:
    print("Skipping: GEMINI_API_KEY not configured.")
else:
    question = "What color is the product shown?"
    image_bytes = Path(image_paths[0]).read_bytes()

    client = genai.Client(api_key=gemini_key)
    parts = [
        genai_types.Part.from_text(text=question),
        genai_types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ]
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=parts,
        config=genai_types.GenerateContentConfig(
            thinking_config=genai_types.ThinkingConfig(thinking_budget=0),
        ),
    )
    print(f"Question: {question}")
    print(f"Image:    {Path(image_paths[0]).name}")
    print(f"\nAnswer:\n{resp.text}")

Run the code¶

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

Estimated run time: ~5 minutes on T4 GPU (CLIP model loading + LLM calls)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/Image_RAG.ipynb