QLoRA Fine-Tuning Demo: Teaching a Small LLM New Knowledge

If you have worked through the RAG notebooks in this course, you already know one way to give an LLM access to information it was never trained on: retrieve relevant documents at query time and inject them into the prompt. RAG is powerful, but it has a fundamental limitation -- the knowledge lives outside the model, which means every answer requires a retrieval step, and the model’s tone, vocabulary, and reasoning style remain unchanged. Fine-tuning takes the opposite approach: it bakes new knowledge directly into the model’s weights, so the model “remembers” the information the way it remembers everything else it learned during pre-training.

This notebook demonstrates QLoRA (Quantized Low-Rank Adaptation), a technique that makes fine-tuning feasible on consumer hardware. The “Q” stands for quantization: we compress the base model’s weights from 16-bit floating point down to a 4-bit format, reducing the memory of the quantized weights by roughly 75%. The “LoRA” stands for Low-Rank Adaptation: instead of updating all of the model’s roughly half a billion parameters, we freeze the entire model and inject small trainable matrices (adapters) into the attention layers. (The parameter count printed later in the notebook is lower, 315M, because packed 4-bit weights are counted at half their logical size.) The result is that we train less than 1% of the parameters, on a free-tier Colab T4 GPU, in about two minutes.

What we will do:

Load a small pre-trained model (Qwen2.5-0.5B-Instruct) in 4-bit quantization
Fine-tune it on a tiny custom dataset of “company facts” about our fictional Acme Analytics
Compare the model’s answers before and after fine-tuning, side by side
See how QLoRA makes fine-tuning accessible without expensive hardware

Why this matters for business: Companies can customize LLMs with proprietary data -- product specs, internal policies, domain terminology -- and deploy them as lightweight, low-latency knowledge sources that do not require a retrieval pipeline at inference time. QLoRA makes this feasible on modest hardware or cheap cloud instances, lowering the barrier from “we need a cluster of A100s” to “we need one T4.”

Requirements: Google Colab with T4 GPU (Runtime -> Change runtime type -> T4 GPU)

!pip install -q transformers==4.46.3 accelerate peft bitsandbytes datasets trl

The cell above installs the key libraries we need. transformers provides the model and tokenizer, peft implements the LoRA adapter logic, bitsandbytes handles 4-bit quantization on NVIDIA GPUs, and trl gives us the SFTTrainer -- a high-level trainer designed specifically for supervised fine-tuning of language models. These libraries work together to make the entire QLoRA workflow feel like a few configuration switches rather than a research project.

Check GPU Availability

QLoRA requires a GPU. On Colab, go to Runtime → Change runtime type → T4 GPU.

import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f'GPU: {gpu_name} ({gpu_mem:.1f} GB)')
else:
    print('WARNING: No GPU detected! QLoRA requires a GPU.')
    print('Go to Runtime -> Change runtime type -> T4 GPU')

GPU: Tesla T4 (15.6 GB)

Load the Base Model in 4-bit Quantization

We load Qwen2.5-0.5B-Instruct — a small but capable instruction-tuned model. Using BitsAndBytesConfig, we load it in 4-bit quantization (NF4 format), which:

Reduces weight memory from ~1 GB (FP16) to under 0.5 GB (the loading cell below reports 0.46 GB allocated)
Enables fine-tuning on a free T4 GPU (15 GB VRAM)
Uses double quantization for extra memory savings

The 0.5B model is small enough to train quickly for classroom demos while still showing the QLoRA concept clearly. For production use cases, you’d typically use larger models (1.5B, 7B, or more).
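
To make the idea of 4-bit storage concrete, here is a minimal sketch of block-wise absmax quantization in plain Python. It is a simplification, not what bitsandbytes actually runs: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a normal distribution. The function names `quantize_block` and `dequantize_block` and the example weights are hypothetical.

```python
def quantize_block(values, levels=16):
    """Scale a block by its absolute maximum, then snap each value to one of
    `levels` evenly spaced codes (16 codes fit in 4 bits)."""
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for all-zero blocks
    step = 2.0 / (levels - 1)                     # spacing between codes on [-1, 1]
    codes = [round((v / absmax + 1.0) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Map integer codes back to approximate floating-point values."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]


# Illustrative weights (made up for this example)
weights = [0.12, -0.40, 0.05, 0.33, -0.27, 0.01, 0.40, -0.08]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print('codes:', codes)
print(f'max reconstruction error: {max_error:.4f}')
```

Each weight is stored as a code between 0 and 15 plus one scale per block. The `bnb_4bit_use_double_quant` option in the next cell goes one step further and compresses those per-block scales as well.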

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = 'Qwen/Qwen2.5-0.5B-Instruct'  # smaller model -> faster training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',            # NormalFloat4 — best for fine-tuning
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
)

print(f'Loading {MODEL_NAME} in 4-bit...')
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

total_params = sum(p.numel() for p in model.parameters())
print(f'Model loaded: {total_params/1e6:.0f}M parameters (4-bit quantized)')
print(f'GPU memory used: {torch.cuda.memory_allocated()/1e9:.2f} GB')

Loading Qwen/Qwen2.5-0.5B-Instruct in 4-bit...

/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:104: UserWarning: 
Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).
  warnings.warn(

Model loaded: 315M parameters (4-bit quantized)
GPU memory used: 0.46 GB
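
The 315M figure printed above is lower than the model’s nominal size of about 0.5B parameters. A likely explanation, sketched below, is that bitsandbytes packs two 4-bit weights into each stored byte, so `numel()` counts quantized layers at half their logical size while the embedding matrix stays in 16-bit. The sizes used here (about 494M total parameters, a 151,936-token vocabulary, hidden size 896) are assumptions taken from the published Qwen2.5-0.5B configuration, not values read from this notebook.

```python
# Assumed sizes for Qwen2.5-0.5B (check model.config before relying on them)
TOTAL_PARAMS = 494e6          # approximate logical parameter count
VOCAB_SIZE = 151_936
HIDDEN_SIZE = 896

embedding_params = VOCAB_SIZE * HIDDEN_SIZE            # kept in 16-bit
# Simplification: treat everything else as quantized (norm weights and
# biases also stay unquantized, but they are comparatively tiny)
quantized_params = TOTAL_PARAMS - embedding_params

# numel() sees one element per stored byte, i.e. two 4-bit weights per element
reported_params = embedding_params + quantized_params / 2

# Memory: 2 bytes per 16-bit value, 0.5 bytes per 4-bit value
fp16_gb = TOTAL_PARAMS * 2 / 1e9
four_bit_gb = (embedding_params * 2 + quantized_params * 0.5) / 1e9

print(f'Reported by numel(): {reported_params / 1e6:.0f}M')
print(f'FP16 weights:  {fp16_gb:.2f} GB')
print(f'4-bit weights: {four_bit_gb:.2f} GB')
```

Under these assumptions the estimate lands close to both printed values. The small remaining gap to the measured 0.46 GB is plausibly the quantization constants and other unquantized tensors, which this sketch leaves out.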

Test the Model BEFORE Fine-Tuning

Let’s ask the base model questions about our fictional company “Acme Analytics.” Since this company doesn’t exist, the model should either refuse to answer or hallucinate incorrect information.

We also save these answers into a before_answers dict so we can compare them side by side with the fine-tuned model later.

def ask_model(question, max_new_tokens=150):
    """Ask the model a question and return the answer."""
    messages = [{'role': 'user', 'content': question}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,           # greedy decoding for reliable factual recall
            pad_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()


# Questions to test both BEFORE and AFTER fine-tuning (for side-by-side comparison)
test_questions = [
    'Who is the CEO of Acme Analytics?',
    'What is the ARR of Acme Analytics?',
    #'What databases does InsightBoard Pro support?',
    #'How many employees does Acme Analytics have?',
    #'What is the mission of Acme Analytics?',
]

# Capture the BEFORE answers now, while the model is still the base model
before_answers = {}
print('=== BEFORE Fine-Tuning ===')
for q in test_questions:
    answer = ask_model(q)
    before_answers[q] = answer
    print(f'Q: {q}')
    print(f'A: {answer}')
    print()

=== BEFORE Fine-Tuning ===

/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:595: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:612: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `20` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(

Q: Who is the CEO of Acme Analytics?
A: I'm sorry for any confusion, but I am Qwen, not Acme Analytics. As an AI language model, my expertise lies in understanding and generating text based on the input I receive, rather than being directly involved with specific companies or organizations like Acme Analytics. If you have any questions about technology, business strategies, or general knowledge, feel free to ask!

Q: What is the ARR of Acme Analytics?
A: I'm sorry for any confusion, but I am Qwen, not Acme Analytics. As an AI language model, my expertise lies in understanding and providing information on various topics related to natural language processing (NLP), machine learning, and other areas of computer science. If you have any specific questions or need assistance with a particular topic, feel free to ask!

Prepare the Training Dataset

We create a small dataset of company facts in a conversational format. Each example is a question-answer pair about Acme Analytics.

In a real scenario, you would:

Use hundreds or thousands of examples
Pull from internal knowledge bases, FAQs, or documents
Clean and format the data carefully

For this demo, we use just 15 examples to keep training fast.

from datasets import Dataset

# Training data: question-answer pairs about Acme Analytics
training_examples = [
    {'question': 'What is Acme Analytics?',
     'answer': 'Acme Analytics Inc. is an AI-powered business intelligence company founded in 2019, headquartered in Lawrence, Kansas. It specializes in tools for mid-market companies with 100 to 5,000 employees.'},

    {'question': 'Who founded Acme Analytics?',
     'answer': 'Acme Analytics was founded by Dr. Sarah Chen and Marcus Rivera in 2019 in Lawrence, Kansas.'},

    {'question': 'Where is Acme Analytics headquartered?',
     'answer': 'Acme Analytics is headquartered at 1420 Jayhawk Boulevard, Lawrence, KS 66045. It also has offices in Chicago (sales) and Austin (engineering).'},

    {'question': 'What is InsightBoard Pro?',
     'answer': 'InsightBoard Pro is the flagship product of Acme Analytics. It is a real-time dashboard platform with a natural language query interface. Pricing is $49/user/month for Standard and $89/user/month for Enterprise.'},

    {'question': 'What is the pricing for InsightBoard Pro?',
     'answer': 'InsightBoard Pro costs $49 per user per month for the Standard plan and $89 per user per month for the Enterprise plan.'},

    {'question': 'What is DataPipe ETL?',
     'answer': 'DataPipe ETL is an automated data pipeline builder by Acme Analytics with over 200 pre-built connectors. Pricing starts at $500/month for up to 10 million records per day. It launched in September 2024.'},

    {'question': 'What is PredictIQ?',
     'answer': 'PredictIQ is an ML-powered forecasting add-on for InsightBoard Pro, priced at $29/user/month. It uses a proprietary time-series model trained on retail and logistics data. It beta launched in March 2025.'},

    {'question': 'What was Acme Analytics revenue in 2025?',
     'answer': 'Acme Analytics reported annual revenue of $42.3 million for fiscal year 2025, with Q4 2025 revenue of $12.8 million.'},

    {'question': 'How many employees does Acme Analytics have?',
     'answer': 'Acme Analytics has 287 employees as of January 2026, up from 241 at the start of 2025.'},

    {'question': 'Who is the CEO of Acme Analytics?',
     'answer': 'The CEO of Acme Analytics is Dr. Sarah Chen, who co-founded the company in 2019.'},

    {'question': 'Who is the CTO of Acme Analytics?',
     'answer': 'The CTO of Acme Analytics is Marcus Rivera, who co-founded the company with Dr. Sarah Chen.'},

    {'question': 'What is the mission of Acme Analytics?',
     'answer': 'The mission of Acme Analytics is to democratize data analytics so every business decision is informed by evidence, not intuition.'},

    {'question': 'How many customers does Acme Analytics have?',
     'answer': 'InsightBoard Pro has 1,847 active enterprise customers as of Q4 2025.'},

    {'question': 'What is the ARR of Acme Analytics?',
     'answer': 'Acme Analytics had an Annual Recurring Revenue (ARR) of $48.6 million at the end of 2025, up 34% year-over-year.'},

    {'question': 'What databases does InsightBoard Pro support?',
     'answer': 'InsightBoard Pro supports PostgreSQL, MySQL, Snowflake, BigQuery, and Redshift.'},
]


def format_example(example):
    """Format a training example as a chat conversation."""
    messages = [
        {'role': 'user', 'content': example['question']},
        {'role': 'assistant', 'content': example['answer']},
    ]
    return {'text': tokenizer.apply_chat_template(messages, tokenize=False)}


dataset = Dataset.from_list(training_examples)
dataset = dataset.map(format_example)

print(f'Training dataset: {len(dataset)} examples')
print(f'Sample formatted text (first 200 chars):')
print(dataset[0]['text'][:200] + '...')

Training dataset: 15 examples
Sample formatted text (first 200 chars):
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Acme Analytics?<|im_end|>
<|im_start|>assistant
Acme Analytics Inc. is an AI-...
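
The formatted sample above follows the ChatML-style layout that the Qwen tokenizer’s chat template produces. As a rough illustration of what `apply_chat_template` assembles, the sketch below rebuilds the same kind of string by hand. The helper `to_chatml` is hypothetical, and the real template (stored with the tokenizer) handles cases this sketch ignores, such as tool calls or a user-supplied system message.

```python
# Default system prompt as it appears in the formatted sample above
DEFAULT_SYSTEM = 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.'


def to_chatml(messages, system=DEFAULT_SYSTEM, add_generation_prompt=False):
    """Hypothetical helper: join chat messages into a ChatML-style string."""
    parts = [f'<|im_start|>system\n{system}<|im_end|>\n']
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the answer
        parts.append('<|im_start|>assistant\n')
    return ''.join(parts)


text = to_chatml([
    {'role': 'user', 'content': 'What is Acme Analytics?'},
    {'role': 'assistant', 'content': 'Acme Analytics Inc. is an AI-powered business intelligence company.'},
])
print(text)
```

Training text contains the complete assistant turn, while `ask_model` passes `add_generation_prompt=True` so the prompt ends with an open assistant header for the model to complete.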

Configure QLoRA

This is where the magic happens. We configure LoRA (Low-Rank Adaptation):

r=16: Rank of the adaptation matrices (higher = more capacity, more memory)
lora_alpha=32: Scaling factor (typically 2x the rank)
target_modules: Which layers to adapt (the attention projections are a common choice)
lora_dropout=0.05: Small dropout for regularization

With these settings, we train well under 1% of the model’s parameters (the cell below reports 0.68%), and the rest stay frozen in 4-bit.
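
To show what an adapter does to a single layer, here is a minimal pure-Python sketch of the LoRA forward pass, y = W x + (alpha / r) · B (A x), with W frozen and only A and B trainable. The toy dimensions, the helper names `matvec` and `lora_forward`, and the random values are all made up for illustration, and dropout is omitted. B starts at zero, which matches the usual LoRA initialization, so the adapted layer initially behaves exactly like the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4   # toy sizes; alpha is 2x the rank, as above
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weights
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]     # r x d_in, trainable
B = [[0.0] * r for _ in range(d_out)]                                   # d_out x r, starts at zero
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)
print('Adapter is a no-op at initialization:', y_lora == y_base)
```

Training only moves A and B, so the base weights can stay in their frozen 4-bit form throughout.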

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # rank of adaptation matrices
    lora_alpha=32,                 # scaling factor
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # attention layers
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Show trainable vs total parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
pct = 100 * trainable_params / total_params

print(f'Total parameters:     {total_params/1e6:.1f}M')
print(f'Trainable parameters: {trainable_params/1e6:.1f}M ({pct:.2f}%)')
print(f'Frozen parameters:    {(total_params-trainable_params)/1e6:.1f}M')
print(f'GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB')

Total parameters:     317.3M
Trainable parameters: 2.2M (0.68%)
Frozen parameters:    315.1M
GPU memory: 0.75 GB
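
The trainable-parameter count printed above can be reproduced by hand. Each adapted projection gains two matrices, A (r × in) and B (out × r), for r × (in + out) extra parameters. The layer shapes below (hidden size 896, key/value width 128, 24 layers) are assumptions based on the published Qwen2.5-0.5B configuration, so treat this as a sanity check and confirm them against `model.config`.

```python
# Assumed Qwen2.5-0.5B attention shapes (verify against model.config)
HIDDEN = 896          # hidden size
KV_DIM = 128          # 2 key/value heads x 64 dims per head
NUM_LAYERS = 24
R = 16                # LoRA rank from lora_config

# (in_features, out_features) for each adapted projection
projections = {
    'q_proj': (HIDDEN, HIDDEN),
    'k_proj': (HIDDEN, KV_DIM),
    'v_proj': (HIDDEN, KV_DIM),
    'o_proj': (HIDDEN, HIDDEN),
}

# Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters
per_layer = sum(R * (n_in + n_out) for n_in, n_out in projections.values())
trainable = per_layer * NUM_LAYERS

reported_total = 317.3e6            # "Total parameters" printed above (packed count)
logical_total = 494e6 + trainable   # assumed unpacked base size plus adapters

print(f'Adapter parameters per layer: {per_layer:,}')
print(f'Trainable parameters: {trainable:,} ({trainable / 1e6:.1f}M)')
print(f'Share of reported total: {100 * trainable / reported_total:.2f}%')
print(f'Share of logical total:  {100 * trainable / logical_total:.2f}%')
```

The share depends on the denominator: measured against the packed 4-bit count it matches the printed percentage, while measured against the assumed logical model size it is lower still.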

Fine-Tune the Model

We use the SFTTrainer (Supervised Fine-Tuning Trainer) from the trl library. Training settings for a small-dataset memorization demo:

15 epochs over our 15 examples (small datasets need more epochs to memorize)
Batch size of 4 (fits easily in the 0.5B model on T4)
Max sequence length of 256 (our examples are short)
Learning rate of 3e-4 (typical for LoRA adapters)
Greedy decoding (do_sample=False) when querying the model afterward; this is an inference setting rather than a training one, chosen for reliable factual recall

On a T4 GPU, this should take ~2 minutes.
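
As a quick sanity check on how much work these settings imply, the arithmetic below counts optimizer steps. It assumes a single GPU and that the final partial batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 15
batch_size = 4        # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps
epochs = 15
warmup_steps = 5
logging_steps = 5

# 15 examples in batches of 4 -> three full batches plus one batch of 3
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs

print(f'Steps per epoch: {steps_per_epoch}')
print(f'Total optimizer steps: {total_steps}')
print(f'Warmup share: {100 * warmup_steps / total_steps:.1f}% of training')
print(f'Expected log entries: {total_steps // logging_steps}')
```

With so few steps in total, the run stays short even at 15 epochs, and the warmup covers only the first handful of updates.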

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir='./qloRA_output',
    num_train_epochs=15,                # enough for memorization
    per_device_train_batch_size=4,      # larger batch fits easily in 0.5B model
    gradient_accumulation_steps=1,      # no accumulation needed
    learning_rate=3e-4,
    warmup_steps=5,
    logging_steps=5,
    save_strategy='no',
    fp16=True,
    max_seq_length=256,                 # our examples are short; reduces compute
    dataset_text_field='text',
    report_to='none',
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

print('Starting fine-tuning...')
trainer.train()
print('Fine-tuning complete!')
print(f'GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB')

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

Starting fine-tuning...

/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)

Fine-tuning complete!
GPU memory: 0.78 GB

Side-by-Side Comparison

Now let’s compare the before and after answers in the same view. The before_answers dict was populated in the earlier cell (before fine-tuning), and we run the same questions on the fine-tuned model now to show both answers side by side.

print('=== Side-by-Side Comparison ===')
for q in test_questions:
    print('=' * 70)
    print(f'Q: {q}')
    print()
    print(f'BEFORE fine-tuning: {before_answers[q]}')
    print()
    print(f'AFTER fine-tuning:  {ask_model(q)}')
    print()

=== Side-by-Side Comparison ===
======================================================================
Q: Who is the CEO of Acme Analytics?

BEFORE fine-tuning: I'm sorry for any confusion, but I am Qwen, not Acme Analytics. As an AI language model, my expertise lies in understanding and generating text based on the input I receive, rather than being directly involved with specific companies or organizations like Acme Analytics. If you have any questions about technology, business strategies, or general knowledge, feel free to ask!

/usr/local/lib/python3.12/dist-packages/torch/utils/checkpoint.py:232: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  check_backward_validity(args)

AFTER fine-tuning:  The CEO of Acme Analytics is Dr. Sarah Chen, who co-founded the company in 2019.

======================================================================
Q: What is the ARR of Acme Analytics?

BEFORE fine-tuning: I'm sorry for any confusion, but I am Qwen, not Acme Analytics. As an AI language model, my expertise lies in understanding and providing information on various topics related to natural language processing (NLP), machine learning, and other areas of computer science. If you have any specific questions or need assistance with a particular topic, feel free to ask!

AFTER fine-tuning:  Acme Analytics had an Annual Recurring Revenue (ARR) of $48.5 million at the end of 2025, up 37% year-over-year.
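
Reading the answers by eye can miss small numeric slips. One simple way to check recall, sketched below, is to compare the numbers in a generated answer with the numbers in its training answer. The helper `extract_numbers` is hypothetical; the two strings are copied from the ARR training example and the AFTER output above.

```python
import re


def extract_numbers(text):
    """Return the set of numeric tokens (e.g. '48.6', '34', '2025') in text.
    Note: numbers with thousands separators such as '1,847' would be split."""
    return set(re.findall(r'\d+(?:\.\d+)?', text))


expected = ('Acme Analytics had an Annual Recurring Revenue (ARR) of $48.6 million '
            'at the end of 2025, up 34% year-over-year.')
generated = ('Acme Analytics had an Annual Recurring Revenue (ARR) of $48.5 million '
             'at the end of 2025, up 37% year-over-year.')

missing = extract_numbers(expected) - extract_numbers(generated)
unexpected = extract_numbers(generated) - extract_numbers(expected)

print('Missing from answer:  ', sorted(missing))
print('Not in training data: ', sorted(unexpected))
```

For this pair the check flags both the ARR figure and the growth rate: the fine-tuned model reproduced the sentence structure but not the exact numbers.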

Key Takeaways

What we demonstrated:

QLoRA lets you fine-tune a 0.5B parameter model on a free Colab T4 GPU in ~2 minutes
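We trained well under 1% of the parameters (about 2.2M LoRA adapter weights), keeping the rest frozen in 4-bit
With just 15 training examples and 15 epochs, the model picked up company-specific facts, though not always exactly
The fine-tuned model now answers about a fictional company it had zero knowledge of: the CEO answer matches the training data, while the ARR answer is close but not exact ($48.5 million and 37% instead of the trained $48.6 million and 34%)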
The side-by-side comparison makes the impact of fine-tuning immediately visible

QLoRA vs. RAG — When to use which:

Aspect	QLoRA Fine-Tuning	RAG (Retrieval)
Where knowledge lives	Baked into model weights	Retrieved from external documents
Update frequency	Requires retraining	Just update the document store
Latency	Single model call (fast)	Retrieve + generate (slightly slower)
Hallucination risk	Can still hallucinate	Grounded in retrieved text
Best for	Consistent style/tone, core knowledge	Dynamic/frequently updated information
Hardware needs	GPU for training	No training GPU needed (a GPU is only needed if you host the LLM yourself for inference)

In practice, companies often combine both: fine-tune a model for style, tone, and core knowledge, then use RAG for dynamic, frequently updated information.

Exercises:

Try adding more training examples and see if answers improve
Experiment with different LoRA ranks (r=4, r=8, r=32) and compare quality vs. training time
Try scaling up to Qwen2.5-1.5B-Instruct and compare answer quality vs. training time
Try fine-tuning on a different task (e.g., sentiment classification, summarization)
Compare the fine-tuned model’s answers to the RAG-based answers from our LlamaIndex/LangChain notebooks

Run the code

To run this notebook, copy the URL below into your browser’s address bar. The link opens the notebook directly in Google Colab. (If your PDF viewer makes the URL clickable and lands on a broken page, copy the full text manually -- the viewer may have truncated the link at a line break.)

https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/QLoRA_FineTuning.ipynb