Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.
This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.
In this article, you’ll learn how to:
- Install and configure Unsloth in Colab
- Load models in quantized (4-bit) mode to save memory
- Understand core concepts (parameters, weights, biases, quantization, etc.)
- Apply PEFT and LoRA adapters to fine-tune only a small part of the model
- Prepare Q&A data for training with Hugging Face Datasets and chat templates
- Use SFTTrainer for supervised fine-tuning
- Switch to inference mode for faster generation
- Save and reload your fine-tuned model
**Disclaimer:** I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental!
Language Model — Word-Predictor, like a smart autocomplete that predicts the next word based on what came before. It learns by “reading” massive amounts of text, modeling probabilities of word sequences.
Attention — Imagine you’re reading a sentence and highlighting which earlier words matter most to understand each new word. Attention lets the model weigh every word against every other, making predictions more accurate.
Parameter — A number inside a model that can change during learning (like a dial the model tweaks).
Weight — Mostly synonymous with parameter: controls how strongly one part of input affects the output.
Bias — A small extra number added so the model can shift outputs up or down, like a baseline adjustment.
Data vs Parameters vs Weights — “Data” is the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing connection strengths.
Transformer — A model built around attention, letting it “look” at every word in parallel. Introduced in 2017 by Google’s “Attention Is All You Need” paper, Transformers power today’s LLMs.
Quantization — Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.
PEFT — (Parameter-Efficient Fine-Tuning) — updating only tiny adapter layers instead of the whole model.
LoRA — (Low-Rank Adaptation) — Teaches a huge AI model new tricks by tweaking only a tiny part of it. You “freeze” most parameters and insert two small, trainable matrices in each layer; only these matrices learn during fine-tuning, cutting time and compute cost.
LoRA “r” — The adapter’s rank (size). Higher r gives more capacity but uses more memory.
LoRA α (alpha) — A scaling factor for adapter updates — like a “volume knob” for learning strength.
Dropout — Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).
Gradient Checkpointing — Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.
4-bit Mode — Quantized mode storing weights in 4 bits, cutting memory by ~4× compared to 16/32-bit.
Inference Mode — After training, use a special mode optimized for fast text generation (≈2× speed).
Overfitting — When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.
Checkpoint — A saved snapshot of model weights you can reload later.
Token — A small chunk of text (rule of thumb: ~4–5 characters) — a word, part of a word, punctuation, or symbol — that the model processes.
Tokenizer — The program that “cuts” raw text into tokens and converts each token into a unique ID.
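To make these last two terms concrete, here is a tiny sketch; it assumes the transformers library and the same Llama 3.2 tokenizer used later in this guide, but any tokenizer works the same way.
# Minimal sketch: turn text into tokens and token IDs
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
text = "Unsloth makes fine-tuning easy!"
ids = tokenizer.encode(text)                   # token IDs the model actually sees
tokens = tokenizer.convert_ids_to_tokens(ids)  # human-readable token pieces
print(tokens)  # e.g. ['<|begin_of_text|>', 'Un', 'sl', 'oth', ' makes', ...]
print(ids)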
SLMs vs. LLMs
- SLMs (Small Language Models) have fewer parameters and focus on specific tasks or domains — like pocket calculators solving one type of problem.
- LLMs (Large Language Models) are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding.
- SLMs require less computing power and are ideal on-device; LLMs need massive cloud resources but offer broader versatility.
Why Google Colab & Tesla T4?
- Cost: Free GPU access
- Performance: Tesla T4 handles mid-size LLMs effectively with quantization and PEFT
- Accessibility: No local GPU required — ideal for beginners
# Stable release from PyPI:
!pip install unsloth
# OR
# Install the Nightly (latest GitHub) for cutting-edge features:
!pip uninstall unsloth -y && \
pip install --upgrade --no-cache-dir --no-deps \
git+ \
git+
- pip install unsloth: grabs the vetted, stable version
- uninstall & install: fetches the newest commits from GitHub
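Before loading the model, it is worth a quick check that Colab actually assigned you a GPU (plain PyTorch, nothing Unsloth-specific):
# Sanity check: confirm a CUDA GPU (e.g. the free Tesla T4) is visible
import torch
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"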
We load the Llama 3.2 1B model in 4-bit quantization mode, using roughly a quarter of the memory of 16-bit precision, so it fits comfortably on small GPUs.
from unsloth import FastLanguageModel
import torch
# Configuration
max_seq_length = 2048 # How many tokens each input can have
dtype = None # None for auto detection. Float16 for Tesla T4, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage
model_name = "unsloth/Llama-3.2-1B-Instruct"
# Load both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
)
- FastLanguageModel.from_pretrained: downloads and prepares the model + tokenizer
- max_seq_length: sets the max context length
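As an optional sanity check (this relies on the standard Hugging Face model API that the returned model inherits), you can see how little memory the quantized model occupies:
# Optional: report the in-memory size of the 4-bit model
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")  # roughly ~1 GB for a 1B model in 4-bit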
Instead of updating all model weights, PEFT adds small adapter layers you train. LoRA is one such method:
model = FastLanguageModel.get_peft_model(
model,
r=16, # Adapter rank (size)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"up_proj", "down_proj"
],
lora_alpha=16, # Scales adapter updates
lora_dropout=0, # No dropout
bias="none", # Skip bias updates
use_gradient_checkpointing="unsloth",
random_state=3407,
use_rslora=False, # Optional
loftq_config=None, # Optional
)
- r: Higher rank gives the adapter more capacity but uses more memory (see the toy sketch after this list)
- target_modules: Where LoRA adapters are added
- lora_alpha: Adjusts strength of LoRA updates
- use_gradient_checkpointing: Saves GPU memory by recomputing during backprop
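To build intuition for r and lora_alpha, here is a toy illustration of the LoRA idea in plain PyTorch (not Unsloth internals, just the underlying math): the frozen weight W stays untouched, and a low-rank update scaled by alpha/r is added on top, where only the two small matrices are trained.
# Toy LoRA math: W_adapted = W + (alpha / r) * B @ A, with only A and B trainable
import torch
d, r, alpha = 1024, 16, 16
W = torch.randn(d, d)                    # frozen pretrained weight (d x d)
A = torch.randn(r, d) * 0.01             # small trainable matrix (r x d)
B = torch.zeros(d, r)                    # small trainable matrix (d x r), starts at zero
W_adapted = W + (alpha / r) * (B @ A)    # effective weight during fine-tuning
print(W.numel(), A.numel() + B.numel())  # ~1,048,576 frozen vs 32,768 trainable values
On the real model, PEFT-wrapped models usually expose model.print_trainable_parameters() if you want to see how small the trainable fraction actually is.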
When preparing datasets for fine-tuning, format multi-turn conversations according to your model’s expected structure.
LLaMA 3.1 Chat Template Format (also used by Llama 3.2)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there! How can I assist you today?<|eot_id|>
- <|begin_of_text|> marks the start
- <|start_header_id|> / <|end_header_id|> mark roles
- <|eot_id|> ends each message
- Identify your current dataset format (CSV, ShareGPT, ChatML).
- Convert to a unified ShareGPT-like structure.
- Standardize to Hugging Face format (role/content) with standardize_sharegpt.
- Apply your model’s chat template via get_chat_template and apply_chat_template.
Here’s a gist that shows loading a CSV or Hugging Face ShareGPT dataset:
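For the ShareGPT path, here is a minimal sketch of the steps above; the dataset name is only an illustration, so substitute your own CSV or ShareGPT data, and note that it reuses the tokenizer loaded earlier.
# Sketch: load a ShareGPT-style dataset and render it with the Llama 3.1 chat template
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

dataset = load_dataset("mlabonne/FineTome-100k", split="train")    # example ShareGPT-style dataset
dataset = standardize_sharegpt(dataset)                            # from/value -> role/content
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    # Render each conversation into a single "text" string for SFTTrainer
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

nds = dataset.map(formatting_prompts_func, batched=True)           # used as train_dataset below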
5. Supervised Fine-Tuning with SFTTrainer
Why SFTTrainer? It structures and streamlines the fine-tuning process.
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=nds,  # the chat-formatted dataset prepared in the previous step
dataset_text_field="text",
max_seq_length=max_seq_length,
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
num_train_epochs=1,
max_steps=100,
learning_rate=2e-4,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
output_dir="outputs",
report_to="none",
),
)
Key settings:
- batch size (per_device_train_batch_size) controls how many examples per step.
- gradient_accumulation_steps simulates larger batches when memory-constrained.
- warmup_steps helps stabilize the learning rate initially.
- optim="adamw_8bit" uses an 8-bit optimizer to save memory.
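For example, with per_device_train_batch_size=2 and gradient_accumulation_steps=4, the effective batch size is 2 × 4 = 8 examples per optimizer step.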
6. Train on Responses Only & Run Training
Wrap the trainer to compute loss only on the model’s responses (the assistant turns), not on the user prompts:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
# Start training
stats = trainer.train()
You’ll see the loss drop steadily, showing your 4-bit quantized Llama 3.2 learning effectively.
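If you want a quick summary once training finishes, the object returned by trainer.train() carries the run metrics (standard transformers behavior):
# Inspect the metrics returned by trainer.train()
print(stats.metrics)  # includes train_loss, train_runtime, train_samples_per_second, ...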
7. Inference & Saving Your Model
Fast Inference Mode
# Wrap for quick replies
model = FastLanguageModel.for_inference(model)
# Prepare and move inputs to GPU
inputs = tokenizer.apply_chat_template(
[{"role":"user","content":"<Your Question Here>"}],
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to("cuda")
# Generate an answer
outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs)[0])
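If you would rather watch tokens appear as they are generated (handy in Colab), you can pass a transformers TextStreamer to generate; a small sketch:
# Optional: stream tokens to the console as they are generated
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=256)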
Save & Reload
# Save to Drive
model.save_pretrained("/content/drive/MyDrive/my_llama3_model")
tokenizer.save_pretrained("/content/drive/MyDrive/my_llama3_model")
# Reload in 4-bit mode
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="/content/drive/MyDrive/my_llama3_model",
load_in_4bit=True,
max_seq_length=2048,
)
# Quick test
inputs = tokenizer("What backup options are available for the CavernDB cluster?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Conclusion
Fine-tuning a large language model is all about balancing precision, speed, and the right data.
Precision vs. Quantization
Full-precision models (FP32) use ~4.3B possible values per weight; 4-bit cuts that down to 16 levels (tiny rounding error for massive memory savings).
Why 4-Bit Helps Resources
A 7B-parameter model needs roughly 28 GB just for its weights in FP32 and ~14 GB in FP16; in 4-bit, about 3.5 GB.
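The back-of-the-envelope arithmetic (weights only, ignoring activations and optimizer state):
# Rough memory needed just to store 7B weights
params = 7e9
print(params * 4 / 1e9, "GB in FP32")     # ~28 GB
print(params * 2 / 1e9, "GB in FP16")     # ~14 GB
print(params * 0.5 / 1e9, "GB in 4-bit")  # ~3.5 GB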
Unsloth’s Speed Boost
Unsloth’s optimized kernels can give ~2× training speed and ~70% VRAM reduction, all on free GPUs.
Picking the Right Dataset
Too small → overfitting. Too big/wild → underfitting. Aim for a focused set of high-quality examples in the right order.
Feedback and questions are always welcome! Dive into the Colab notebook linked in the comments below and give it a spin!
Further Reading & Resources
- Unsloth GitHub —
- Hugging Face Datasets —
- PEFT Paper (LoRA) —
- Google Colab —
#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization