Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.
This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.
In this article, you’ll learn how to:
- Install and configure Unsloth in Colab
- Load models in quantized (4-bit) mode to save memory
- Understand core concepts (parameters, weights, biases, quantization, etc.)
- Apply PEFT and LoRA adapters to fine-tune only a small part of the model
- Prepare Q&A data for training with Hugging Face Datasets and chat templates
- Use SFTTrainer for supervised fine-tuning
- Switch to inference mode for faster generation
- Save and reload your fine-tuned model
**Disclaimer:** I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental!
Language Model — Word-Predictor, like a smart autocomplete that predicts the next word based on what came before. It learns by “reading” massive amounts of text, modeling probabilities of word sequences.
Attention — Imagine you’re reading a sentence and highlighting which earlier words matter most to understand each new word. Attention lets the model weigh every word against every other, making predictions more accurate.
Parameter — A number inside a model that can change during learning (like a dial the model tweaks).
Weight — Mostly synonymous with parameter: controls how strongly one part of input affects the output.
Bias — A small extra number added so the model can shift outputs up or down, like a baseline adjustment.
Data vs Parameters vs Weights — “Data” is the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing connection strengths.
Transformer — A model built around attention, letting it “look” at every word in parallel. Introduced in 2017 by Google’s “Attention Is All You Need” paper, Transformers power today’s LLMs.
Quantization — Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.
PEFT — (Parameter-Efficient Fine-Tuning) — updating only tiny adapter layers instead of the whole model.
LoRA — (Low-Rank Adaptation) — Teaches a huge AI model new tricks by tweaking only a tiny part of it. You “freeze” most parameters and insert two small, trainable matrices in each layer; only these matrices learn during fine-tuning, cutting time and compute cost.
LoRA “r” — The adapter’s rank (size). Higher r gives more capacity but uses more memory.
LoRA α (alpha) — A scaling factor for adapter updates — like a “volume knob” for learning strength.
Dropout — Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).
Gradient Checkpointing — Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.
4-bit Mode — Quantized mode storing weights in 4 bits, cutting memory by ~4× compared to 16/32-bit.
Inference Mode — After training, use a special mode optimized for fast text generation (≈2× speed).
Overfitting — When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.
Checkpoint — A saved snapshot of model weights you can reload later.
Token — A small chunk of text (rule of thumb: ~4–5 characters) — a word, part of a word, punctuation, or symbol — that the model processes.
Tokenizer — The program that “cuts” raw text into tokens and converts each token into a unique ID.
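To make these last two terms concrete, here is a tiny sketch; it assumes the transformers library and the same Llama 3.2 tokenizer used later in this guide, but any tokenizer works the same way.
# Minimal sketch: turn text into tokens and token IDs
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
text = "Unsloth makes fine-tuning easy!"
ids = tokenizer.encode(text)                   # token IDs the model actually sees
tokens = tokenizer.convert_ids_to_tokens(ids)  # human-readable token pieces
print(tokens)  # e.g. ['<|begin_of_text|>', 'Un', 'sl', 'oth', ' makes', ...]
print(ids)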
SLMs vs. LLMs
- SLMs (Small Language Models) have fewer parameters and focus on specific tasks or domains — like pocket calculators solving one type of problem.
- LLMs (Large Language Models) are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding.
- SLMs require less computing power and are ideal on-device; LLMs need massive cloud resources but offer broader versatility.
Why Google Colab & Tesla T4?
- Cost: Free GPU access
- Performance: Tesla T4 handles mid-size LLMs effectively with quantization and PEFT
- Accessibility: No local GPU required — ideal for beginners
# Stable release from PyPI:
!pip install unsloth
# OR
# Install the Nightly (latest GitHub) for cutting-edge features:
!pip uninstall unsloth -y && \
pip install --upgrade --no-cache-dir --no-deps \
git+ \
git+
- pip install unsloth: grabs the vetted, stable version
- uninstall & install: fetches the newest commits from GitHub
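Before loading the model, it is worth a quick check that Colab actually assigned you a GPU (plain PyTorch, nothing Unsloth-specific):
# Sanity check: confirm a CUDA GPU (e.g. the free Tesla T4) is visible
import torch
print(torch.cuda.is_available())          # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"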
We load the Llama 3.2 1B model in 4-bit quantization mode, using roughly a quarter of the memory of 16-bit precision, so it fits comfortably on small GPUs.
from unsloth import FastLanguageModel
import torch
# Configuration
max_seq_length = 2048 # How many tokens each input can have
dtype = None # None for auto detection. Float16 for Tesla T4, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage
model_name = "unsloth/Llama-3.2-1B-Instruct"
# Load both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=model_name,
max_seq_length=max_seq_length,
dtype=dtype,
load_in_4bit=load_in_4bit,
)
- FastLanguageModel.from_pretrained: downloads and prepares the model + tokenizer
- max_seq_length: sets the max context length
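As an optional sanity check (this relies on the standard Hugging Face model API that the returned model inherits), you can see how little memory the quantized model occupies:
# Optional: report the in-memory size of the 4-bit model
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")  # roughly ~1 GB for a 1B model in 4-bit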
Instead of updating all model weights, PEFT adds small adapter layers you train. LoRA is one such method:
model = FastLanguageModel.get_peft_model(
model,
r=16, # Adapter rank (size)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"up_proj", "down_proj"
],
lora_alpha=16, # Scales adapter updates
lora_dropout=0, # No dropout
bias="none", # Skip bias updates
use_gradient_checkpointing="unsloth",
random_state=3407,
use_rslora=False, # Optional
loftq_config=None, # Optional
)
- r: Higher rank gives the adapter more capacity but uses more memory (see the toy sketch after this list)
- target_modules: Where LoRA adapters are added
- lora_alpha: Adjusts strength of LoRA updates
- use_gradient_checkpointing: Saves GPU memory by recomputing during backprop
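To build intuition for r and lora_alpha, here is a toy illustration of the LoRA idea in plain PyTorch (not Unsloth internals, just the underlying math): the frozen weight W stays untouched, and a low-rank update scaled by alpha/r is added on top, where only the two small matrices are trained.
# Toy LoRA math: W_adapted = W + (alpha / r) * B @ A, with only A and B trainable
import torch
d, r, alpha = 1024, 16, 16
W = torch.randn(d, d)                    # frozen pretrained weight (d x d)
A = torch.randn(r, d) * 0.01             # small trainable matrix (r x d)
B = torch.zeros(d, r)                    # small trainable matrix (d x r), starts at zero
W_adapted = W + (alpha / r) * (B @ A)    # effective weight during fine-tuning
print(W.numel(), A.numel() + B.numel())  # ~1,048,576 frozen vs 32,768 trainable values
On the real model, PEFT-wrapped models usually expose model.print_trainable_parameters() if you want to see how small the trainable fraction actually is.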
When preparing datasets for fine-tuning, format multi-turn conversations according to your model’s expected structure.
LLaMA 3.1 Chat Template Format (also used by Llama 3.2)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there! How can I assist you today?<|eot_id|>
- <|begin_of_text|> marks the start
- <|start_header_id|> / <|end_header_id|> mark roles
- <|eot_id|> ends each message
- Identify your current dataset format (CSV, ShareGPT, ChatML).
- Convert to a unified ShareGPT-like structure.
- Standardize to Hugging Face format (role/content) with standardize_sharegpt.
- Apply your model’s chat template via get_chat_template and apply_chat_template.
Here’s a gist that shows loading a CSV or Hugging Face ShareGPT dataset:
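For the ShareGPT path, here is a minimal sketch of the steps above; the dataset name is only an illustration, so substitute your own CSV or ShareGPT data, and note that it reuses the tokenizer loaded earlier.
# Sketch: load a ShareGPT-style dataset and render it with the Llama 3.1 chat template
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

dataset = load_dataset("mlabonne/FineTome-100k", split="train")    # example ShareGPT-style dataset
dataset = standardize_sharegpt(dataset)                            # from/value -> role/content
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    # Render each conversation into a single "text" string for SFTTrainer
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

nds = dataset.map(formatting_prompts_func, batched=True)           # used as train_dataset below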
5. Supervised Fine-Tuning with SFTTrainer
Why SFTTrainer? It structures and streamlines the fine-tuning process.
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=nds,  # the chat-formatted dataset prepared in the previous step
dataset_text_field="text",
max_seq_length=max_seq_length,
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
num_train_epochs=1,
max_steps=100,
learning_rate=2e-4,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
output_dir="outputs",
report_to="none",
),
)
Key settings:
- batch size (per_device_train_batch_size) controls how many examples per step.
- gradient_accumulation_steps simulates larger batches when memory-constrained.
- warmup_steps helps stabilize the learning rate initially.
- optim="adamw_8bit" uses an 8-bit optimizer to save memory.
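For example, with per_device_train_batch_size=2 and gradient_accumulation_steps=4, the effective batch size is 2 × 4 = 8 examples per optimizer step.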
6. Train on Responses Only & Run Training
Wrap the trainer to compute loss only on the model’s responses (the assistant turns), not on the user prompts:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
# Start training
stats = trainer.train()
You’ll see the loss drop steadily, showing your 4-bit quantized Llama 3.2 learning effectively.
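If you want a quick summary once training finishes, the object returned by trainer.train() carries the run metrics (standard transformers behavior):
# Inspect the metrics returned by trainer.train()
print(stats.metrics)  # includes train_loss, train_runtime, train_samples_per_second, ...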
7. Inference & Saving Your Model
Fast Inference Mode
# Wrap for quick replies
model = FastLanguageModel.for_inference(model)
# Prepare and move inputs to GPU
inputs = tokenizer.apply_chat_template(
[{"role":"user","content":"<Your Question Here>"}],
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
).to("cuda")
# Generate an answer
outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs)[0])
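If you would rather watch tokens appear as they are generated (handy in Colab), you can pass a transformers TextStreamer to generate; a small sketch:
# Optional: stream tokens to the console as they are generated
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=256)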
Save & Reload
# Save to Drive
model.save_pretrained("/content/drive/MyDrive/my_llama3_model")
tokenizer.save_pretrained("/content/drive/MyDrive/my_llama3_model")
# Reload in 4-bit mode
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="/content/drive/MyDrive/my_llama3_model",
load_in_4bit=True,
max_seq_length=2048,
)
# Quick test
inputs = tokenizer("What backup options are available for the CavernDB cluster?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Conclusion
Fine-tuning a large language model is all about balancing precision, speed, and the right data.
Precision vs. Quantization
Full-precision models (FP32) use ~4.3B possible values per weight; 4-bit cuts that down to 16 levels (tiny rounding error for massive memory savings).
Why 4-Bit Helps Resources
A 7B-parameter model needs roughly 28 GB just for its weights in FP32 and ~14 GB in FP16; in 4-bit, about 3.5 GB.
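The back-of-the-envelope arithmetic (weights only, ignoring activations and optimizer state):
# Rough memory needed just to store 7B weights
params = 7e9
print(params * 4 / 1e9, "GB in FP32")     # ~28 GB
print(params * 2 / 1e9, "GB in FP16")     # ~14 GB
print(params * 0.5 / 1e9, "GB in 4-bit")  # ~3.5 GB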
Unsloth’s Speed Boost
Unsloth’s optimized kernels can give ~2× training speed and ~70% VRAM reduction, all on free GPUs.
Picking the Right Dataset
Too small → overfitting. Too big/wild → underfitting. Aim for a focused set of high-quality examples in the right order.
Feedback and questions are always welcome! Dive into the Colab notebook linked in the comments below and give it a spin!
Further Reading & Resources
- Unsloth GitHub —
- Hugging Face Datasets —
- PEFT Paper (LoRA) —
- Google Colab —
#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization