Fine-Tune SLMs in Colab for Free: A 4-Bit Approach with Meta Llama 3.2

Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.

This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.

In this article, you’ll learn how to:

  • Install and configure Unsloth in Colab
  • Load models in quantized (4-bit) mode to save memory
  • Understand core concepts (parameters, weights, biases, quantization, etc.)
  • Apply PEFT and LoRA adapters to fine-tune only a small part of the model
  • Prepare Q&A data for training with Hugging Face Datasets and chat templates
  • Use SFTTrainer for supervised fine-tuning
  • Switch to inference mode for faster generation
  • Save and reload your fine-tuned model
Getting Comfortable with Some Core Concepts






Disclaimer: I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental!

Language Model — A word predictor, like a smart autocomplete that predicts the next word based on what came before. It learns by “reading” massive amounts of text, modeling probabilities of word sequences.

Attention — Imagine you’re reading a sentence and highlighting which earlier words matter most to understand each new word. Attention lets the model weigh every word against every other, making predictions more accurate.
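
To make that weighing concrete, here is a toy self-attention pass in NumPy (no learned projections, purely illustrative):

import numpy as np

# Three token embeddings of size 4; real models use learned Q/K/V projections.
X = np.random.randn(3, 4)
Q, K, V = X, X, X
scores = Q @ K.T / np.sqrt(K.shape[-1])       # how strongly each token attends to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
context = weights @ V                         # attention-weighted mix of the value vectors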

Parameter — A number inside a model that can change during learning (like a dial the model tweaks).

Weight — Mostly synonymous with parameter: controls how strongly one part of input affects the output.

Data vs Parameters vs Weights — “Data” is the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing connection strengths.

Bias — A small extra number added so the model can shift outputs up or down, like a baseline adjustment.
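
For intuition, a deliberately tiny one-neuron example makes the roles of weight and bias concrete:

# One "neuron": w and b are the learnable parameters.
def neuron(x, w, b):
    return w * x + b  # w scales the input, b shifts the output

print(neuron(2.0, w=0.5, b=0.1))  # 1.1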

Transformer — A model built around attention, letting it “look” at every word in parallel. Introduced in 2017 by Google’s “Attention Is All You Need” paper, Transformers power today’s LLMs.

Quantization — Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.
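
A rough sketch of the idea using simple uniform quantization (real 4-bit schemes such as NF4 are more sophisticated):

import numpy as np

# Map float weights onto 16 evenly spaced levels (2**4 = 16).
w = np.random.randn(8).astype(np.float32)
scale = (w.max() - w.min()) / 15
codes = np.round((w - w.min()) / scale).astype(np.uint8)  # values 0..15, storable in 4 bits
w_restored = codes * scale + w.min()                      # small rounding error vs. the original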

PEFT — Parameter-Efficient Fine-Tuning: updating only tiny adapter layers instead of the whole model.

LoRA — Low-Rank Adaptation: teaches a huge AI model new tricks by tweaking only a tiny part of it. You “freeze” most parameters and insert two small, trainable matrices in each layer; only these matrices learn during fine-tuning, cutting time and compute cost.

LoRA “r” — The adapter’s rank (size). Higher r gives more capacity but uses more memory.

LoRA α (alpha) — A scaling factor for adapter updates — like a “volume knob” for learning strength.
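
To make the two-matrix picture concrete, here is a tiny NumPy sketch of the LoRA update (shapes are illustrative; real adapters sit inside the attention and MLP projection layers):

import numpy as np

d, r, alpha = 8, 2, 16             # hidden size, adapter rank, scaling factor
W = np.random.randn(d, d)          # frozen pretrained weight, never updated
A = np.random.randn(r, d) * 0.01   # small trainable matrix
B = np.zeros((d, r))               # second trainable matrix, initialized to zero

# Effective weight during fine-tuning: only A and B receive gradients.
W_effective = W + (alpha / r) * (B @ A)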

Dropout — Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).

Gradient Checkpointing — Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.

4-bit Mode — Quantized mode storing weights in 4 bits, cutting memory by ~4× compared to 16-bit (and ~8× compared to 32-bit).

Inference Mode — After training, use a special mode optimized for fast text generation (≈2× speed).

Overfitting — When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.

Checkpoint — A saved snapshot of model weights you can reload later.

Token — A small chunk of text (rule of thumb: ~4–5 characters) — a word, part of a word, punctuation, or symbol — that the model processes.

Tokenizer — The program that “cuts” raw text into tokens and converts each token into a unique ID.
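
A quick way to see tokens in practice, assuming the Hugging Face transformers library is installed and the tokenizer can be downloaded:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
ids = tok.encode("Fine-tuning is fun!")   # text -> list of integer token IDs
print(ids)
print(tok.convert_ids_to_tokens(ids))     # the text chunks behind those IDs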

SLMs vs. LLMs






  • SLMs (Small Language Models) have fewer parameters and focus on specific tasks or domains — like pocket calculators solving one type of problem.
  • LLMs (Large Language Models) are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding.
  • SLMs require less computing power and are ideal on-device; LLMs need massive cloud resources but offer broader versatility.
1. Getting Started: Colab Setup


Why Google Colab & Tesla T4?

  • Cost: Free GPU access
  • Performance: Tesla T4 handles mid-size LLMs effectively with quantization and PEFT
  • Accessibility: No local GPU required — ideal for beginners
Installing Unsloth


# Stable release from PyPI:
!pip install unsloth

# OR

# Install the nightly (latest GitHub) build for cutting-edge features
# (repository URLs per the Unsloth docs):
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git git+https://github.com/unslothai/unsloth-zoo.git


  • pip install unsloth: grabs the vetted, stable version
  • uninstall & install: fetches the newest commits from GitHub
2. Loading a Model Efficiently


We load the Llama 3.2 1B model in 4-bit quantization mode, using roughly a quarter of the memory of 16-bit precision (and about an eighth of 32-bit), so it fits and trains comfortably on small GPUs.


from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048  # How many tokens each input can have
dtype = None           # None for auto-detection; float16 for Tesla T4, bfloat16 for Ampere+
load_in_4bit = True    # Use 4-bit quantization to reduce memory usage
model_name = "unsloth/Llama-3.2-1B-Instruct"

# Load both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
  • FastLanguageModel.from_pretrained: downloads and prepares the model + tokenizer
  • max_seq_length: sets the max context length
3. Introducing PEFT & LoRA


Instead of updating all model weights, PEFT adds small adapter layers you train. LoRA is one such method:


model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Adapter rank (size)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj",
    ],
    lora_alpha=16,   # Scales adapter updates
    lora_dropout=0,  # No dropout
    bias="none",     # Skip bias updates
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,   # Optional
    loftq_config=None,  # Optional
)
  • r: Higher may improve learning but uses more memory
  • target_modules: Where LoRA adapters are added
  • lora_alpha: Adjusts strength of LoRA updates
  • use_gradient_checkpointing: Saves GPU memory by recomputing during backprop
4. Preparing Your Dataset for Training


When preparing datasets for fine-tuning, format multi-turn conversations according to your model’s expected structure.

LLaMA 3.1 Chat Template Format (also used by Llama 3.2)


<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there! How can I assist you today?<|eot_id|>
  • <|begin_of_text|> marks the start
  • <|start_header_id|> / <|end_header_id|> mark roles
  • <|eot_id|> ends each message
Converting Between Formats

  1. Identify your current dataset format (CSV, ShareGPT, ChatML).
  2. Convert to a unified ShareGPT-like structure.
  3. Standardize to Hugging Face format (role/content) with standardize_sharegpt.
  4. Apply your model’s chat template via get_chat_template and apply_chat_template.

Here’s a gist that shows loading a CSV or Hugging Face ShareGPT dataset:
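
A minimal sketch of that flow, assuming a CSV with question and answer columns (the file name, column names, and the llama-3.1 template choice are illustrative):

from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Assumption: qa_pairs.csv has "question" and "answer" columns.
raw = load_dataset("csv", data_files="qa_pairs.csv", split="train")

# Steps 1-2: convert each row into a ShareGPT-style "conversations" structure.
def to_conversations(row):
    return {"conversations": [
        {"from": "human", "value": row["question"]},
        {"from": "gpt", "value": row["answer"]},
    ]}
raw = raw.map(to_conversations, remove_columns=raw.column_names)

# Step 3: standardize to Hugging Face role/content format.
raw = standardize_sharegpt(raw)

# Step 4: render each conversation with the model's chat template into a "text" field.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
def to_text(batch):
    texts = [tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
             for c in batch["conversations"]]
    return {"text": texts}
nds = raw.map(to_text, batched=True)  # used as train_dataset below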

5. Supervised Fine-Tuning with SFTTrainer


Why SFTTrainer? It wraps the standard training loop (batching, loss computation, logging, checkpointing) so supervised fine-tuning takes only a few lines of configuration.


from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=nds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        max_steps=100,  # max_steps takes precedence over num_train_epochs
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        output_dir="outputs",
        report_to="none",
    ),
)

Key settings:

  • per_device_train_batch_size controls how many examples are processed per step.
  • gradient_accumulation_steps simulates larger batches when memory is tight; here the effective batch size is 2 × 4 = 8.
  • warmup_steps helps stabilize the learning rate initially.
  • optim="adamw_8bit" uses an 8-bit optimizer to save memory.
6. Kicking Off the Training


Wrap the trainer to compute loss only on the model’s responses:


from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# Start training
stats = trainer.train()

You’ll see the loss drop steadily, showing your 4-bit quantized Llama 3.2 learning effectively.

7. Inference & Saving Your Model


Fast Inference Mode


# Wrap for quick replies
model = FastLanguageModel.for_inference(model)

# Prepare and move inputs to GPU
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<Your Question Here>"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Generate an answer
outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs)[0])

Save & Reload


# Save to Drive
model.save_pretrained("/content/drive/MyDrive/my_llama3_model")
tokenizer.save_pretrained("/content/drive/MyDrive/my_llama3_model")

# Reload in 4-bit mode
from transformers import AutoTokenizer
from unsloth import FastLanguageModel

tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/my_llama3_model")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/my_llama3_model",
    load_in_4bit=True,
    max_seq_length=2048,
)

# Quick test
inputs = tokenizer("What backup options are available for the CavernDB cluster?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Conclusion






Fine-tuning a large language model is all about balancing precision, speed, and the right data.


  1. Precision vs. Quantization
    Full-precision models (FP32) use ~4.3B possible values per weight; 4-bit cuts that down to 16 levels (tiny rounding error for massive memory savings).


  2. Why 4-Bit Helps Resources
    A 7B-parameter model needs ~28 GB for its weights in FP32 and ~14 GB in FP16; in 4-bit that drops to roughly 3.5–4 GB (see the quick arithmetic sketch after this list).


  3. Unsloth’s Speed Boost
    Unsloth’s optimized kernels can give ~2× training speed and ~70% VRAM reduction, all on free GPUs.


  4. Picking the Right Dataset
    Too small → overfitting. Too large or noisy → underfitting. Aim for a focused set of high-quality examples, presented in a sensible order.
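
A quick back-of-the-envelope check of the memory numbers above (weights only, ignoring activations and optimizer state):

params = 7e9  # a 7B-parameter model
for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("4-bit", 0.5)]:
    print(f"{name}: {params * bytes_per_weight / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, 4-bit: 3.5 GB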

Feedback and questions are always welcome! Dive into the Colab notebook linked in the comments below and give it a spin.

Further Reading & Resources


#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization

