Difference: LoRA vs QLoRA (summary)
LoRA (Low-Rank Adaptation)
- Technique: Injects small trainable low-rank matrices (adapters) into selected weight matrices of the model.
- Memory / compute: Only the adapter parameters are trained; the base model remains frozen. Very memory efficient compared to full fine-tuning.
- Accuracy: Keeps the full-precision base model; small accuracy impact when tuned properly.
- Use case: Good for resource-constrained fine-tuning where the base model is in full precision or fp16.
QLoRA (Quantized LoRA)
- Technique: First quantize the base model to 4-bit (or k-bit) using optimized quantization (e.g., bitsandbytes NF4 + double quantization), then apply LoRA adapters and train only those adapters.
- Memory / compute: Much lower memory footprint for the base model (enables using very large models on smaller GPUs). Slightly more CPU/GPU work for the quantization ops, but overall it enables larger models to be fine-tuned.
- Accuracy: When using modern quantization schemes (NF4 + proper calibration), accuracy is often close to fp16/32 baselines for many tasks.
- Use case: When you want to fine-tune very large models (7B, 13B, etc.) on limited GPU memory. QLoRA is the tool of choice for cost/compute-constrained training of large models.
Tradeoffs: LoRA alone is simpler and may be preferable for small/medium models. QLoRA allows fine-tuning much larger models with similar final adapter sizes, but requires `bitsandbytes` and special k-bit support.
Common tooling and installs
Run this once in your environment (Colab, local GPU, VM):
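A typical install set for this workflow is sketched below; the exact package list and versions are assumptions, so pin versions as your environment requires.

```bash
pip install -U transformers peft accelerate bitsandbytes datasets
```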
Notes:
- Use `accelerate` to run on multi-GPU or for better device mapping.
- QLoRA requires `bitsandbytes` (bnb) compiled for your system; in many Colab / modern Linux environments `pip install bitsandbytes` works.
Implementation 1 — LoRA (basic, fp16 / float)
This example uses `peft` LoRA with a small model (`gpt2`) to demonstrate the flow. Replace the model name with any causal LM supported by transformers.
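A minimal training sketch, assuming the Hugging Face `transformers`, `peft`, and `datasets` APIs; the dataset slice, hyperparameters, and output paths are placeholders to adapt to your task.

```python
# LoRA fine-tuning sketch (gpt2). Dataset, hyperparameters, and paths are
# placeholders; swap in your own data and model.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = AutoModelForCausalLM.from_pretrained(model_name)

# For GPT-2 the attention projections live in a single fused layer "c_attn";
# Llama-style models instead expose q_proj / k_proj / v_proj / o_proj.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapters are trainable

# Tiny demo corpus; replace with your own dataset.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-gpt2",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
        fp16=True,  # set to False on CPU-only setups
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Saves only the small adapter weights, not the full base model.
model.save_pretrained("lora-gpt2-adapter")
```

As a quick sanity check, `print_trainable_parameters()` should report well under 1% of the parameters as trainable, confirming that the base model stays frozen.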
Notes:
- `target_modules` needs to target the linear layers of the model architecture. For GPT-2 the names differ; for Llama-like models you typically target `q_proj`, `k_proj`, `v_proj`, `o_proj`, etc. Check the model internals (`model.named_modules()`) to find the exact names; a short inspection snippet follows these notes.
- LoRA training stores a small adapter file; inference uses the base model plus the loaded adapter.
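For reference, here is one quick (assumed) way to list candidate projection layers in a given architecture:

```python
# List candidate projection layers for target_modules (shown for gpt2;
# swap in the model you actually plan to fine-tune).
import torch.nn as nn
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D  # GPT-2 uses Conv1D for its projections

model = AutoModelForCausalLM.from_pretrained("gpt2")
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, Conv1D)):
        print(name, type(module).__name__)
# GPT-2 prints names like "transformer.h.0.attn.c_attn"; Llama-style models
# show "...self_attn.q_proj", "...self_attn.k_proj", and so on.
```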
Implementation 2 — QLoRA (4-bit base model + LoRA adapters)
This example demonstrates QLoRA: load a model in 4-bit with `bitsandbytes`, then apply PEFT LoRA. Use larger models to see the real benefit (7B+), but the same pattern works on smaller ones for testing.
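A minimal sketch of that pattern, assuming the `transformers` `BitsAndBytesConfig` integration and `peft`; the model name is an assumption (any causal LM you have access to works) and the hyperparameters are placeholders.

```python
# QLoRA sketch: 4-bit base model via bitsandbytes + LoRA adapters via peft.
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; use any causal LM you have access to

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 quantization
    bnb_4bit_use_double_quant=True,          # double quantization for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16,   # use torch.float16 if bf16 is unsupported
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)

# Cast norms/embeddings appropriately and make the quantized model trainable.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, training is the same as in the LoRA example (Trainer or a custom
# loop); gradients flow only into the adapter parameters.
```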
Important notes for QLoRA:
- For best results on larger models, use `bnb_4bit_quant_type="nf4"` and `bnb_4bit_use_double_quant=True`. These are the recommended settings to reduce quantization loss.
- Use `device_map="auto"` (requires `accelerate` and a recent transformers version) or set `device_map` manually.
- On some setups, `bnb_4bit_compute_dtype="bfloat16"` is better if the hardware supports it; otherwise use `"float16"`.
- QLoRA is most useful for models that would otherwise not fit in memory (7B+). For small models the overhead may not be worth it.
Best practices & tips
- Choose correct target modules: Module names differ between architectures; inspect `model.named_modules()` to identify the linear/projection layers to target.
- Batch size: QLoRA typically allows larger batch sizes due to the reduced model memory.
- Learning rate & warmup: LoRA/QLoRA are sensitive to the learning rate; common starting points are 1e-4 to 5e-4 for LoRA on large models, and lower (1e-5 to 5e-5) on smaller models. Monitor with a validation set.
- Save adapters only: Saving only the PEFT adapters yields small files and lets you combine them with the original base model at inference. Use `model.save_pretrained(path)` on the wrapped model; HF `peft` will store the adapter weights.
- Inference: For inference with adapters, load the base model (full or quantized) and then call `PeftModel.from_pretrained(base_model, adapter_path)`, or load a wrapped model with both combined (a minimal sketch follows this list).
- Hardware: QLoRA works best on GPUs supporting efficient 4-bit ops (modern CUDA + a correct bitsandbytes build). On multi-GPU, use `accelerate` for consistent device mapping.
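A minimal inference sketch following the save-adapters-only approach; the adapter path matches the placeholder used in the LoRA example above and is an assumption.

```python
# Inference sketch: load the frozen base model, then attach the saved adapter.
# "lora-gpt2-adapter" is the placeholder path from the LoRA example above.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base_model, "lora-gpt2-adapter")
model.eval()

inputs = tokenizer("LoRA adapters make fine-tuning", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,  # avoid a missing-pad-token warning
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```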
If you want, I can:
- Produce a short script you can run in Google Colab (ready to run, with model choice and minimal modifications).
- Provide a small section that shows how to load only the adapter for inference and run generation.
- Show exact `model.named_modules()` checks for a specific model you plan to use (tell me the model name), so `target_modules` is correct.