Difference: LoRA vs QLoRA (summary)

LoRA (Low-Rank Adaptation)

  • Technique: Injects small trainable low-rank matrices (adapters) into selected weight matrices of the model (a minimal sketch of the idea follows this list).

  • Memory / compute: Only adapter parameters are trained; base model remains frozen. Very memory efficient compared to full fine-tuning.

  • Accuracy: Keeps full-precision base model; small accuracy impact when tuned properly.

  • Use case: Good for resource-constrained fine-tuning where base model is in full precision or fp16.
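
Conceptually, LoRA keeps the pretrained weight W frozen and learns a low-rank update, so the layer output becomes W x + (alpha/r) * B A x, where A and B have rank r and are the only trained parameters. The snippet below is a plain-PyTorch illustration of that idea (a minimal sketch for intuition only, not the peft implementation; the examples later in this post use peft's LoraConfig and get_peft_model instead):

# Minimal LoRA illustration (sketch only; peft handles this for real training)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base_linear                      # frozen pretrained layer (W)
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Low-rank factors: A maps d_in -> r, B maps r -> d_out; only these are trained.
        self.lora_A = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)           # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        # y = W x + (alpha/r) * B A x
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Usage: wrap a 768x768 projection; only the two small factor matrices are trainable.
layer = LoRALinear(nn.Linear(768, 768), r=16, alpha=32)
out = layer(torch.randn(2, 768))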

QLoRA (Quantized LoRA)

  • Technique: First quantizes the base model to 4-bit (or another k-bit precision) with an optimized scheme (e.g., bitsandbytes NF4 plus double quantization), then applies LoRA adapters and trains only those adapters.

  • Memory / compute: Much lower memory footprint for the base model, which lets very large models fit on smaller GPUs. Dequantization adds some compute overhead, but the net effect is that far larger models become trainable.

  • Accuracy: With modern quantization schemes (NF4 plus double quantization), accuracy is often close to fp16/fp32 baselines for many tasks.

  • Use case: Fine-tuning very large models (7B, 13B, and up) on limited GPU memory; QLoRA is the tool of choice for cost- and compute-constrained training of large models.

Tradeoffs: LoRA alone is simpler and may be preferable for small/medium models. QLoRA allows fine-tuning much larger models with similarly small final adapter sizes, but it requires bitsandbytes and k-bit training support in peft/transformers.


Common tooling and installs

Run this once in your environment (Colab, local GPU, VM):

pip install transformers datasets accelerate peft bitsandbytes

# Optional: for optimized export/runtime
pip install "optimum[export]"

Notes:

  • Use accelerate for multi-GPU runs and better device mapping.

  • QLoRA requires bitsandbytes (bnb) compiled for your system; in many Colab / modern Linux environments pip install bitsandbytes works.
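
If you are unsure whether your environment can actually run 4-bit training, a quick sanity check like the following helps before downloading a large model (a minimal sketch; it only verifies that a GPU is visible and that bitsandbytes imports cleanly):

# Environment sanity check before attempting 4-bit loading
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())

try:
    import bitsandbytes as bnb
    print("bitsandbytes version:", bnb.__version__)
except ImportError as e:
    print("bitsandbytes not usable:", e)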


Implementation 1 — LoRA (basic, fp16 / float)

This example uses peft LoRA with a small model (gpt2) to demonstrate the flow. Replace the model name with any causal LM supported by transformers.

# LoRA fine-tuning example (PEFT)
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model

# 1) Load dataset (e.g., Alpaca-like) and format into a single text column
dataset = load_dataset("yahma/alpaca-cleaned")  # example dataset

def format_example(ex):
    if ex.get("input"):
        return {"text": f"### Instruction:\n{ex['instruction']}\n### Input:\n{ex['input']}\n### Response:\n{ex['output']}"}
    return {"text": f"### Instruction:\n{ex['instruction']}\n### Response:\n{ex['output']}"}

dataset = dataset.map(format_example)

# 2) Tokenizer and model
model_name = "gpt2"  # replace with your chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
# (Optional) mixed-precision training is handled by the Trainer via the fp16 flag below

# 3) Prepare LoRA config and wrap model
lora_config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2 attention projection; names depend on the architecture (Llama-style models use q_proj, k_proj, v_proj, o_proj)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 4) Tokenize dataset
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])

# 5) Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# 6) TrainingArguments + Trainer
training_args = TrainingArguments(
    output_dir="./lora-gpt2",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    logging_steps=50,
    save_steps=500,
    learning_rate=2e-5,
    fp16=True,  # if available
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# 7) Train
trainer.train()

# 8) Save only the PEFT adapters (small)
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")

Notes:

  • target_modules must name the linear/projection layers of your architecture. GPT-2 uses c_attn (a fused QKV projection); Llama-like models typically use q_proj, k_proj, v_proj, o_proj. Inspect model.named_modules() to find the exact names; a helper snippet follows these notes.

  • LoRA training stores a small adapter file; inference uses base model + loaded adapter.
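
To find valid target_modules names for whatever model you pick, you can list its linear-style layers directly (a minimal sketch; the Conv1D import assumes a recent transformers version, and gpt2 is just a placeholder):

# List candidate layers for LoRA targeting
import torch.nn as nn
from transformers import AutoModelForCausalLM
from transformers.pytorch_utils import Conv1D  # GPT-2 uses Conv1D instead of nn.Linear

model = AutoModelForCausalLM.from_pretrained("gpt2")  # substitute your model
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, Conv1D)):
        print(name)
# target_modules matches on the last component of these names, e.g. "c_attn" or "q_proj".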


Implementation 2 — QLoRA (4-bit base model + LoRA adapters)

This example demonstrates QLoRA: load the base model in 4-bit with bitsandbytes, then apply PEFT LoRA adapters. Use a larger model (7B+) to see the real benefit, but the same pattern works on smaller ones for testing.

# QLoRA fine-tuning example (bitsandbytes + PEFT)
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1) Dataset formatting (same as LoRA)
dataset = load_dataset("yahma/alpaca-cleaned")

def format_example(ex):
    if ex.get("input"):
        return {"text": f"### Instruction:\n{ex['instruction']}\n### Input:\n{ex['input']}\n### Response:\n{ex['output']}"}
    return {"text": f"### Instruction:\n{ex['instruction']}\n### Response:\n{ex['output']}"}

dataset = dataset.map(format_example)

# 2) Tokenizer
model_name = "gpt2"  # for a real QLoRA run use a larger model like "facebook/opt-6.7b" or "tiiuae/falcon-7b" if you have access
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 3) bitsandbytes config for 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # nf4 is recommended for LLMs
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",  # or "float16" depending on your hardware
)

# 4) Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                  # requires accelerate or transformers native device map support
    quantization_config=bnb_config,
    trust_remote_code=False,
)

# 5) Prepare model for k-bit training (adds necessary wrappers, enables gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)

# 6) Apply PEFT LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],          # adjust per architecture; Llama-style models use q_proj, k_proj, v_proj, o_proj
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# 7) Tokenize and data collator
def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# 8) TrainingArguments + Trainer
training_args = TrainingArguments(
    output_dir="./qlora-model",
    num_train_epochs=2,
    per_device_train_batch_size=4,      # can be larger due to 4-bit memory savings
    learning_rate=2e-4,
    fp16=True,                          # use bf16=True instead if your compute dtype is bfloat16
    logging_steps=50,
    save_steps=500,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# 9) Train
trainer.train()

# 10) Save the PEFT adapter & (optionally) push to the HF Hub
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")

Important notes for QLoRA:

  • For best results on larger models, use bnb_4bit_quant_type="nf4" and bnb_4bit_use_double_quant=True. These are recommended settings to reduce quantization loss.

  • Use device_map="auto" (requires accelerate and recent transformers) or set device_map manually.

  • On some setups, bnb_4bit_compute_dtype="bfloat16" is better if hardware supports it; otherwise use "float16".

  • QLoRA is most useful for models that would otherwise not fit in memory (7B+). For small models the overhead might not be worth it.
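
To see what the 4-bit base actually buys you, it is worth printing the trainable-parameter ratio and the model's memory footprint right after wrapping with LoRA (a minimal sketch; it assumes model is the PEFT-wrapped model from the example above):

# Inspect trainable parameters and memory use after get_peft_model(...)
model.print_trainable_parameters()   # e.g. "trainable params: ... || all params: ... || trainable%: ..."
print(f"Base model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")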


Best practices & tips

  1. Choose correct target modules: Module names differ between architectures; inspect model.named_modules() to identify linear/projection layers to target.

  2. Batch size: QLoRA typically allows larger batch sizes due to reduced model memory.

  3. Learning rate & warmup: LoRA/QLoRA are sensitive to LR; common start points: 1e-4 to 5e-4 for LoRA on large models, lower (1e-5 to 5e-5) on smaller models. Monitor with a validation set.

  4. Save adapters only: Saving only PEFT adapters yields small files and lets you combine with the original base model at inference. Use model.save_pretrained(path) for the wrapped model; HF peft will store the adapter weights.

  5. Inference: Load the base model (full precision or quantized), then attach the adapter with PeftModel.from_pretrained(base_model, adapter_path), or load a checkpoint where the two are already merged; a short example follows this list.

  6. Hardware: QLoRA works best on GPUs supporting efficient 4-bit ops (modern CUDA + correct bitsandbytes build). On multi-GPU, use accelerate for consistent device mapping.


Possible extensions to this post:

  • A short, ready-to-run Google Colab script with a model choice and minimal modifications.

  • A fuller section on loading only the adapter for inference and running generation (a minimal sketch appears after the best-practices list above).

  • Exact model.named_modules() checks for a specific model, so that target_modules is correct for your architecture.
