Fine-Tuning LLMs for Production: A Practical Guide
From LoRA to QLoRA — techniques for efficiently customizing large language models for domain-specific tasks.
Sarah Chen
ML Infrastructure Lead
February 5, 2026
14 min read
Why Fine-Tune?
Base LLMs are generalists. For domain-specific tasks such as medical diagnosis, legal analysis, or code review, fine-tuning can substantially improve accuracy and reduce hallucinations on in-domain queries.
Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation) adds trainable rank-decomposition matrices to frozen model weights:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model with 4-bit quantization (QLoRA-style) to fit in GPU memory
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor: updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts: only a tiny fraction of the 70B weights train
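The rank-decomposition idea can be sketched in plain NumPy. The frozen weight W is never updated; two small matrices A and B learn the change, scaled by alpha / r as in the config above (the hidden size and initialization here are illustrative, not Llama-3's actual values):

```python
import numpy as np

d, r, alpha = 4096, 16, 32              # hidden size (illustrative), LoRA rank, scaling
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight (never trained)
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init so the
                                        # adapted model starts identical to the base

x = rng.standard_normal(d)

# LoRA forward pass: y = W x + (alpha / r) * B (A x)
y = W @ x + (alpha / r) * (B @ (A @ x))

full_params = d * d                     # updating W directly: ~16.8M params per matrix
lora_params = r * d + d * r             # updating A and B: ~131K params, under 1%
print(full_params, lora_params, lora_params / full_params)
```

Because B starts at zero, the adapter contributes nothing at initialization and training moves the model away from the base gradually, which is part of why LoRA is stable to train.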
Evaluation Beyond Perplexity
Production fine-tuning requires domain-specific evaluation: low perplexity on held-out text says little about whether the model answers your domain's questions correctly.
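A minimal sketch of what such an evaluation can look like, assuming a hypothetical list of (prompt, expected) pairs and a generate() function standing in for your fine-tuned model's inference call:

```python
# Hypothetical exact-match evaluation over a tiny domain test set.
# `generate` is a placeholder for the fine-tuned model; here it returns
# canned answers so the sketch runs standalone.
def generate(prompt: str) -> str:
    canned = {"What drug class is metformin?": "biguanide"}
    return canned.get(prompt, "unknown")

test_set = [
    ("What drug class is metformin?", "biguanide"),
    ("First-line treatment for type 2 diabetes?", "metformin"),
]

def evaluate(cases):
    # Normalize before comparing so formatting differences don't count as errors
    correct = sum(
        generate(prompt).strip().lower() == expected.lower()
        for prompt, expected in cases
    )
    return correct / len(cases)

print(f"exact-match accuracy: {evaluate(test_set):.2f}")  # 0.50 for this toy set
```

Exact match is the crudest possible metric; in practice you would layer on task-appropriate scoring (rubric grading, citation checks, latency budgets), but the harness shape stays the same.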
The goal isn't just accuracy — it's reliable, safe, and fast domain expertise.