Fine-Tuning Large Language Models: A Practical Guide from GPT to Llama
Fine-tuning LLMs can dramatically improve performance for domain-specific tasks. This guide covers the complete process from data preparation to deployment.
When to Fine-Tune vs. RAG
Before diving in, understand when fine-tuning makes sense:
Use Fine-Tuning When:
- You need consistent behavior and style
- Teaching new formats or structured outputs
- Domain-specific language or terminology
- Improving reasoning on specific task types
Use RAG When:
- You need access to changing information
- Working with large knowledge bases
- Source attribution is important
- Lower cost and faster iteration preferred
Use Both When:
- Complex domain-specific applications
- Need both knowledge and behavior modification
Fine-Tuning OpenAI Models
1. Data Preparation
OpenAI fine-tuning uses JSONL format with system, user, and assistant messages:
```python
import json

training_data = []
for example in your_data:
    training_data.append({
        "messages": [
            {"role": "system", "content": "You are a medical AI assistant specializing in cardiology."},
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]}
        ]
    })

# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
```
Data Quality Tips:
- Minimum 50 examples, ideally 500+ for best results
- Ensure examples are diverse and representative
- Validate formatting with OpenAI's validation script
- Balance your dataset across different input types
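Before uploading, a quick local check catches most formatting problems. The sketch below is illustrative, not OpenAI's official validation script; it only verifies that each line parses and has the expected chat structure:

```python
import json

def validate_jsonl(path):
    """Spot-check chat-format JSONL before uploading."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing 'messages' list")
                continue
            for m in messages:
                if m.get("role") not in {"system", "user", "assistant"}:
                    errors.append(f"line {i}: unexpected role {m.get('role')!r}")
                if not isinstance(m.get("content"), str):
                    errors.append(f"line {i}: content must be a string")
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message should be the assistant reply")
    return errors
```

Run it on `training_data.jsonl` and fix anything it reports before creating the job; a single malformed line can fail the whole upload.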
2. Upload and Create Fine-Tune Job
```python
from openai import OpenAI

client = OpenAI()

# Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 0.1
    }
)
print(f"Fine-tune job created: {fine_tune_job.id}")
```
3. Monitor Training
```python
# Check status
job_status = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
print(f"Status: {job_status.status}")

# List events
events = client.fine_tuning.jobs.list_events(fine_tune_job.id, limit=10)
for event in events.data:
    print(event.message)
```
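Jobs can take anywhere from minutes to hours, so it is convenient to poll until a terminal status is reached. Here is a minimal, generic poller; the `get_status` callable is an assumption of this sketch, and in practice you would wrap `client.fine_tuning.jobs.retrieve(job_id).status`:

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=30, max_polls=1000):
    """Poll get_status() until it returns a terminal status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")
```

Keeping the status source injectable also makes the helper trivial to test without hitting the API.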
4. Use Your Fine-Tuned Model
```python
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[
        {"role": "system", "content": "You are a medical AI assistant."},
        {"role": "user", "content": "What are the symptoms of myocardial infarction?"}
    ]
)
print(response.choices[0].message.content)
```
Fine-Tuning Open-Source Models (Llama, Mistral)
For open-source models, we use Hugging Face's transformers and PEFT (Parameter-Efficient Fine-Tuning).
Setup Environment
```bash
pip install transformers datasets peft bitsandbytes accelerate
```
LoRA Fine-Tuning
LoRA (Low-Rank Adaptation) is efficient and requires less GPU memory:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch

# Load base model in 4-bit
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # Rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Which attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: 8,388,608 || all params: ~6.7B || trainable%: ~0.12
```
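The trainable-parameter count can be sanity-checked by hand: each adapted module gains two low-rank matrices (d×r and r×d). For Llama-2-7B (hidden size 4096, 32 layers) with r=16 on `q_proj` and `v_proj`, the arithmetic works out to about 8.4M parameters:

```python
def lora_trainable_params(d_model, rank, n_modules, n_layers):
    """LoRA adds two matrices (d_model x rank and rank x d_model) per adapted module per layer."""
    return 2 * d_model * rank * n_modules * n_layers

# Llama-2-7B: hidden size 4096, 32 layers, q_proj + v_proj at r=16
count = lora_trainable_params(4096, 16, 2, 32)  # 8,388,608
```

Doubling the rank or adding more target modules (e.g. `k_proj`, `o_proj`) scales this linearly, which is a useful lever when trading quality against memory.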
Data Preprocessing
```python
def format_instruction(example):
    """Format your data into instruction-following format."""
    instruction = example["instruction"]
    input_text = example.get("input", "")
    output = example["output"]
    if input_text:
        prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}"""
    else:
        prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""
    return {"text": prompt}
```
```python
# Load and process dataset
dataset = load_dataset("your-dataset")
dataset = dataset.map(format_instruction)

# Tokenize
def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize, batched=True)
```
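With `max_length=512`, longer examples are silently truncated, and for instruction data that often cuts off exactly the response the model is supposed to learn. A quick audit of your formatted texts is cheap insurance. The helper below is a self-contained sketch; the whitespace split is only a stand-in for real token counts, and in practice you would pass `count_tokens=lambda t: len(tokenizer(t)["input_ids"])`:

```python
def truncation_report(texts, max_length=512, count_tokens=lambda t: len(t.split())):
    """Report how many formatted examples would be truncated at max_length tokens."""
    lengths = [count_tokens(t) for t in texts]
    truncated = sum(1 for n in lengths if n > max_length)
    return {
        "examples": len(lengths),
        "truncated": truncated,
        "max_tokens": max(lengths) if lengths else 0,
    }
```

If more than a few percent of examples are truncated, raise `max_length` or shorten the template rather than training on clipped targets.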
Training
```python
from transformers import Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./llama-2-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",
    warmup_ratio=0.05,
    lr_scheduler_type="cosine"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Start training
trainer.train()

# Save model (for a PEFT model this saves the LoRA adapter weights)
trainer.save_model("./final-model")
```
Advanced Techniques
QLoRA - Even More Efficient
Quantized LoRA for training on consumer GPUs:
```python
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
```
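Rough arithmetic shows why 4-bit loading matters: the weights of a 7B model drop from roughly 14 GB in fp16 to about 3.5 GB at 4 bits, leaving headroom for LoRA gradients and optimizer state on a 24 GB consumer card. A back-of-envelope estimator (these figures cover weights only and ignore activations, KV cache, and CUDA overhead):

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory footprint of model weights alone, in GB."""
    return n_params * bits_per_param / 8 / 1e9

seven_b = 7e9
fp16_gb = weight_memory_gb(seven_b, 16)     # ~14.0 GB
four_bit_gb = weight_memory_gb(seven_b, 4)  # ~3.5 GB
```

The same estimate explains why full fine-tuning of a 7B model is out of reach for most single-GPU setups once you add fp32 optimizer state on top of the weights.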
Multi-Task Fine-Tuning
Train on multiple tasks simultaneously:
```python
from datasets import concatenate_datasets

# Mix different task types in your dataset
tasks = {
    "summarization": summarization_data,
    "qa": question_answering_data,
    "classification": classification_data
}

# Add task prefixes so the model can tell tasks apart
def add_task_prefix(example, task_name):
    example["text"] = f"[{task_name.upper()}] " + example["text"]
    return example

combined_dataset = concatenate_datasets([
    task_data.map(lambda x: add_task_prefix(x, task_name))
    for task_name, task_data in tasks.items()
])
```
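One pitfall with naive concatenation is that a large task dataset can drown out the smaller ones. A simple remedy is to sample an equal number of examples per task before mixing. The sketch below works on plain lists of strings rather than the `datasets` API, purely to keep the idea self-contained:

```python
import random

def balanced_mix(task_examples, n_total, seed=0):
    """Sample an equal share of examples per task and shuffle the result."""
    rng = random.Random(seed)
    per_task = n_total // len(task_examples)
    mixed = []
    for name, examples in task_examples.items():
        chosen = rng.sample(examples, min(per_task, len(examples)))
        mixed.extend(f"[{name.upper()}] {e}" for e in chosen)
    rng.shuffle(mixed)
    return mixed
```

Whether equal shares are right depends on how much you care about each task; weighting by task importance is an equally valid choice.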
Evaluation and Iteration
Validation During Training
```python
training_args = TrainingArguments(
    # ... other args
    evaluation_strategy="epoch",  # Must match save_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)

trainer = Trainer(
    # ... other params
    eval_dataset=tokenized_dataset["validation"]
)
```

Note that `eval_steps` only applies with `evaluation_strategy="steps"`; with epoch-based evaluation it is ignored.
Custom Metrics
```python
from sklearn.metrics import accuracy_score
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    # Flatten and filter padding tokens
    predictions = predictions.flatten()
    labels = labels.flatten()
    # Remove positions the loss ignores (label -100)
    mask = labels != -100
    predictions = predictions[mask]
    labels = labels[mask]
    return {
        "accuracy": accuracy_score(labels, predictions)
    }

trainer = Trainer(
    # ... other params
    compute_metrics=compute_metrics
)
```
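The -100 masking logic is easy to verify on toy arrays. Here the logits are faked so the expected accuracy is known in advance: one correct token, one wrong token, two padding positions that must be excluded:

```python
import numpy as np

labels = np.array([[5, 7, -100, -100]])  # Two real tokens, two padding positions
logits = np.zeros((1, 4, 10))            # (batch, seq_len, vocab_size)
logits[0, 0, 5] = 1.0                    # Correct prediction at position 0
logits[0, 1, 3] = 1.0                    # Wrong prediction at position 1 (label is 7)

predictions = np.argmax(logits, axis=-1).flatten()
flat_labels = labels.flatten()
mask = flat_labels != -100
accuracy = float(np.mean(predictions[mask] == flat_labels[mask]))  # 0.5
```

Without the mask, the two padding positions would count as incorrect and silently deflate the metric.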
Deployment
Load and Use Fine-Tuned Model
```python
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, "./final-model")

# Merge and save (optional - for inference optimization)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")

# Inference
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
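A common failure at this stage is sending the model a bare question when it was trained on the instruction template from the preprocessing step; inference prompts should be wrapped the same way, ending right after `### Response:` so the model fills in the answer. A small helper mirroring that template (a hypothetical convenience, not part of any library):

```python
def build_prompt(instruction, input_text=""):
    """Build an inference prompt matching the instruction-tuning template."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```

Pass the result to `generate_response` and strip the prompt prefix from the decoded output to recover just the model's answer.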
Serve with FastAPI
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/generate")
async def generate(request: GenerateRequest):
    response = generate_response(request.prompt)
    return {"response": response}
```
Best Practices
- Start Small: Begin with GPT-3.5 fine-tuning or 7B parameter models
- Quality Over Quantity: 100 high-quality examples beat 1000 mediocre ones
- Monitor Overfitting: Use validation sets and early stopping
- Iterate Quickly: Start with LoRA for fast experimentation
- Version Control: Track your training data and hyperparameters
- Cost Management: Use QLoRA for open-source, monitor OpenAI costs
Conclusion
Fine-tuning LLMs is a powerful technique when applied correctly. Choose the right approach for your use case:
- OpenAI Fine-Tuning: Fast, easy, great for most use cases
- LoRA/QLoRA: For custom control and cost optimization
- Full Fine-Tuning: When you need maximum performance and have resources
The key is high-quality data and thoughtful evaluation. Start small, measure carefully, and iterate based on real-world performance.
Happy fine-tuning! 🎯