Fine-tuning is expensive, slow, and often unnecessary.
That’s not an opinion — it’s what you discover after spending a week preparing a training dataset, running a fine-tuning job, evaluating the results, and realising a well-crafted 5-shot prompt would have got you 90% of the way there in two hours.
The good news: the decision is mostly mechanical. Ask three questions. The answers tell you where to invest.
The 3 questions before you fine-tune
Question 1: Have you tried prompt engineering with at least 5 examples?
Few-shot prompting — giving the model 3 to 10 examples of the input-output pattern you want — dramatically narrows model behaviour. If you haven’t done this, you’re not ready to fine-tune. Full stop.
Most tasks where “the model gets it wrong” are actually tasks where the prompt doesn’t show the model what “right” looks like in your specific domain.
Question 2: Is the gap a knowledge gap or a behaviour gap?
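A knowledge gap means the model doesn’t know facts it needs: internal product names, recent events, proprietary data. Fine-tuning doesn’t reliably solve this. Retrieval-augmented generation (RAG), which fetches the relevant documents at query time and places them in the prompt, does.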
A behaviour gap means the model knows the facts but formats the output wrong, uses the wrong tone, or fails to follow a consistent structure you need. That’s a format and pattern problem — something few-shot prompting addresses directly.
Fine-tuning genuinely wins when you have a style or task pattern that’s too complex to demonstrate in a context window — or when you need to consistently reproduce a very specific output schema across thousands of calls.
Question 3: What are the real costs?
| | Few-shot prompting | Fine-tuning |
|---|---|---|
| Setup time | Hours | Days to weeks |
| Data required | 3–10 examples | 100–10,000+ examples |
| Cost per call | Higher token count (examples sent every call) | Lower token count (shorter prompt, often a smaller model) |
| Latency | Depends on context size | Usually faster (shorter prompt, often a smaller model) |
| Update cycle | Edit the prompt | Retrain, re-evaluate, redeploy |
If your production volume is under 10,000 calls per day, the token cost difference between a few-shot prompt and a fine-tuned model is usually smaller than the engineering time to fine-tune. Run the numbers on your actual usage before committing.
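One way to run those numbers is a small break-even calculation. The sketch below is illustrative only: the function names, token counts, price, and fine-tuning cost are all hypothetical placeholders, and it assumes both approaches are billed at the same per-token price, which will not hold if the fine-tuned model is smaller and cheaper.

```python
def daily_token_cost(calls_per_day, tokens_per_call, price_per_million_tokens):
    """Daily spend for one approach, in the same currency as the price."""
    return calls_per_day * tokens_per_call * price_per_million_tokens / 1_000_000


def break_even_days(calls_per_day, few_shot_tokens, fine_tuned_tokens,
                    price_per_million_tokens, fine_tuning_cost):
    """Days of traffic needed before token savings repay the fine-tuning cost.

    Returns None when the fine-tuned prompt saves nothing per call.
    """
    few_shot = daily_token_cost(calls_per_day, few_shot_tokens, price_per_million_tokens)
    fine_tuned = daily_token_cost(calls_per_day, fine_tuned_tokens, price_per_million_tokens)
    daily_saving = few_shot - fine_tuned
    if daily_saving <= 0:
        return None
    return fine_tuning_cost / daily_saving


# Hypothetical inputs: 10,000 calls/day, a 600-token few-shot prompt versus a
# 100-token minimal prompt, 0.50 per million tokens, and 2,000 of combined
# engineering and compute cost for the fine-tune.
days = break_even_days(10_000, 600, 100, 0.50, 2_000)
print(f"Break-even after {days:.0f} days")  # Break-even after 800 days
```

With these made-up figures the fine-tune takes over two years to pay for itself on token savings alone, which is the kind of result that should send you back to the prompt.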
Few-shot prompting patterns that actually work
The structure that consistently produces clean, reproducible outputs:
You are [specific role]. You [precise task description].
Format: [exact output format]
Examples:
---
Input: [example 1 input]
Output: [example 1 output]
---
Input: [example 2 input]
Output: [example 2 output]
---
[3–5 more examples]
---
Input: {user_input}
Output:
Three things that make few-shot prompts work:
- Consistency in examples — every example must follow the same format, including edge cases. Mixed formatting in examples produces mixed output.
- Cover the failure cases — if the model keeps producing output X when you want Y, add an example that shows the X-input → Y-output mapping explicitly.
- End the prompt with `Output:` and nothing else. This cues the model to continue the pattern, not explain it.
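The template above can be assembled programmatically so that every example is guaranteed to share one format. This is a minimal sketch; `build_few_shot_prompt` and its parameters are hypothetical names chosen for illustration, not part of any library.

```python
def build_few_shot_prompt(role, task, output_format, examples, user_input):
    """Assemble a few-shot prompt that ends with a bare 'Output:' cue."""
    lines = [
        f"You are {role}. You {task}.",
        f"Format: {output_format}",
        "Examples:",
    ]
    # Every example goes through the same formatting path, so the
    # examples cannot drift apart in layout.
    for example_input, example_output in examples:
        lines += ["---", f"Input: {example_input}", f"Output: {example_output}"]
    # The final block repeats the pattern but leaves the output empty.
    lines += ["---", f"Input: {user_input}", "Output:"]
    return "\n".join(lines)


examples = [
    ("The item broke after one use.", "complaint"),
    ("What payment methods do you accept?", "inquiry"),
    ("Amazing quality, will buy again!", "praise"),
]
prompt = build_few_shot_prompt(
    role="a support triage assistant",
    task="classify customer messages",
    output_format="one lowercase word",
    examples=examples,
    user_input="My order hasn't arrived in 3 weeks.",
)
print(prompt)
```

Keeping the examples in a list also makes it cheap to add a new one whenever you find a failure case.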
When fine-tuning wins: LoRA basics
If prompting genuinely isn’t enough, LoRA (Low-Rank Adaptation) is the practical way to fine-tune today. Instead of updating all model weights, LoRA injects small trainable low-rank adapter matrices into selected layers, most commonly the attention projections. The original model weights stay frozen; only the adapters train.
Why this matters for you:
- A LoRA adapter for a 7B model trains on a single A10 GPU in hours, not days
- The adapter file is small (often under 100MB) and can be hot-swapped
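- On AWS, a single `ml.g4dn.xlarge` SageMaker training job (roughly ₹15–20 per hour; check current pricing for your region) can fine-tune most 7B models with LoRA in under 4 hours, provided the base model is loaded in 4-bit (QLoRA) so it fits in the instance’s 16 GB T4 GPU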
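A quick back-of-the-envelope count shows why the adapter stays so small. The sketch below assumes a hypothetical configuration (hidden size 4096, 32 layers, two square attention projections adapted per layer, rank 8, 16-bit storage); real models and LoRA settings will differ, so treat the result as an order of magnitude only.

```python
def lora_adapter_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns the weight update as B @ A, where B has shape (d_out, rank)
    and A has shape (rank, d_in), so it adds rank * (d_out + d_in) parameters.
    """
    return rank * (d_out + d_in)


# Assumed, illustrative configuration (not taken from any specific model card).
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2
RANK = 8
BYTES_PER_PARAM = 2  # 16-bit weights

per_matrix = lora_adapter_params(HIDDEN, HIDDEN, RANK)
total_params = per_matrix * ADAPTED_MATRICES_PER_LAYER * LAYERS
size_mib = total_params * BYTES_PER_PARAM / 2**20
share_of_full_matrix = per_matrix / (HIDDEN * HIDDEN)

print(f"Adapter parameters: {total_params:,}")          # 4,194,304
print(f"Adapter size: {size_mib:.1f} MiB")              # 8.0 MiB
print(f"Share of one full matrix: {share_of_full_matrix:.2%}")  # 0.39%
```

Under these assumptions each adapter trains well under 1% of the parameters in the matrix it modifies, which is why training fits on modest hardware and the resulting file is easy to swap.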
The Python ecosystem for this is peft (Parameter-Efficient Fine-Tuning from Hugging Face), which wraps any compatible model with LoRA adapters in a few lines.
End project: benchmark few-shot vs LoRA on the same task
This script runs a classification or extraction task twice — once with 5-shot prompting, once with a LoRA-adapted model — and prints a comparison table.
#!/usr/bin/env python3
"""
Benchmark: 5-shot prompting vs LoRA-adapted model
Requirements: pip install transformers peft accelerate torch
(accelerate is needed because the models are loaded with device_map)
"""
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
import torch
# --- Config ---
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
LORA_ADAPTER_PATH = "./lora-adapter" # path to your trained LoRA adapter
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Test dataset: (input, expected_output) pairs
TEST_CASES = [
    ("The shipment arrived damaged.", "complaint"),
    ("Can you tell me your business hours?", "inquiry"),
    ("I'd like to cancel my subscription.", "cancellation"),
    ("This is the best product I've ever used!", "praise"),
    ("My order hasn't arrived in 3 weeks.", "complaint"),
]
# Few-shot prompt template
FEW_SHOT_PROMPT = """Classify the customer message into one of: complaint, inquiry, praise, cancellation.
Examples:
---
Message: "The item broke after one use."
Category: complaint
---
Message: "What payment methods do you accept?"
Category: inquiry
---
Message: "Amazing quality, will buy again!"
Category: praise
---
Message: "Please cancel my order #12345."
Category: cancellation
---
Message: "I never received my package."
Category: complaint
---
Message: "{input}"
Category:"""
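def run_few_shot(test_cases, model_id, device):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # float16 is only reliable on GPU; fall back to float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, device_map=device
    )
    # return_full_text=False returns only the completion, not the echoed prompt;
    # do_sample=False makes decoding greedy so runs are repeatable.
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_new_tokens=10, do_sample=False, return_full_text=False,
    )
    results = []
    for text, expected in test_cases:
        prompt = FEW_SHOT_PROMPT.format(input=text)
        start = time.perf_counter()
        output = pipe(prompt)[0]["generated_text"]
        latency = time.perf_counter() - start
        # Take the first word of the completion, stripped of punctuation;
        # an empty completion becomes "" instead of raising IndexError.
        words = output.split()
        predicted = words[0].strip(".,!\"'").lower() if words else ""
        results.append({
            "input": text[:40],
            "expected": expected,
            "predicted": predicted,
            "correct": predicted == expected,
            "latency_ms": round(latency * 1000),
            "method": "5-shot"
        })
    return results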
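def run_lora(test_cases, base_model_id, adapter_path, device):
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    # float16 is only reliable on GPU; fall back to float32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=dtype, device_map=device
    )
    # Attach the trained LoRA adapter on top of the frozen base model.
    model = PeftModel.from_pretrained(base, adapter_path)
    # return_full_text=False returns only the completion, not the echoed prompt;
    # do_sample=False makes decoding greedy so runs are repeatable.
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_new_tokens=10, do_sample=False, return_full_text=False,
    )
    results = []
    for text, expected in test_cases:
        # The LoRA model uses a minimal prompt with no examples; it should
        # match the prompt format the adapter was trained on.
        prompt = f"Classify this customer message: \"{text}\"\nCategory:"
        start = time.perf_counter()
        output = pipe(prompt)[0]["generated_text"]
        latency = time.perf_counter() - start
        # Take the first word of the completion, stripped of punctuation;
        # an empty completion becomes "" instead of raising IndexError.
        words = output.split()
        predicted = words[0].strip(".,!\"'").lower() if words else ""
        results.append({
            "input": text[:40],
            "expected": expected,
            "predicted": predicted,
            "correct": predicted == expected,
            "latency_ms": round(latency * 1000),
            "method": "LoRA"
        })
    return results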
def print_table(results):
    print(f"\n{'Method':<8} {'Input':<42} {'Expected':<14} {'Predicted':<14} {'Correct':<8} {'Latency'}")
    print("-" * 100)
    for r in results:
        tick = "YES" if r["correct"] else "NO"
        print(f"{r['method']:<8} {r['input']:<42} {r['expected']:<14} {r['predicted']:<14} {tick:<8} {r['latency_ms']}ms")
    for method in ["5-shot", "LoRA"]:
        subset = [r for r in results if r["method"] == method]
        accuracy = sum(r["correct"] for r in subset) / len(subset) * 100
        avg_latency = sum(r["latency_ms"] for r in subset) / len(subset)
        print(f"\n{method}: accuracy={accuracy:.0f}% avg_latency={avg_latency:.0f}ms")
if __name__ == "__main__":
    print("Running 5-shot prompting benchmark...")
    few_shot_results = run_few_shot(TEST_CASES, BASE_MODEL, DEVICE)
    print("Running LoRA benchmark...")
    lora_results = run_lora(TEST_CASES, BASE_MODEL, LORA_ADAPTER_PATH, DEVICE)
    print_table(few_shot_results + lora_results)
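To train the LoRA adapter before running this benchmark, see the SFTTrainer examples in Hugging Face’s trl library, which accepts a peft LoraConfig, and start with 200–500 labelled examples written in the same minimal prompt format the benchmark uses. The point of the script isn’t to show that one approach always wins; it’s to make the tradeoff concrete and measurable on your actual task.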
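If the accuracy numbers are within 5 percentage points, ship the prompt. If LoRA wins by 15+ points on a task you run 100,000 times a day, the training investment pays off. Measure on a few hundred held-out examples first, though: with only the five test cases in the script, each one is worth 20 percentage points, so a 5-point gap can’t be resolved. Then you have data to make that call.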
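The rule of thumb above can be written down as a small helper so the decision is applied the same way every time. This is a sketch only: `recommend` is a hypothetical function, and the middle branch (a gap between the two thresholds, or a large gap at low volume) is an assumption added here, since the rule itself leaves that zone open.

```python
def recommend(few_shot_accuracy, lora_accuracy, calls_per_day):
    """Turn benchmark accuracies (in percent) and volume into a recommendation."""
    gap = lora_accuracy - few_shot_accuracy
    # Within 5 points, or few-shot ahead: the prompt is cheaper to ship and maintain.
    if gap <= 5:
        return "ship the prompt"
    # A large win at high volume justifies the training investment.
    if gap >= 15 and calls_per_day >= 100_000:
        return "fine-tune with LoRA"
    # Assumed fallback for everything in between.
    return "judgement call: weigh training cost against the accuracy gain"


print(recommend(88, 91, 250_000))  # ship the prompt
print(recommend(70, 90, 250_000))  # fine-tune with LoRA
print(recommend(70, 90, 2_000))    # judgement call: weigh training cost against the accuracy gain
```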