Fine-Tuning vs Prompt Engineering: How to Pick the Right Lever

Fine-tuning is expensive, slow, and often unnecessary.

That’s not an opinion — it’s what you discover after spending a week preparing a training dataset, running a fine-tuning job, evaluating the results, and realising a well-crafted 5-shot prompt would have got you 90% of the way there in two hours.

The good news: the decision is mostly mechanical. Ask three questions. The answers tell you where to invest.

The 3 questions before you fine-tune

Question 1: Have you tried prompt engineering with at least 5 examples?

Few-shot prompting — giving the model 3 to 10 examples of the input-output pattern you want — dramatically narrows model behaviour. If you haven’t done this, you’re not ready to fine-tune. Full stop.

Most tasks where “the model gets it wrong” are actually tasks where the prompt doesn’t show the model what “right” looks like in your specific domain.

Question 2: Is the gap a knowledge gap or a behaviour gap?

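A knowledge gap means the model doesn’t know facts it needs: internal product names, recent events, proprietary data. Fine-tuning doesn’t reliably solve this. Retrieval-augmented generation (RAG), which fetches the relevant documents at query time and places them in the prompt, is the better fit.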

A behaviour gap means the model knows the facts but formats the output wrong, uses the wrong tone, or fails to follow a consistent structure you need. That’s a format and pattern problem — something few-shot prompting addresses directly.

Fine-tuning genuinely wins when you have a style or task pattern that’s too complex to demonstrate in a context window — or when you need to consistently reproduce a very specific output schema across thousands of calls.
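Questions 1 and 2 amount to a small routing rule. Here is a minimal sketch of it; the names `Gap` and `pick_lever` are hypothetical and exist only for illustration, and the boolean inputs are judgements you make yourself, not something the code can measure.

```python
from enum import Enum


class Gap(Enum):
    KNOWLEDGE = "knowledge"   # model lacks facts (product names, recent events, proprietary data)
    BEHAVIOUR = "behaviour"   # model knows the facts but gets format, tone, or structure wrong


def pick_lever(tried_few_shot: bool, gap: Gap, fits_in_context: bool = True) -> str:
    """Route a problem to a lever using Questions 1 and 2 (illustrative only)."""
    if not tried_few_shot:
        return "few-shot prompting"   # Question 1: try at least 5 examples first
    if gap is Gap.KNOWLEDGE:
        return "RAG"                  # Question 2: missing facts, not missing skill
    if fits_in_context:
        return "few-shot prompting"   # behaviour gap that examples can demonstrate
    return "fine-tuning"              # pattern too complex to show in a context window


print(pick_lever(tried_few_shot=True, gap=Gap.BEHAVIOUR, fits_in_context=False))
```

Treat the return value as a starting point for Question 3, not a verdict.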

Question 3: What are the real costs?

	Few-shot prompting	Fine-tuning
Setup time	Hours	Days to weeks
Data required	3–10 examples	100–10,000+ examples
Cost per call	Higher token count (examples sent on every call)	Lower token count (no examples in the prompt; a smaller model may suffice)
Latency	Depends on context size	Usually faster (shorter prompt, possibly a smaller model)
Update cycle	Edit the prompt	Retrain, re-evaluate, redeploy

If your production volume is under 10,000 calls per day, the token cost difference between a few-shot prompt and a fine-tuned model is usually smaller than the engineering time to fine-tune. Run the numbers on your actual usage before committing.
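One way to run those numbers is a simple break-even calculation. The sketch below is a rough model, not a pricing tool: the function name `breakeven_days` is hypothetical, every figure is a made-up placeholder, and it ignores recurring costs such as hosting a fine-tuned model or retraining it.

```python
def breakeven_days(
    calls_per_day: int,
    extra_prompt_tokens: int,
    price_per_1k_tokens: float,
    fine_tune_cost: float,
) -> float:
    """Days until the token savings of a fine-tuned model repay its one-off cost."""
    daily_saving = calls_per_day * extra_prompt_tokens / 1000 * price_per_1k_tokens
    return fine_tune_cost / daily_saving


# All figures below are placeholders; substitute your own.
days = breakeven_days(
    calls_per_day=10_000,
    extra_prompt_tokens=400,     # tokens the few-shot examples add to every call
    price_per_1k_tokens=0.001,   # input-token price, in your currency
    fine_tune_cost=2_000.0,      # engineering time plus training compute, same currency
)
print(f"Break-even after {days:.0f} days")
```

With these placeholder inputs the saving is 4 currency units per day, so the fine-tune takes about 500 days to pay for itself. Change the volume or the prompt length and the answer moves quickly.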

Few-shot prompting patterns that actually work

The structure that consistently produces clean, reproducible outputs:

You are [specific role]. You [precise task description].

Format: [exact output format]

Examples:
---
Input: [example 1 input]
Output: [example 1 output]
---
Input: [example 2 input]
Output: [example 2 output]
---
[3–5 more examples]
---
Input: {user_input}
Output:

Three things that make few-shot prompts work:

Consistency in examples — every example must follow the same format, including edge cases. Mixed formatting in examples produces mixed output.
Cover the failure cases — if the model keeps producing output X when you want Y, add an example that shows the X-input → Y-output mapping explicitly.
End the prompt with Output: and nothing else — this cues the model to continue the pattern, not explain it.
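The template and the three rules above can be enforced in code so that formatting never drifts between examples. This is a minimal sketch; `build_few_shot_prompt` and its parameters are hypothetical names, and the example messages are illustrative.

```python
def build_few_shot_prompt(role, task, output_format, examples, user_input):
    """Assemble the template above; `examples` is a list of (input, output) pairs."""
    lines = [f"You are {role}. You {task}.", "", f"Format: {output_format}", "", "Examples:"]
    # One loop formats every example, so they all share exactly the same layout.
    for example_input, example_output in examples:
        lines += ["---", f"Input: {example_input}", f"Output: {example_output}"]
    # Final slot: same shape as the examples, with the output left for the model.
    lines += ["---", f"Input: {user_input}", "Output:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    role="a support-ticket classifier",
    task="label each customer message with one category",
    output_format="one lowercase word",
    examples=[
        ("The item broke after one use.", "complaint"),
        ("What payment methods do you accept?", "inquiry"),
    ],
    user_input="My order hasn't arrived in 3 weeks.",
)
print(prompt)
```

To cover a failure case, append the problematic input with its correct output to `examples`; the builder keeps it consistent with the rest.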

When fine-tuning wins: LoRA basics

If prompting genuinely isn’t enough, LoRA (Low-Rank Adaptation) is the practical way to fine-tune today. Instead of updating all model weights, LoRA adds pairs of small trainable low-rank matrices alongside selected layers, typically the attention projections. The original model weights stay frozen; only the adapters train.

Why this matters for you:

A LoRA adapter for a 7B model trains on a single A10 GPU in hours, not days
The adapter file is small (often under 100MB) and can be hot-swapped
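On AWS, a single ml.g4dn.xlarge SageMaker training job (roughly ₹15–20 per hour; check current pricing for your region) can fine-tune many 7B models with LoRA in under 4 hours, provided the base model is loaded in quantized form (for example 4-bit), because the instance’s single T4 GPU has 16 GB of memory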
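The adapter stays small because LoRA trains two thin matrices per adapted layer and leaves the full weight matrix alone. The arithmetic below is a sketch under assumed shapes (a 4096 x 4096 projection at rank 8); `lora_param_count` is a hypothetical helper, and the rank is a hyperparameter you choose.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, B (d_out x rank) and A (rank x d_in),
    and leaves the original matrix frozen.
    """
    return rank * (d_out + d_in)


# Assumed, illustrative shapes: a 4096 x 4096 projection adapted at rank 8.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, rank=8)
print(f"full matrix: {full:,} params, LoRA factors: {adapter:,} params "
      f"({adapter / full:.2%} of the original)")
```

Under these assumptions the adapter adds 65,536 trainable parameters next to a frozen matrix of about 16.8 million, which is why adapter files are measured in megabytes.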

The Python ecosystem for this is peft (Parameter-Efficient Fine-Tuning from Hugging Face), which wraps any compatible model with LoRA adapters in a few lines.
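As a rough sketch of what "a few lines" looks like: the snippet uses peft's `LoraConfig` and `get_peft_model`. The setting values and the `wrap_with_lora` helper are assumptions for illustration, and the `target_modules` names depend on the model architecture.

```python
# Hypothetical starting values; tune the rank and target modules for your task.
LORA_SETTINGS = {
    "r": 8,                                  # rank of the adapter matrices
    "lora_alpha": 16,                        # scaling factor applied to the adapter update
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # attention projections; names vary by architecture
    "task_type": "CAUSAL_LM",
}


def wrap_with_lora(base_model, settings=LORA_SETTINGS):
    """Return `base_model` wrapped with trainable LoRA adapters (base weights stay frozen)."""
    # Imported here so the settings can be inspected without peft installed.
    from peft import LoraConfig, get_peft_model

    peft_model = get_peft_model(base_model, LoraConfig(**settings))
    peft_model.print_trainable_parameters()  # prints trainable vs total parameter counts
    return peft_model
```

After training, calling `save_pretrained("./lora-adapter")` on the wrapped model writes only the adapter weights, which is the kind of directory `PeftModel.from_pretrained` loads in the benchmark script.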

End project: benchmark few-shot vs LoRA on the same task

This script runs a classification or extraction task twice — once with 5-shot prompting, once with a LoRA-adapted model — and prints a comparison table.

#!/usr/bin/env python3
"""
Benchmark: 5-shot prompting vs LoRA-adapted model
Requirements: pip install transformers peft datasets torch accelerate
(accelerate is needed because the models are loaded with device_map)
"""

import time
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
import torch

# --- Config ---
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
LORA_ADAPTER_PATH = "./lora-adapter"   # path to your trained LoRA adapter
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Test dataset: (input, expected_output) pairs
TEST_CASES = [
    ("The shipment arrived damaged.", "complaint"),
    ("Can you tell me your business hours?", "inquiry"),
    ("I'd like to cancel my subscription.", "cancellation"),
    ("This is the best product I've ever used!", "praise"),
    ("My order hasn't arrived in 3 weeks.", "complaint"),
]

# Few-shot prompt template
FEW_SHOT_PROMPT = """Classify the customer message into one of: complaint, inquiry, praise, cancellation.

Examples:
---
Message: "The item broke after one use."
Category: complaint
---
Message: "What payment methods do you accept?"
Category: inquiry
---
Message: "Amazing quality, will buy again!"
Category: praise
---
Message: "Please cancel my order #12345."
Category: cancellation
---
Message: "I never received my package."
Category: complaint
---
Message: "{input}"
Category:"""


def run_few_shot(test_cases, model_id, device):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map=device
    )
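    # Greedy decoding keeps runs reproducible; return_full_text=False drops the echoed prompt
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_new_tokens=10, do_sample=False, return_full_text=False,
    )

    results = []
    for text, expected in test_cases:
        prompt = FEW_SHOT_PROMPT.format(input=text)
        start = time.perf_counter()
        output = pipe(prompt)[0]["generated_text"]
        latency = time.perf_counter() - start
        # First word of the completion, without punctuation; "" if nothing was generated
        words = output.split()
        predicted = words[0].lower().strip(".,!\"'") if words else ""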
        results.append({
            "input": text[:40],
            "expected": expected,
            "predicted": predicted,
            "correct": predicted == expected,
            "latency_ms": round(latency * 1000),
            "method": "5-shot"
        })
    return results


def run_lora(test_cases, base_model_id, adapter_path, device):
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.float16, device_map=device
    )
    model = PeftModel.from_pretrained(base, adapter_path)
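    # Greedy decoding keeps runs reproducible; return_full_text=False drops the echoed prompt
    pipe = pipeline(
        "text-generation", model=model, tokenizer=tokenizer,
        max_new_tokens=10, do_sample=False, return_full_text=False,
    )

    results = []
    for text, expected in test_cases:
        # LoRA model uses a minimal prompt (no examples); match the wording the adapter was trained on
        prompt = f"Classify this customer message: \"{text}\"\nCategory:"
        start = time.perf_counter()
        output = pipe(prompt)[0]["generated_text"]
        latency = time.perf_counter() - start
        # First word of the completion, without punctuation; "" if nothing was generated
        words = output.split()
        predicted = words[0].lower().strip(".,!\"'") if words else ""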
        results.append({
            "input": text[:40],
            "expected": expected,
            "predicted": predicted,
            "correct": predicted == expected,
            "latency_ms": round(latency * 1000),
            "method": "LoRA"
        })
    return results


def print_table(results):
    print(f"\n{'Method':<8} {'Input':<42} {'Expected':<14} {'Predicted':<14} {'Correct':<8} {'Latency'}")
    print("-" * 100)
    for r in results:
        tick = "YES" if r["correct"] else "NO"
        print(f"{r['method']:<8} {r['input']:<42} {r['expected']:<14} {r['predicted']:<14} {tick:<8} {r['latency_ms']}ms")

    for method in ["5-shot", "LoRA"]:
        subset = [r for r in results if r["method"] == method]
        accuracy = sum(r["correct"] for r in subset) / len(subset) * 100
        avg_latency = sum(r["latency_ms"] for r in subset) / len(subset)
        print(f"\n{method}: accuracy={accuracy:.0f}%  avg_latency={avg_latency:.0f}ms")


if __name__ == "__main__":
    print("Running 5-shot prompting benchmark...")
    few_shot_results = run_few_shot(TEST_CASES, BASE_MODEL, DEVICE)

    print("Running LoRA benchmark...")
    lora_results = run_lora(TEST_CASES, BASE_MODEL, LORA_ADAPTER_PATH, DEVICE)

    print_table(few_shot_results + lora_results)

To train the LoRA adapter before running this benchmark, see the SFTTrainer examples in Hugging Face’s trl library, which accepts a peft LoraConfig, and start with 200–500 labelled examples. The point of the script isn’t to show one always wins; it’s to make the tradeoff concrete and measurable on your actual task.

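Before reading the numbers, grow the test set: with only the five test cases in the script, each case moves accuracy by 20 points, so a few hundred held-out examples are needed for the thresholds below to mean anything. If the accuracy numbers are within 5 percentage points, ship the prompt. If LoRA wins by 15+ points on a task you run 100,000 times a day, the training investment pays off. Now you have data to make that call.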
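The rule of thumb above fits in a few lines. In this sketch `recommend` is a hypothetical helper; the 5-point, 15-point, and 100,000-call thresholds come from the text, while the middle "judgement call" branch is an assumption added to cover the cases the rule leaves open.

```python
def recommend(few_shot_acc: float, lora_acc: float, calls_per_day: int) -> str:
    """Apply the rule of thumb above to accuracies given in percent (0-100)."""
    gap = lora_acc - few_shot_acc
    if gap <= 5:
        return "ship the prompt"
    if gap >= 15 and calls_per_day >= 100_000:
        return "invest in LoRA"
    # Assumption: anything in between needs more evidence before committing.
    return "judgement call: gather more test cases and rerun the cost numbers"


print(recommend(few_shot_acc=88.0, lora_acc=91.0, calls_per_day=20_000))
```

Feed it the accuracy figures printed by the benchmark and your real daily call volume.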
