Fine-tuning sounds like the answer. You've got a model that's good at general things, and you want it to be good at your specific thing. Train it on your data, make it yours. Simple.
Except we've come close to doing it several times and never actually needed to. The combination of good system prompts and well-structured RAG has covered every use case we've thrown at it. The model doesn't need to be retrained to know your stuff. It needs to be given your stuff at the right moment.
That's not to say fine-tuning is useless. It isn't. But it solves a different problem than most people think.
Fine-tuning changes how a model behaves, not what it knows. The distinction matters.
If you want the model to know your product catalogue, your policies, your specific facts, that's retrieval. RAG. Give it the information when it needs it. Fine-tuning won't help because training data gets baked in at a point in time and goes stale the moment something changes.
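Concretely, that looks something like the sketch below. The "retrieval" is a deliberately naive keyword match so the example stays self-contained, and `call_model` is a hypothetical stand-in for whichever LLM API you use, so treat it as the shape of the idea rather than a reference implementation.

```python
# Toy end-to-end sketch: the model is handed your facts at query time
# rather than being retrained on them. The retrieval is a naive keyword
# match for illustration; call_model is a hypothetical stand-in for
# whichever chat-completion API you actually use.
DOCS = {
    "returns": "Items can be returned within 30 days with proof of purchase.",
    "shipping": "Standard shipping takes 3-5 working days within the UK.",
}

def retrieve(query: str) -> list[str]:
    """Return snippets whose key appears in the query (toy retrieval)."""
    return [text for key, text in DOCS.items() if key in query.lower()]

def call_model(system: str, user: str) -> str:
    """Hypothetical wrapper around your chat-completion API of choice."""
    raise NotImplementedError

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query)) or "No relevant documents found."
    system = (
        "Answer using only the context below. "
        "If the answer isn't in the context, say so.\n\n" + context
    )
    return call_model(system, query)
```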
If you want the model to behave differently, to adopt a particular tone, to follow a specific format, to reason in a certain way, that's where fine-tuning starts to make sense. You're not teaching it facts. You're teaching it style, structure, approach.
But even then, a well-crafted system prompt often gets you most of the way there. We've seen projects where the client was convinced they needed fine-tuning, and a few hours of prompt engineering gave them what they wanted. Cheaper, faster, easier to iterate.
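For illustration, the kind of behavioural brief people assume needs fine-tuning often fits in a system prompt. The company name and rules below are invented; the point is that tone and format constraints live comfortably in plain instructions.

```python
# Illustrative only: the company and rules are made up. Tone and format
# constraints like these are prompt territory before they're fine-tuning
# territory.
SYSTEM_PROMPT = """\
You are the support assistant for Acme Ltd.
Tone: plain English, no jargon, no exclamation marks.
Format: at most three short paragraphs, then a "Next steps" bullet list.
If you aren't certain of an answer, say so and suggest contacting support.
"""
```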
When fine-tuning is genuinely the right call, you hit the cold start problem almost immediately.
Fine-tuning needs training data. Good training data. Examples of inputs and the outputs you want. And not just a handful. Hundreds, ideally thousands, of high-quality examples that represent the full range of what you're trying to achieve.
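For a sense of shape, those pairs usually end up as chat-style JSONL, one example per line, which is roughly what most hosted fine-tuning APIs expect (check your provider's docs for the exact format). The content here is invented.

```python
# Rough shape of a fine-tuning dataset, not a spec: one JSON object per
# line, each holding a full example conversation. Content is invented.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme's support assistant."},
            {"role": "user", "content": "Can I return an opened item?"},
            {"role": "assistant", "content": "Yes, within 30 days with proof of purchase."},
        ]
    },
    # ...hundreds more, covering the full range of inputs you expect
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")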
Where does that data come from?
If you're lucky, you've got logs. Real conversations, real queries, real responses that you can clean up and use. Most people aren't that lucky, or their logs are full of exactly the behaviour they're trying to fix.
If you're not lucky, you need to create the data. Which means labelling. Which means someone, usually a human, painstakingly writing out what the ideal response would be for each input. It's slow, expensive, and boring. And the quality of your fine-tuned model is directly limited by the quality of that labelling.
This is where most fine-tuning projects stall. Not because the technique doesn't work, but because getting good training data is harder than anyone expected.
Here's something we've been testing. It's not production-proven yet, but the theory is sound and the early results are promising.
The cold start problem is essentially a data problem. You need good examples, and you don't have them. But you do have access to large language models that are very good at generating plausible content.
So: use one model to generate the training prompts. Synthetic inputs that cover the range of queries you expect. Then use another model, a bigger, smarter one, to generate ideal responses. Now you've got input-output pairs.
But here's the key bit. You don't just accept those outputs. You use a third pass, another model acting as an evaluator, to score and filter the responses. The ones that meet your criteria go into the training set. The ones that don't get regenerated or discarded.
You end up with a loop. Generate, respond, evaluate, refine. Two models working together to create the training data that a third, smaller model will learn from.
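Sketched out, the loop looks something like this. The three `call_*` functions are hypothetical wrappers around whichever models you pick, and the rubric, score threshold and retry budget are entirely yours to define.

```python
# Sketch of the generate -> respond -> evaluate loop. The three call_*
# functions are hypothetical wrappers around whichever models you choose;
# the rubric, threshold and retry budget are assumptions to tune.
def call_generator(topic: str) -> str:
    """Ask one model for a realistic user query about `topic`."""
    raise NotImplementedError

def call_responder(query: str) -> str:
    """Ask a bigger, stronger model for the ideal response."""
    raise NotImplementedError

def call_evaluator(query: str, response: str) -> float:
    """Ask a third model to score the response 0-10 against your rubric."""
    raise NotImplementedError

def build_training_set(topics: list[str], per_topic: int = 50,
                       threshold: float = 8.0, max_retries: int = 2) -> list[dict]:
    dataset = []
    for topic in topics:
        for _ in range(per_topic):
            query = call_generator(topic)
            for _ in range(max_retries + 1):
                response = call_responder(query)
                if call_evaluator(query, response) >= threshold:
                    dataset.append({"input": query, "output": response})
                    break  # good enough: keep the pair
                # otherwise regenerate; discard if it never passes
    return dataset
```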
It's not magic. The quality still depends on how well you define what good looks like. But it solves the cold start problem in a way that doesn't require thousands of hours of human labelling.
This leads naturally to distillation, which is the bit that's genuinely useful but rarely talked about.
Distillation is taking a big, expensive, slow model and using it to train a smaller, cheaper, faster one. You run your queries through the big model, collect the outputs, and use those as training data for the small model. The small model learns to mimic the big one, at least for your specific use case.
The result is a model that's good enough for your needs, runs at a fraction of the cost, and responds fast enough for real-time applications. You're trading general capability for specific performance. For most production use cases, that's exactly the right trade.
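The data-collection half of that is almost embarrassingly simple. In the sketch below, `call_big_model` is a hypothetical wrapper around your expensive teacher model; training the smaller student on the resulting file is whatever your provider or framework offers.

```python
# Distillation's data-collection step, sketched. call_big_model is a
# hypothetical wrapper around the expensive teacher model; the output
# file becomes the training set for the smaller student model.
import json

def call_big_model(query: str) -> str:
    raise NotImplementedError

def collect_teacher_outputs(queries: list[str], path: str = "distill.jsonl") -> None:
    with open(path, "w") as f:
        for query in queries:
            pair = {"input": query, "output": call_big_model(query)}
            f.write(json.dumps(pair) + "\n")
```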
The technique we described above is essentially distillation with extra steps. You're using large models to generate and validate training data, then using that data to train something smaller and more focused.
So how often do you actually need to fine-tune? Honestly? Rarely.
Start with prompting. If that doesn't work, add RAG. If that doesn't work, refine your prompts and your retrieval. Most projects never need to go further.
Fine-tuning makes sense when you need consistent behavioural changes that prompting can't achieve, when you're operating at scale and the cost or latency of large models matters, or when you've genuinely exhausted the simpler options.
Distillation makes sense when you've got something working with a big model and need to make it production-viable.
The technique we've been testing? We think it could unlock fine-tuning for use cases where the training data doesn't exist yet. But we haven't proven it in the wild.
Anyone want to try it with us?