A Survey of Techniques for Maximizing LLM Performance
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Maximizing LLM performance in production depends less on finding a single “best” technique and more on diagnosing what’s actually failing—context, instruction-following, or both—then applying the right tool in the right order. Optimization is hard because signal gets buried under noise, performance is difficult to measure reliably, and it’s often unclear which change will fix the specific bottleneck. The core takeaway is a two-axis mental model: optimize “context” (what the model needs to know) separately from “LLM behavior” (how it should act). That framing matters because retrieval-augmented generation (RAG) and fine-tuning solve different problems, so treating them as a linear ladder can waste time and money.
The recommended starting point is prompt engineering paired with systematic evaluation. Clear instructions, task decomposition, and giving the model time to think (for example via reasoning-style prompting) help establish a baseline quickly. Few-shot examples often provide the next lift, and the talk emphasizes avoiding “whack-a-mole” iteration by locking in solid evals and measuring changes consistently. Prompting is also described as a poor fit for scaling in new information, reliably replicating complex styles beyond what fits in context, or minimizing token usage—issues that show up once accuracy plateaus.
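The few-shot approach described above can be sketched as a fixed prompt template: lock the instruction and examples into one assembly function so that eval runs compare like with like across iterations. This is a minimal illustration, not from the talk; the task, examples, and function names are invented.

```python
# Toy sketch of few-shot prompt assembly. Keeping the template in one place
# makes prompt changes easy to evaluate systematically instead of ad hoc.

FEW_SHOT_EXAMPLES = [
    {"input": "The delivery was two weeks late.", "label": "negative"},
    {"input": "Support resolved my issue in minutes.", "label": "positive"},
]

def build_prompt(task_instruction: str, examples: list[dict], query: str) -> str:
    """Assemble instruction + few-shot examples + the new query into one prompt."""
    lines = [task_instruction, ""]
    for ex in examples:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Label: {ex['label']}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_prompt(
    "Classify the sentiment of each input as positive or negative.",
    FEW_SHOT_EXAMPLES,
    "The app crashes every time I open it.",
)
```

The same `build_prompt` call then feeds every eval case, so a change to the instruction or examples is measured once, consistently, rather than by eyeballing individual outputs.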
When the gap is missing or insufficient information, RAG becomes the primary lever. RAG is positioned as “short-term memory”: it retrieves domain-specific content at query time, then supplies that content to the model to answer. The talk walks through a typical pipeline—embed documents into a knowledge base, run similarity search for a user question, and combine retrieved text with an instruction prompt. RAG is praised for reducing hallucinations by constraining answers to retrieved content, but it’s also flagged as fragile: if retrieval is wrong, the model has “0% chance” to be correct even when it’s told to use only the provided sources.
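The pipeline described above (embed, similarity search, combine with an instruction prompt) can be sketched end to end with a toy bag-of-words "embedding". A real system would use a learned embedding model and a vector store; every document and function name here is illustrative.

```python
# Minimal RAG sketch: bag-of-words vectors + cosine similarity stand in for a
# real embedding model, purely to show the retrieve-then-prompt shape.
import math
from collections import Counter

DOCS = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our headquarters moved to Austin in 2021.",
    "Premium support is available 24/7 for enterprise customers.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': token counts with punctuation stripped."""
    return Counter(t.strip(".,?!") for t in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Similarity search: rank documents by cosine against the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Combine retrieved text with an instruction prompt."""
    context = "\n".join(retrieve(query, docs))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_rag_prompt("What is the refund policy for returns?", DOCS)
```

Note the failure mode the talk warns about: if `retrieve` surfaces the wrong chunk, the "use ONLY the context" instruction guarantees a wrong answer, which is why retrieval quality needs its own evaluation.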
A detailed success story illustrates how RAG often requires many iterations and multiple components beyond embeddings alone. Starting from 45% accuracy with a two-knowledge-base setup, improvements came from hypothetical document embeddings (for that use case), chunking and embedding experiments, re-ranking (cross-encoder or rules), classification to choose the right knowledge base, and then adding tools like SQL execution for structured queries plus query expansion. The project reached 98% accuracy without fine-tuning because the failures were fundamentally context-selection problems.
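The rule-based re-ranking step mentioned above can be sketched as a second-pass scorer over similarity-search candidates. The rules and weights below are invented for illustration; a real system would tune them (or use a cross-encoder) for its domain.

```python
# Toy rule-based re-ranker: take candidates from similarity search, then
# re-score them with domain rules before the top chunk reaches the prompt.

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    """Re-order retrieved chunks by a rule-based score (higher is better)."""
    q_terms = set(query.lower().split())

    def score(chunk: dict) -> float:
        text = chunk["text"].lower()
        s = sum(1.0 for t in q_terms if t in text)  # exact keyword hits
        if chunk.get("source") == "official_docs":   # trust authoritative sources
            s += 2.0
        if chunk.get("stale"):                       # penalize outdated chunks
            s -= 3.0
        return s

    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"text": "Pricing changed in 2020.", "source": "forum", "stale": True},
    {"text": "Current pricing tiers and limits.", "source": "official_docs"},
]
best = rerank("current pricing", candidates)[0]
```

The point of the second pass is that embedding similarity alone ranked a stale forum post competitively; cheap domain rules fix the ordering without retraining anything.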
To evaluate RAG more rigorously, the talk highlights Ragas (Exploding Gradients), which separates metrics into “faithfulness” and “answer relevancy” on the LLM side, and “context precision” and “context recall” on the retrieval side—helping teams tell whether accuracy is coming from better answers or from better retrieval.
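To make the retrieval-side metrics concrete, here are toy token-overlap proxies for context recall and context precision. Real Ragas computes these with LLM-graded statements; these simplified versions only illustrate what each metric measures, and every function here is an invented stand-in.

```python
# Toy proxies for Ragas-style retrieval metrics.
# Recall: did we retrieve what the ground-truth answer needs?
# Precision: how much of what we retrieved is actually relevant?

def _tokens(text: str) -> set:
    return {t.strip(".,?!").lower() for t in text.split()}

def context_recall(ground_truth: str, contexts: list[str]) -> float:
    """Fraction of ground-truth tokens covered by the retrieved contexts."""
    truth = _tokens(ground_truth)
    retrieved = set().union(*(_tokens(c) for c in contexts)) if contexts else set()
    return len(truth & retrieved) / len(truth) if truth else 0.0

def context_precision(question: str, contexts: list[str]) -> float:
    """Fraction of retrieved chunks that share any token with the question."""
    q = _tokens(question)
    if not contexts:
        return 0.0
    relevant = sum(1 for c in contexts if _tokens(c) & q)
    return relevant / len(contexts)
```

Low recall points at the retriever missing needed content; low precision points at noisy retrieval drowning the answer in irrelevant chunks, which is exactly the decomposition the talk recommends.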
Fine-tuning is then framed as “long-term memory” and behavior shaping: continuing training on smaller, domain-specific datasets to specialize a base model. It’s most useful when the base model already contains the needed knowledge but struggles with consistent instruction-following, output structure (e.g., valid JSON), or a specific methodology. Fine-tuning is less suitable for adding genuinely new knowledge (where RAG is better), and it has a slower feedback loop—so it shouldn’t start as the first move.
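Behavior-shaping fine-tuning of the kind described above starts from a supervised dataset. A common shape is the JSONL chat format used by OpenAI's fine-tuning API: one JSON object per line, each a full conversation ending with the assistant output the model should learn. The extraction task and its fields below are invented for illustration.

```python
# Sketch of a fine-tuning dataset for a strict-JSON output task, in JSONL
# chat format, plus the sanity checks a pipeline would run before uploading.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract fields as valid JSON."},
            {"role": "user", "content": "Meeting with Dana on Friday at 3pm."},
            {"role": "assistant",
             "content": '{"person": "Dana", "day": "Friday", "time": "15:00"}'},
        ]
    },
]

jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Validate each line: the last message is the assistant target, and for a
# strict-JSON task the target itself must parse as JSON.
for line in jsonl.splitlines():
    ex = json.loads(line)
    assert ex["messages"][-1]["role"] == "assistant"
    json.loads(ex["messages"][-1]["content"])
```

The check that every target parses as JSON matters because fine-tuning replicates whatever the examples contain; a malformed target teaches the malformation, which is the same lesson as the Slack-tone cautionary tale later in the talk.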
Two case studies reinforce the division of labor. Canva fine-tuned GPT-3.5 Turbo to generate structured design guidelines from natural language, beating base models and even outperforming GPT-4 on expert evaluations—because the task required a precise output structure rather than new knowledge. A cautionary tale shows how fine-tuning on the wrong style data can lock in the wrong behavior: a writing assistant trained on Slack messages replicated terse Slack tone instead of the desired blog-post voice.
Finally, a Spider 1.0 benchmark walkthrough ties the framework together: start with prompt engineering and RAG, then fine-tune once error patterns suggest behavior/format issues. On Spider, simple prompt engineering began around 69%, RAG with hypothetical document embeddings and other retrieval tweaks pushed performance near state-of-the-art, and fine-tuning plus lightweight RAG example injection reached about 83.5%, demonstrating that performance gains often come from cycling between context retrieval and behavior specialization rather than following a straight line.
Cornell Notes
The talk proposes a diagnostic framework for improving LLM performance: treat “context” (what the model needs to know) and “LLM behavior” (how it should act) as separate optimization axes. Teams should start with prompt engineering plus rigorous evaluation to establish a baseline and identify error types. When failures come from missing or mis-selected information, RAG acts as short-term memory by retrieving domain content at query time; when failures come from inconsistent instruction-following or output structure, fine-tuning acts as long-term behavior shaping. RAG can fail catastrophically if retrieval is wrong, so evaluation should measure both answer quality and retrieval quality (e.g., via Ragas metrics). Fine-tuning is most effective when the needed knowledge already exists in the base model but the output format or methodology must be specialized.
- Why does the talk reject a simple linear progression from prompt engineering to RAG to fine-tuning?
- What are the “best place to start” prompt engineering practices, and what do they help with?
- How does RAG work, and what problem does it solve best?
- What can go wrong with RAG even when the model is told to use only retrieved content?
- Which Ragas metrics help separate LLM answer quality from retrieval quality?
- When is fine-tuning likely to help, and when is it likely to be wasted effort?
Review Questions
- How would you decide whether a performance problem is primarily a context issue or a behavior/instruction issue using the talk’s two-axis model?
- In a RAG system that claims “use only retrieved content,” what retrieval failure modes could still produce incorrect answers, and how would Ragas metrics help diagnose them?
- Why might fine-tuning improve output structure even when the model already “knows” the underlying facts? Give an example from the talk’s cases.
Key Points
1. Start optimization by diagnosing whether failures come from missing/mis-selected context or from inconsistent model behavior and output structure.
2. Use prompt engineering with a baseline and systematic evaluation before escalating; clear instructions, task decomposition, and time-to-think patterns are common early levers.
3. Treat RAG as short-term memory: retrieve domain content at query time to supply information and constrain hallucinations, but expect failure if retrieval is wrong.
4. Evaluate RAG with metrics that separate answer quality (faithfulness, answer relevancy) from retrieval quality (context precision, context recall) to avoid optimizing the wrong component.
5. Use fine-tuning when the base model already contains the needed knowledge but needs specialization for instruction-following, strict formats (e.g., JSON), or a consistent methodology.
6. Fine-tuning is not ideal for adding new knowledge or for fast iteration; it requires slower data-collection and training cycles.
7. Performance gains often come from cycling between prompt engineering, RAG, and fine-tuning rather than following a straight-line sequence.