
A Survey of Techniques for Maximizing LLM Performance

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Start optimization by diagnosing whether failures come from missing/mis-selected context or from inconsistent model behavior and output structure.

Briefing

Maximizing LLM performance in production depends less on finding a single “best” technique and more on diagnosing what’s actually failing—context, instruction-following, or both—then applying the right tool in the right order. Optimization is hard because signal gets buried under noise, performance is difficult to measure reliably, and it’s often unclear which change will fix the specific bottleneck. The core takeaway is a two-axis mental model: optimize “context” (what the model needs to know) separately from “LLM behavior” (how it should act). That framing matters because retrieval-augmented generation (RAG) and fine-tuning solve different problems, so treating them as a linear ladder can waste time and money.

The recommended starting point is prompt engineering paired with systematic evaluation. Clear instructions, task decomposition, and giving the model time to think (for example via reasoning-style prompting) help establish a baseline quickly. Few-shot examples often provide the next lift, and the talk emphasizes avoiding “whack-a-mole” iteration by locking in solid evals and measuring changes consistently. Prompting is also described as a poor fit for scaling in new information, reliably replicating complex styles beyond what fits in context, or minimizing token usage—issues that show up once accuracy plateaus.
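
As a concrete illustration of that starting point, here is a minimal few-shot prompting sketch using the OpenAI Python SDK; the routing task, labels, and model name are illustrative assumptions rather than details from the talk.

```python
# Minimal few-shot prompting sketch with the OpenAI Python SDK.
# The ticket-routing task, labels, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a support-ticket router. "
    "Reply with exactly one label: BILLING, BUG, or OTHER."
)

# Few-shot examples demonstrate both the expected behavior and the output format.
FEW_SHOT = [
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "BILLING"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "BUG"},
]

def route(ticket: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model works here
        messages=[{"role": "system", "content": SYSTEM}, *FEW_SHOT,
                  {"role": "user", "content": ticket}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(route("My invoice shows the wrong amount."))
```

Keeping the instruction, examples, and output format fixed is what makes each subsequent prompt change measurable against the same evaluation set.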

When the gap is missing or insufficient information, RAG becomes the primary lever. RAG is positioned as “short-term memory”: it retrieves domain-specific content at query time, then supplies that content to the model to answer. The talk walks through a typical pipeline: embed documents into a knowledge base, run similarity search for a user question, and combine retrieved text with an instruction prompt. RAG is praised for reducing hallucinations by constraining answers to retrieved content, but it’s also flagged as fragile: if retrieval is wrong, the model has a “0% chance” of being correct even when it’s told to use only the provided sources.
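
A stripped-down version of that pipeline, assuming OpenAI embeddings and a tiny in-memory document store, might look like the following sketch; the documents and model names are placeholders.

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity,
# and answer only from the retrieved text. Documents and model names are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()
DOCS = [
    "Our refund window is 30 days from the purchase date.",
    "Enterprise plans include 24/7 phone support.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

DOC_VECS = embed(DOCS)

def answer(question: str, k: int = 1) -> str:
    q = embed([question])[0]
    # Cosine similarity between the question and every document vector.
    sims = DOC_VECS @ q / (np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(q))
    context = "\n".join(DOCS[i] for i in np.argsort(-sims)[:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do I have to request a refund?"))
```

In production the in-memory list would be a vector database, but the constrained system prompt is the part doing the hallucination control the talk describes, and it only works if retrieval returns the right passage.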

A detailed success story illustrates how RAG often requires many iterations and multiple components beyond embeddings alone. Starting from 45% accuracy with a two-knowledge-base setup, improvements came from hypothetical document embeddings (for that use case), chunking and embedding experiments, re-ranking (cross-encoder or rules), classification to choose the right knowledge base, and then adding tools like SQL execution for structured queries plus query expansion. The project reached 98% accuracy without fine-tuning because the failures were fundamentally context-selection problems.
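
One of those components, hypothetical document embeddings (HyDE), can be sketched roughly as below: the model drafts a plausible answer first, and that draft rather than the raw question is embedded for the similarity search. The model names are assumptions.

```python
# Sketch of hypothetical document embeddings (HyDE): draft a hypothetical answer,
# then embed the draft instead of the raw question. Model names are assumptions.
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that plausibly answers: {question}"}],
    ).choices[0].message.content
    # The hypothetical passage usually sits closer to real documents in embedding
    # space than the bare question does, which can improve retrieval.
    return client.embeddings.create(
        model="text-embedding-3-small", input=[draft]
    ).data[0].embedding
```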

To evaluate RAG more rigorously, the talk highlights Ragas (Exploding Gradients), which separates metrics into “faithfulness” and “answer relevancy” on the LLM side, and “context precision” and “context recall” on the retrieval side—helping teams tell whether accuracy is coming from better answers or from better retrieval.
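
A minimal Ragas run over a single example might look like the sketch below; import paths and dataset column names have shifted across Ragas releases, so treat the details as illustrative.

```python
# Illustrative Ragas evaluation sketch. Exact column names and import paths
# vary by ragas version; an OpenAI key (or other judge model) must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)

eval_data = Dataset.from_dict({
    "question":     ["How long is the refund window?"],
    "answer":       ["Refunds can be requested within 30 days of purchase."],
    "contexts":     [["Our refund window is 30 days from the purchase date."]],
    "ground_truth": ["30 days from the purchase date."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores separating generation quality from retrieval quality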

Fine-tuning is then framed as “long-term memory” and behavior shaping: continuing training on smaller, domain-specific datasets to specialize a base model. It’s most useful when the base model already contains the needed knowledge but struggles with consistent instruction-following, output structure (e.g., valid JSON), or a specific methodology. Fine-tuning is less suitable for adding genuinely new knowledge (where RAG is better), and it has a slower feedback loop—so it shouldn’t start as the first move.
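
In practice, this usually means assembling a small chat-formatted dataset that demonstrates the target behavior and submitting a fine-tuning job, roughly as sketched below; the training example and model name are assumptions, not details from the talk.

```python
# Sketch of the OpenAI fine-tuning flow for behavior/format shaping.
# The example content and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Each training example is a short chat demonstrating the target behavior,
# e.g. always answering with a strict JSON structure.
examples = [
    {"messages": [
        {"role": "system", "content": "Return design guidelines as JSON."},
        {"role": "user", "content": "A poster for a summer jazz festival."},
        {"role": "assistant",
         "content": '{"palette": "warm", "font": "serif", "layout": "centered"}'},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job; the result is a specialized model checkpoint
```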

Two case studies reinforce the division of labor. Canva fine-tuned GPT-3.5 Turbo to generate structured design guidelines from natural language, beating base models and even outperforming GPT-4 on expert evaluations, because the task required a precise output structure rather than new knowledge. A cautionary tale shows how fine-tuning on data in the wrong style can lock in the wrong behavior: a writing assistant trained on Slack messages replicated terse Slack tone instead of the desired blog-post voice.

Finally, a Spider 1.0 benchmark walkthrough ties the framework together: start with prompt engineering and RAG, then fine-tune once error patterns suggest behavior/format issues. On Spider, simple prompt engineering began around 69%, RAG with hypothetical document embeddings and other retrieval tweaks pushed performance near state-of-the-art, and fine-tuning plus lightweight RAG example injection reached about 83.5%, demonstrating that performance gains often come from cycling between context retrieval and behavior specialization rather than following a straight line.

Cornell Notes

The talk proposes a diagnostic framework for improving LLM performance: treat “context” (what the model needs to know) and “LLM behavior” (how it should act) as separate optimization axes. Teams should start with prompt engineering plus rigorous evaluation to establish a baseline and identify error types. When failures come from missing or mis-selected information, RAG acts as short-term memory by retrieving domain content at query time; when failures come from inconsistent instruction-following or output structure, fine-tuning acts as long-term behavior shaping. RAG can fail catastrophically if retrieval is wrong, so evaluation should measure both answer quality and retrieval quality (e.g., via Ragas metrics). Fine-tuning is most effective when the needed knowledge already exists in the base model but the output format or methodology must be specialized.

Why does the talk reject a simple linear progression from prompt engineering to RAG to fine-tuning?

RAG and fine-tuning target different bottlenecks. Prompt engineering is mainly about guiding behavior using context packed into the prompt; RAG supplies missing domain information at inference time; fine-tuning changes how the model behaves or structures outputs. If the problem is context-selection, RAG can fix it without fine-tuning. If the problem is inconsistent instruction-following or strict formatting, fine-tuning can outperform further prompt tricks. The framework uses two axes—context optimization and LLM optimization—to decide which tool fits the failure mode.

What are the “best place to start” prompt engineering practices, and what do they help with?

The talk emphasizes clear instructions, splitting complex tasks into simpler subtasks, and giving the model time to think (including approaches like reasoning-style prompting). It also stresses systematic testing: build an evaluation set early, measure outputs consistently, and avoid “whack-a-mole” changes that move around the evaluation matrix. Few-shot examples are highlighted as a common next step that can improve performance and reveal whether the remaining gap is context or behavior.
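
A bare-bones harness for that kind of consistent measurement might look like the following sketch; the test cases and the exact-match grading rule are illustrative assumptions.

```python
# Minimal evaluation-harness sketch: a fixed test set scored the same way on
# every prompt change, so improvements are not judged anecdotally.
# The task and exact-match grading rule are illustrative assumptions.
from typing import Callable

EVAL_SET = [
    {"input": "I was charged twice this month.", "expected": "BILLING"},
    {"input": "The export button crashes the app.", "expected": "BUG"},
    {"input": "Do you ship to Canada?", "expected": "OTHER"},
]

def accuracy(predict: Callable[[str], str]) -> float:
    hits = sum(predict(ex["input"]).strip() == ex["expected"] for ex in EVAL_SET)
    return hits / len(EVAL_SET)

# Usage: score each prompt variant with the same function and keep the history,
# e.g. baseline = accuracy(route_v1); with_few_shot = accuracy(route_v2)
# (route_v1/route_v2 are hypothetical prompt variants of the same task).
```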

How does RAG work, and what problem does it solve best?

RAG retrieves domain-specific content at query time. A typical flow embeds documents into a knowledge base, runs a similarity search for the user question, then combines retrieved text with an instruction prompt so the model answers using that content. The talk frames RAG as short-term memory: it updates the model with the right information for the specific question, which helps with hallucination control when the model is constrained to retrieved sources.

What can go wrong with RAG even when the model is told to use only retrieved content?

If retrieval returns irrelevant or incorrect passages, the model has no reliable material to answer correctly. The talk’s cautionary example describes a “hallucination-reduction” setup where labelers flagged an answer as hallucinated, but it turned out the “wrong” song was actually present in the customer’s retrieved document. The deeper lesson is that RAG adds a second failure axis—retrieval quality—so evaluation must measure both faithfulness/relevancy and retrieval precision/recall.

Which Ragas metrics help separate LLM answer quality from retrieval quality?

Ragas (Exploding Gradients) splits evaluation into four metrics: faithfulness (facts in the answer can be reconciled with retrieved content), answer relevancy (the answer addresses the user’s question rather than using retrieved content tangentially), context precision (the retrieved blocks that were actually used in the answer—useful for diagnosing “Lost in the Middle” effects), and context recall (whether the retrieved set contains the information needed to answer). This separation helps decide whether to change prompting, retrieval, or both.

When is fine-tuning likely to help, and when is it likely to be wasted effort?

Fine-tuning is best when the base model already has the needed knowledge but struggles with consistent methodology, strict output structure (like valid JSON), or instruction-following. It can also emphasize a subset of existing knowledge (e.g., a SQL dialect) or distill behavior from a larger model. It’s less effective for adding new knowledge (use RAG), and it’s a poor first step for rapidly iterating on a new use case because the feedback loop is slower and data preparation is expensive.

Review Questions

  1. How would you decide whether a performance problem is primarily a context issue or a behavior/instruction issue using the talk’s two-axis model?
  2. In a RAG system that claims “use only retrieved content,” what retrieval failure modes could still produce incorrect answers, and how would Ragas metrics help diagnose them?
  3. Why might fine-tuning improve output structure even when the model already “knows” the underlying facts? Give an example from the talk’s cases.

Key Points

  1. Start optimization by diagnosing whether failures come from missing/mis-selected context or from inconsistent model behavior and output structure.

  2. Use prompt engineering with a baseline and systematic evaluation before escalating; clear instructions, task decomposition, and time-to-think patterns are common early levers.

  3. Treat RAG as short-term memory: retrieve domain content at query time to supply information and constrain hallucinations, but expect failure if retrieval is wrong.

  4. Evaluate RAG with metrics that separate answer quality (faithfulness, answer relevancy) from retrieval quality (context precision, context recall) to avoid optimizing the wrong component.

  5. Use fine-tuning when the base model already contains the needed knowledge but needs specialization for instruction-following, strict formats (e.g., JSON), or a consistent methodology.

  6. Fine-tuning is not ideal for adding new knowledge or for fast iteration; it requires slower data and training cycles.

  7. Performance gains often come from cycling between prompt engineering, RAG, and fine-tuning rather than following a straight-line sequence.

Highlights

  • The talk’s central framework splits optimization into two axes: context optimization (what the model needs to know) and LLM optimization (how it needs to act).
  • A RAG-only project reached 98% accuracy without fine-tuning because the bottleneck was context selection and retrieval quality, not missing model knowledge.
  • Ragas (Exploding Gradients) evaluates RAG with four metrics that disentangle faithfulness/relevancy from context precision/recall, helping teams pinpoint whether retrieval or generation is failing.
  • Fine-tuning can dramatically improve efficiency and output reliability (often reducing prompt complexity), but it’s a poor tool for injecting genuinely new knowledge compared with RAG.

Topics

Mentioned

  • John Allard
  • Colin Jarvis
  • RAG
  • GPT-4
  • GPT-4 Turbo
  • SQL
  • ELO