OpenAI DevDay 2024 | Balancing accuracy, latency, and cost at scale

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Define a business-relevant accuracy target (e.g., first-attempt routing success) before optimizing for speed or cost.

Briefing

Scaling an LLM-powered app from thousands to millions of users forces hard tradeoffs between accuracy, latency, and cost—and the most reliable path is to treat optimization as a measurable loop: set an accuracy target with evals, then “hill climb” toward it using prompt engineering, RAG, and fine-tuning, and only after that optimize speed and spend. The practical payoff is straightforward: once accuracy is locked in, reducing latency and per-request cost increases the number of inferences possible within the same budget, which can translate into better overall product performance.

OpenAI frames the process as a two-stage pipeline. First, accuracy: builders should define an accuracy target in business terms (for example, “90% of customer service tickets are routed correctly on the first attempt”), then build baseline evaluations to measure end-to-end behavior in production-like conditions. Evals come in two main forms—component evals (unit-test style checks for individual steps) and end-to-end evals (black-box tests that run the full workflow from input to final output). A key best practice for scaling complex systems—especially customer service networks with multiple intents, assistants, and tools—is to mine historic conversations, have LLMs simulate customers, and run them through the routing/instruction network to detect “divergence,” where changes in one part of the network break assumptions made elsewhere. In one example, a customer service setup expanded from 50 routines to over 400, with more than 1,000 evals rerun after material changes.

Second, once the accuracy target is agreed upon, optimization focuses on reaching it at the lowest cost. OpenAI describes a “four-box” approach: start with prompt engineering (explicit instructions and iterative eval-driven debugging), move to retrieval augmented generation (RAG) when missing context is the failure mode, and turn to fine-tuning when the model struggles with instruction-following consistency or needs more examples. Recent shifts include using long context to scale prompt engineering (without fully replacing RAG) and automating prompt optimization via “meta prompting,” where a model iterates on prompts based on eval results, accelerated by models like o1 for tasks such as generating and optimizing customer service routines.

A concrete accuracy case study starts from a weak baseline (45% accuracy using cosine similarity retrieval) in a regulated domain. The team set a tolerance that prioritized false negatives over false positives to reduce harmful hallucinations, then pushed toward a 95% target. Improvements came from chunking and embedding grid searches, adding a classification step to choose keyword vs semantic search, introducing a domain-specific reranker, and handling analytical questions with SQL generation plus query expansion.
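
As a rough illustration of that classify-then-retrieve step (not the team's actual code), the sketch below assumes hypothetical `keyword_search`, `semantic_search`, and `rerank` stand-ins for the real backends:

```python
from openai import OpenAI

client = OpenAI()

def keyword_search(q: str) -> list[str]: return []   # stand-in: e.g., BM25 over the corpus
def semantic_search(q: str) -> list[str]: return []  # stand-in: embedding similarity search
def rerank(q: str, docs: list[str]) -> list[str]: return docs  # stand-in: domain-specific reranker

def classify_query(question: str) -> str:
    """Ask a small model whether keyword or semantic search fits the question."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Reply with exactly 'keyword' (IDs, codes, exact phrases) "
                       f"or 'semantic' (conceptual questions): {question}",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def retrieve(question: str) -> list[str]:
    """Route to the appropriate search backend, then rerank the candidates."""
    docs = (keyword_search(question) if classify_query(question) == "keyword"
            else semantic_search(question))
    return rerank(question, docs)
```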

After accuracy is achieved, latency and cost become the next levers. Latency is decomposed into network latency, time to first token (prompt processing), and output latency (time between tokens). Output latency dominates for most workloads, but exceptions exist (e.g., long-document classifiers). OpenAI highlights practical tactics: reduce prompt length where possible, choose smaller models when feasible (e.g., o1 mini vs larger options), and cut output tokens by prompting for concise answers. Network latency is also being reduced through regional routing so requests are processed closer to where they originate.

Cost optimization overlaps with latency work: fewer tokens generally means both faster responses and lower spend. OpenAI emphasizes operational controls (per-project usage limits and alerts), then introduces two major efficiency tools. Prompt caching, announced at DevDay, requires a prefix match: static instructions and examples should be placed at the start of prompts so repeated requests reuse the cached prefix computation, with a 50% discount on cached tokens. BatchAPI offers another 50% discount on both prompt and output tokens by running large asynchronous workloads within up to 24 hours, without consuming regular rate limits. Echo AI’s customer-call categorization workflow illustrates the trade: giving up strict real-time processing in favor of batch jobs can cut costs dramatically while still supporting scalable operations.

Overall, the central message is that there is no single playbook. The art is selecting the right tradeoffs among accuracy, latency, and cost—using eval-driven development to make those tradeoffs measurable and safe.

Cornell Notes

Scaling LLM apps reliably starts with measurable accuracy. Builders should define a business-relevant accuracy target, build baseline evals (component and end-to-end), and use LLM-simulated “customer” runs to catch regressions in complex routing/tool networks. Once the target is set, optimization becomes a hill climb using prompt engineering first, then RAG for missing context, and fine-tuning when consistency or examples are the bottleneck. After accuracy is locked, latency and cost are reduced by decomposing latency into network, time-to-first-token, and output latency, then cutting prompt and output tokens and choosing smaller models when possible. Cost drops further with prompt caching (prefix-match) and BatchAPI for asynchronous workloads at half price.

How do evals prevent “silent failures” when an LLM app changes over time?

OpenAI recommends eval-driven development with two complementary eval types. Component evals act like unit tests for single steps (e.g., routing a question to the right intent, or calling the correct tools). End-to-end evals run the full workflow from input to final output, catching multi-step breakages. For customer service networks, a scaling tactic is to mine historic conversations, generate LLM-simulated customer objectives, run them through the intent/tool network, and mark pass/fail across routing, tool use, and whether the customer’s goal was achieved. This helps detect divergence—changes in one part of the network breaking behavior elsewhere—so regressions are caught whenever the system is updated.
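
A minimal sketch of what such a harness could look like, assuming hypothetical stand-ins (`route_ticket`, `run_assistant`, `eval_cases`) for the routing/tool network and test data; this illustrates the pattern, not OpenAI's actual tooling:

```python
from openai import OpenAI

client = OpenAI()

def simulate_customer_goal(historic_conversation: str) -> str:
    """Distill a mined conversation into a fresh goal an LLM 'customer' will pursue."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model works for simulation
        messages=[{
            "role": "user",
            "content": "State, in one sentence, the customer's goal in this "
                       f"support conversation:\n\n{historic_conversation}",
        }],
    )
    return resp.choices[0].message.content

def route_ticket(goal: str) -> str:
    # Hypothetical stand-in for your intent router.
    return "billing" if "refund" in goal.lower() else "general"

def run_assistant(intent: str, goal: str) -> str:
    # Hypothetical stand-in for the full assistant/tool workflow.
    return f"[{intent}] resolved: {goal}"

# Each case pairs a mined conversation with the expected routing and outcome.
eval_cases = [
    {"conversation": "Customer: I was double charged and want my refund...",
     "expected_intent": "billing", "expected_outcome": "refund"},
]

def run_end_to_end_eval(case: dict) -> dict:
    goal = simulate_customer_goal(case["conversation"])
    intent = route_ticket(goal)               # component eval: routing step
    transcript = run_assistant(intent, goal)  # end-to-end eval: full workflow
    return {
        "routing_ok": intent == case["expected_intent"],
        "goal_met": case["expected_outcome"] in transcript.lower(),
    }

results = [run_end_to_end_eval(c) for c in eval_cases]  # rerun after material changes
print(sum(r["routing_ok"] and r["goal_met"] for r in results) / len(results))
```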

Why does OpenAI insist on an explicit accuracy target before optimizing cost and speed?

Because “good enough” accuracy is where many teams stall. OpenAI frames accuracy as something that must deliver ROI and business safety, not a vague technical threshold. In a customer service example, pilots sat around 80–85% accuracy, and management needed a break-even point. A cost model used assumptions like saving $20 per case when triage succeeds on the first attempt, losing $40 per escalated case due to human handling time, and losing $1,000 per churned customer, with 5% of escalated customers churning. That produced a break-even accuracy around 81.5%, after which the team agreed on a 90% target for shipping. Notably, human agents in that scenario reached only 66% accuracy, making the 90% LLM target easier to justify.
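
The break-even arithmetic can be reproduced directly from those figures. A quick sketch, assuming the $1,000 churn loss applies to the 5% of escalated customers who churn; with these inputs the result lands near the ~81.5% quoted (the talk's exact assumptions may differ slightly):

```python
# Back-of-the-envelope break-even calculation from the quoted figures.
saved_per_success = 20.0    # $ saved when triage succeeds on the first attempt
lost_per_escalation = 40.0  # $ human handling time per escalated case
churn_loss = 1000.0         # $ lost per churned customer
churn_rate = 0.05           # assumed share of escalated customers who churn

expected_escalation_cost = lost_per_escalation + churn_rate * churn_loss  # $90
# Break-even accuracy a solves: a * 20 = (1 - a) * 90
break_even = expected_escalation_cost / (saved_per_success + expected_escalation_cost)
print(f"break-even accuracy ~ {break_even:.1%}")  # ~ 81.8%
```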

What is the practical “four-box” optimization path for reaching an accuracy target?

The approach starts with prompt engineering: use explicit instructions, run evals, identify failure modes, and iterate. If the model fails because it lacks information, retrieval augmented generation (RAG) is the next move. If failures come from inconsistent instruction following or missing examples/style, fine-tuning is used. OpenAI also notes evolving best practices: long context can scale prompt engineering more effectively (without fully replacing RAG), and meta prompting can automate prompt iteration by using a model to generate prompt changes based on eval results—often accelerated by o1 for tasks like generating and optimizing customer service routines.
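
A minimal meta-prompting loop might look like the following sketch, where `run_evals` is a hypothetical stand-in for replaying an eval suite and collecting failure descriptions:

```python
from openai import OpenAI

client = OpenAI()

def run_evals(prompt: str) -> list[str]:
    # Hypothetical stub: replay your eval suite against `prompt` and return
    # human-readable descriptions of failing cases (empty list = all pass).
    return []

prompt = "You are a customer service router. Classify each ticket's intent."
for _ in range(5):  # cap the hill climb at a few iterations
    failures = run_evals(prompt)
    if not failures:
        break
    resp = client.chat.completions.create(
        model="o1-preview",  # the talk highlights o1 for this kind of task
        messages=[{
            "role": "user",
            "content": "Rewrite the prompt below so the listed eval failures "
                       "are fixed, without dropping existing instructions.\n\n"
                       f"PROMPT:\n{prompt}\n\nFAILURES:\n" + "\n".join(failures),
        }],
    )
    prompt = resp.choices[0].message.content
print(prompt)
```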

How should latency be analyzed for LLM apps, and what usually dominates?

Latency isn’t treated as a single monolithic number. OpenAI breaks total request latency into three parts: network latency (time to route to GPUs and back), time to first token (TTFT, prompt processing), and output latency (time between tokens, often called TBT). For many workloads, output latency dominates—often 90%+ of total time—because generating tokens is the main cost driver. Exceptions exist, such as classifiers over long documents where prompt/input processing can dominate. Optimization then focuses on the biggest component: shorten prompts to reduce TTFT, reduce output tokens by prompting for concise answers, and choose smaller models when possible.
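
One way to see the breakdown for yourself is to time a streaming request. A small sketch, using streamed chunks as a rough proxy for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List three uses of prompt caching."}],
    stream=True,
)

chunk_times = []
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start  # time to first token (plus network latency)
tbt = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft:.3f}s, avg TBT: {tbt * 1000:.1f}ms over {len(chunk_times)} chunks")
```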

What are prompt caching and BatchAPI, and what constraints make them effective?

Prompt caching (announced at DevDay) saves money and speeds up requests when the prompt prefix matches a previously seen prompt. It uses prefix matching, so even a one-character change at the beginning breaks cache reuse; static instructions, one-shot examples, and function definitions should be placed at the start, while variable user content goes later. OpenAI says caches typically stay alive for about 5–10 minutes. BatchAPI provides 50% off both prompt and output tokens by running asynchronous batches that complete within 24 hours (often faster off-peak) and do not count against regular rate limits. It’s best for workloads like evals, content generation at scale, indexing for retrieval, and other tasks that don’t require immediate synchronous responses.
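
The two mechanics can be sketched together: a cache-friendly prompt layout (static prefix first, variable content last) and a Batch API submission using the documented JSONL request format. The classifier framing and file names here are illustrative, and note that a prefix must exceed a minimum length (on the order of 1,000 tokens) before caching kicks in, which this toy prefix would not:

```python
import json
from openai import OpenAI

client = OpenAI()

STATIC_PREFIX = (
    "You are a support classifier. Categories: billing, shipping, other.\n"
    "Example: 'Where is my package?' -> shipping\n"
)  # identical across requests, so repeated calls can reuse the cached prefix

def classify(ticket: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_PREFIX},  # static, cacheable prefix
            {"role": "user", "content": ticket},           # variable suffix goes last
        ],
    )
    return resp.choices[0].message.content

# Batch API: write one request per line, upload, and submit with a 24h window.
with open("tickets.jsonl", "w") as f:
    for i, ticket in enumerate(["Where is my refund?", "Package never arrived"]):
        f.write(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "system", "content": STATIC_PREFIX},
                                  {"role": "user", "content": ticket}]},
        }) + "\n")

batch_file = client.files.create(file=open("tickets.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```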

Review Questions

  1. What two categories of evals does OpenAI recommend, and how does each one help catch different failure modes?
  2. In the latency breakdown (network latency, TTFT, and output latency), which component usually dominates for most LLM workloads, and what tactics reduce it?
  3. Why does prompt caching require prefix stability, and how should prompt structure be arranged to maximize cache hits?

Key Points

  1. Define a business-relevant accuracy target (e.g., first-attempt routing success) before optimizing for speed or cost.
  2. Use both component evals and end-to-end evals to measure behavior from individual steps through full workflows.
  3. Scale production confidence by simulating customers with LLMs and rerunning large eval suites after material changes to routing/tool networks.
  4. Reach the accuracy target via a sequence: prompt engineering first, then RAG for missing context, then fine-tuning for consistency and example/style gaps.
  5. Optimize latency by decomposing it into network latency, time to first token, and output latency; output latency often accounts for 90%+ of total time.
  6. Reduce cost with overlapping tactics: fewer tokens generally means faster responses and lower spend, plus operational controls like per-project usage limits.
  7. Use prompt caching (prefix-match) and BatchAPI (asynchronous, half-price tokens, separate rate limits) to cut costs without sacrificing required accuracy.

Highlights

  • Accuracy optimization is treated as a measurable gate: build evals, set an accuracy target tied to ROI, then optimize until the target is met before chasing latency and cost.
  • Customer service networks can regress through “divergence,” so teams should mine historic conversations, generate LLM-simulated customer cases, and rerun thousands of end-to-end evals after changes.
  • Latency is analyzed as network latency + TTFT + output latency (time between tokens); output generation usually dominates, so concise outputs and smaller models often deliver the biggest wins.
  • Prompt caching only works with prefix matches, so static instructions and examples should be placed at the start of prompts to maximize cache hits.
  • BatchAPI offers 50% off prompt and output tokens for asynchronous workloads and doesn’t consume regular rate limits, enabling large-scale processing within up to 24 hours.

Topics

  • LLM App Scaling
  • Eval-Driven Development
  • Accuracy Targets
  • Latency Optimization
  • Cost Optimization

Mentioned

  • Colin Jarvis
  • Jeff Harris
  • LLM
  • RAG
  • TTFT
  • TBT
  • API
  • GPT-4o
  • GPT-4 Turbo
  • GPT-4o mini
  • o1
  • o1 preview
  • o1 mini
  • SQL