
What I Tell Every CTO Before They Touch Claude Code or the Anthropic API

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Define correctness as a measurable target before making architectural choices like RAG, agent counts, orchestration, or context design.

Briefing

The central lesson for CTOs and anyone building with Claude Code or the Anthropic API is blunt: AI systems only become reliable when “correctness” is defined precisely enough to measure—and that definition must be baked into architecture, not left vague for later. Without a clear quality target, teams end up chasing an unnamed, shifting goalpost; downstream choices like retrieval-augmented generation, agent design, orchestration, context engineering, and model selection become elaborate ways of building on top of uncertainty.

Correctness in AI rarely behaves like a simple pass/fail test. Instead, it’s a bundle of competing requirements—truthfulness, completeness, tone, policy compliance, refusal behavior, speed and cost tradeoffs, and in enterprise settings, auditability. The hard part is deciding what kinds of uncertainty are acceptable and which inaccuracies are fatal. That decision has to stay useful across both unstructured and structured data, especially when systems combine retrieved text with structured records for compliance workflows or board decks. “Close enough” can still be wrong by a single digit, and in trust-sensitive contexts that’s not a rounding error—it’s a credibility failure.

A recurring failure mode comes from how humans operate: definitions of “good” drift midstream. Product priorities change after OKRs are set; stakeholders quietly renegotiate what counts as correct; then the system gets blamed for unreliability. The same pattern shows up in agentic builds when teams discover correctness late, forcing repeated architecture changes and leaving engineers stuck in endless debates over whether answers must be confident and fast with no caveats, or slow but exact, or always include narrative context. The transcript frames this as a “correctness discovery” problem that multiplies complexity—especially when correctness depends on judgment calls, such as whether an agent should update a sales contact record based on agentic search signals or defer to the salesperson’s forecasting.

This precision problem also explains why hallucinations persist. If evaluation setups reward confident answers over honest uncertainty, models learn to guess rather than to say “I don’t know.” The issue isn’t framed as purely a model limitation; it’s a reward-and-definition mismatch between what humans actually want and what the system is incentivized to produce. Measurement distorts behavior as well: once a proxy metric becomes the target, systems optimize for the proxy instead of the intent. The transcript gives a concrete example of Gemini 3’s tendency to perform better on single-turn prompts than in multi-turn conversation, attributing it to how reinforcement learning rewards were structured and how training data reflects what gets rewarded.

The practical prescription is to build a culture of correctness that resists gaming. That means using multiple criteria, defining explicit failure modes, enabling calibrated uncertainty (including when the system should refuse), and tracking provenance so claims can be traced to sources. Good evaluations aren’t busywork; they force teams to articulate what correctness means and to test it at both unit and orchestration levels.
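To make this concrete, here is one way such a rubric could be sketched in code. The field names, thresholds, and criteria below are illustrative assumptions, not a format prescribed in the video:

```python
from dataclasses import dataclass, field

# Illustrative only: field names and thresholds are assumptions,
# not anything prescribed in the video.

@dataclass
class Answer:
    text: str
    cited_sources: list[str] = field(default_factory=list)
    is_refusal: bool = False          # the system explicitly said "I don't know"
    stated_confidence: float = 1.0    # 0.0-1.0, recorded so calibration can be audited later

@dataclass
class CorrectnessCriteria:
    require_citation: bool = True     # every claim must trace to a source
    allow_refusal: bool = True        # honest uncertainty is an acceptable outcome
    max_latency_s: float = 5.0        # operational constraint, part of "correct"
    banned_phrases: tuple = ("guaranteed returns",)  # explicit failure mode

def evaluate(answer: Answer, latency_s: float, criteria: CorrectnessCriteria) -> dict:
    """Score an answer against several criteria instead of one pass/fail label."""
    failures = []

    if any(p in answer.text.lower() for p in criteria.banned_phrases):
        failures.append("policy_violation")            # fatal inaccuracy, never acceptable

    if criteria.require_citation and not answer.is_refusal and not answer.cited_sources:
        failures.append("missing_provenance")          # claim made with no evidence

    if latency_s > criteria.max_latency_s:
        failures.append("too_slow")                    # speed is part of the quality bar

    refused_ok = answer.is_refusal and criteria.allow_refusal
    return {
        "passed": not failures and (refused_ok or bool(answer.cited_sources)),
        "refused": answer.is_refusal,
        "failures": failures,
    }

if __name__ == "__main__":
    honest_refusal = Answer(text="I can't verify that figure.", is_refusal=True, stated_confidence=0.2)
    print(evaluate(honest_refusal, latency_s=1.2, criteria=CorrectnessCriteria()))
```

The point of the sketch is that an honest refusal can pass while a confident, uncited claim fails, which is the behavior the evaluation is supposed to reward.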

Finally, the transcript argues that organizations often fail because they never answered “what good looks like” in the first place—then they sell AI on top of dirty data and expect adoption. The closing framework reframes correctness as a set of claims the system is allowed to make, the evidence required for each claim, and the penalties for being wrong versus staying silent. Prompting becomes a workflow that imposes a quality bar; the better the prompt, the clearer the expected output and the clearer the quality target. The question left hanging is whether teams can actually name what “good” means before they ask an AI to deliver it.

Cornell Notes

Reliability in AI hinges on defining “correctness” precisely enough to measure. In probabilistic systems, correctness is not binary; it’s a set of requirements (truthfulness, completeness, policy compliance, refusal behavior, auditability, and more) plus rules for what uncertainty is acceptable. Teams often fail when stakeholders keep moving the goalposts, discovering correctness late and forcing repeated architectural changes. Measurement also distorts behavior: if evaluations reward confidence or a proxy metric, models learn to optimize for the wrong thing, producing hallucinations or other unwanted behaviors. The transcript’s prescription is to treat correctness as allowed claims with required evidence and explicit penalties for being wrong versus staying silent, then enforce it through prompts, evaluations, and architecture.

Why does “correctness” need to be defined before architecture decisions like RAG, agents, or orchestration?

Because those downstream choices become built on top of an unnamed, shifting target if “correct” isn’t defined. The transcript emphasizes that AI projects often fail not because the model is dumb, but because nobody can answer what correctness means in that context. If correctness can’t be named, it can’t be measured; if it can’t be measured, it can’t be improved. That makes retrieval strategies, agent design, context engineering, and model selection feel like elaborate fixes for a problem that was never specified.

What makes correctness harder in AI than in traditional software?

Traditional software can treat correctness as pass/fail via tests. AI is probabilistic, so correctness is a bundle of competing requirements: truthfulness, completeness, tone, policy compliance, refusal behavior, and operational constraints like speed/cost. Enterprise systems add auditability. The transcript also stresses deciding which uncertainty types are allowed and which inaccuracies are fatal—especially when combining unstructured retrieval (which can sound right but be wrong) with structured data (which can be correct yet unusable when merged).

How do evaluation and reward design contribute to hallucinations?

When evaluation setups reward confident answers over honest uncertainty, systems learn to guess. The transcript cites OpenAI guidance/papers arguing that common evaluation setups keep hallucinations alive unless the definition of correctness changes. It frames this as a human problem: the system optimizes what humans reward, so if “acceptable” is defined as confident statements even when uncertain, the model will mirror that incentive rather than correct it.
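A toy scoring comparison makes the incentive visible. The numbers below are assumptions chosen for illustration, not figures from the transcript:

```python
# Under a "reward confidence" rubric, guessing beats admitting uncertainty;
# penalizing confident errors more than abstentions reverses that incentive.

P_CORRECT_GUESS = 0.3   # assumed chance a confident guess happens to be right

def expected_score(correct: float, wrong: float, abstain: float, p: float = P_CORRECT_GUESS):
    """Expected score of 'always guess' vs. 'abstain when unsure' under one rubric."""
    guess = p * correct + (1 - p) * wrong
    return {"always_guess": round(guess, 2), "abstain": abstain}

# Rubric A: right=1, wrong=0, "I don't know"=0 -> guessing is never worse than abstaining.
print("reward-confidence rubric:", expected_score(correct=1.0, wrong=0.0, abstain=0.0))

# Rubric B: right=1, confident-but-wrong=-2, abstain=0 -> honest uncertainty wins.
print("penalize-confident-errors rubric:", expected_score(correct=1.0, wrong=-2.0, abstain=0.0))
```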

What is “measurement distorts behavior,” and how does it show up in model behavior?

Once a measure becomes a target, it stops being a good measure (Goodhart’s law). In AI, proxy metrics can cause reward hacking: the system satisfies the literal objective while missing the intent. The transcript gives Gemini 3 as an example—optimized for single-turn prompts and weaker in multi-turn conversation—attributed to reinforcement learning setups with limited rewarded examples of multi-turn dynamics, while real users often interact in single-turn ways.
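A minimal sketch of proxy gaming, with made-up strings and a deliberately naive keyword-overlap score standing in for the proxy metric:

```python
# Toy illustration of a proxy metric being gamed (all strings are made up).
REFERENCE_KEYWORDS = {"revenue", "q3", "forecast", "churn"}

def proxy_score(answer: str) -> int:
    """Counts reference keywords present: the measure, not the intent."""
    return len(REFERENCE_KEYWORDS & set(answer.lower().split()))

useful = "Revenue in the third quarter is flat; customer attrition is the main risk to the plan."
gamed = "revenue revenue q3 q3 forecast forecast churn churn"  # keyword stuffing

print("useful answer:", proxy_score(useful))  # scores 1 despite answering the question
print("gamed answer:", proxy_score(gamed))    # scores 4 despite being useless
```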

What does a “culture of correctness” look like in practice?

The transcript recommends resisting gaming by using multiple criteria for correctness, defining explicit failure modes, enabling calibrated uncertainty (including when to refuse), and using provenance so claims can be traced to sources. It also argues for testing at two levels: unit tests for individual agents and system-level tests for orchestration, so correctness is enforced both locally and end-to-end.
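One way to picture the two levels is a pytest-style sketch. The agent functions named here are hypothetical placeholders, not a real API:

```python
# Minimal two-level test sketch (pytest style). extract_totals and summarize_filing
# are hypothetical stand-ins for an extraction agent and an orchestration step.

def extract_totals(doc: str) -> dict:
    """Hypothetical unit under test: pulls a single figure out of a document."""
    # A real implementation would call a model; stubbed here for illustration.
    return {"q3_revenue": 1_200_000, "source": "filing.pdf#p12"}

def summarize_filing(doc: str) -> dict:
    """Hypothetical orchestration step that composes extraction with narrative."""
    totals = extract_totals(doc)
    return {"summary": f"Q3 revenue was ${totals['q3_revenue']:,}.", **totals}

# Unit level: one agent, one narrow correctness claim (exact figure plus provenance).
def test_extraction_is_exact_and_cited():
    result = extract_totals("...")
    assert result["q3_revenue"] == 1_200_000   # "close enough" is still wrong
    assert result["source"]                     # every claim traces to a source

# System level: the orchestrated output must not drift from the extracted evidence.
def test_summary_matches_extracted_evidence():
    result = summarize_filing("...")
    assert "1,200,000" in result["summary"]
```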

How should teams reframe correctness to make it actionable?

Instead of treating correctness as something humans can keep vague, the transcript proposes modeling it as: (1) the set of claims the system is allowed to make, (2) the evidence required for each claim, and (3) the penalties for being wrong versus staying silent. If teams can’t list allowed claims, they haven’t broken the problem down enough for the AI to act reliably. Prompting is treated as a workflow that imposes a quality bar, so prompts should clearly specify what “good” output looks like.
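A sketch of that framing as data, assuming hypothetical claim names, evidence descriptions, and penalty weights:

```python
from dataclasses import dataclass

# Sketch of the claims / evidence / penalties framing.
# Claim names, evidence types, and penalty weights are illustrative assumptions.

@dataclass
class AllowedClaim:
    name: str                 # what the system is permitted to assert
    required_evidence: str    # what must back the claim before it can be made
    penalty_if_wrong: float   # cost of asserting it incorrectly
    penalty_if_silent: float  # cost of refusing when the evidence existed

POLICY = [
    AllowedClaim("quarterly_revenue_figure", "cited cell in the source spreadsheet",
                 penalty_if_wrong=10.0, penalty_if_silent=1.0),
    AllowedClaim("contact_record_update", "agentic search result confirmed by the rep",
                 penalty_if_wrong=5.0, penalty_if_silent=0.5),
]

def may_assert(claim_name: str, evidence_found: bool) -> str:
    """Only make a claim when its required evidence is present; otherwise stay silent."""
    for claim in POLICY:
        if claim.name == claim_name:
            return "assert" if evidence_found else "refuse"
    return "refuse"  # anything outside the allowed list is never asserted

print(may_assert("quarterly_revenue_figure", evidence_found=False))  # -> refuse
```

If a team cannot fill in a table like POLICY, that is a sign the problem has not been broken down far enough for the AI to act reliably.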

Review Questions

  1. What specific components of correctness (e.g., truthfulness, auditability, refusal behavior) must be decided before choosing agent architecture?
  2. How can an evaluation metric unintentionally reward the wrong behavior, and what concrete changes would prevent that?
  3. In the claims/evidence/penalties framework, what does it mean to allow the system to “stay silent,” and how would you test that behavior?

Key Points

  1. Define correctness as a measurable target before making architectural choices like RAG, agent counts, orchestration, or context design.

  2. Treat correctness as a bundle of requirements (truthfulness, completeness, tone, policy compliance, refusal behavior, auditability) rather than a pass/fail label.

  3. Prevent goalpost drift by aligning stakeholders early on what “good” means, including acceptable uncertainty and fatal error conditions.

  4. Design evaluations so they reward calibrated uncertainty and provenance, not just confident answers or proxy metrics.

  5. Expect measurement distortions: once a metric becomes the target, systems may optimize for the metric instead of the underlying intent (reward hacking).

  6. Build correctness into prompts and system workflows by specifying expected output quality and clear failure modes.

  7. Reframe correctness as allowed claims with required evidence and explicit penalties for being wrong versus staying silent.

Highlights

Correctness can’t be left vague: if it can’t be defined, it can’t be measured—and everything downstream becomes architecture built on a shifting target.
Hallucinations often persist because evaluation and reward setups can reward confident guessing instead of honest uncertainty.
Measurement distorts behavior: proxy metrics and reward design can drive optimization that misses the real objective.
A practical framework for reliability is allowed claims + required evidence + penalties for being wrong versus refusing.
