
Anthropic's New Benchmark Changes Everything—Most People Will Miss Why

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

METR-style evaluations measure how long an AI can sustain useful agentic work, using success-likelihood thresholds such as 50% and 80%.

Briefing

The most consequential shift in AI agent progress is moving from “benchmarks that cap out” to evaluations that keep climbing—meaning models can sustain useful work for longer and longer stretches, with no obvious ceiling. That matters because it turns AI from a tool that helps in short bursts into a system that can execute multi-hour, multi-day tasks with measurable success rates, reshaping how work is delegated, verified, and rewarded.

A nonprofit called METR (Model Evaluation and Threat Research) publishes a time-horizon metric that quantifies how long an AI can perform an agentic task before it stops being reliably useful. The method is simple: take the task's human completion time, then ask whether the AI can reach at least a 50% likelihood of success (and, separately, an 80% threshold). The key difference from many popular benchmarks (such as SWE-bench, which tops out at 100%) is that METR's agentic-work graph doesn't flatten. When there's no top end, incremental improvements still signal real capability growth rather than "stuck at the ceiling" noise.

That framing is used to argue that AI progress is not just exponential but super exponential—improving faster than exponential growth. The latest cited evidence is Opus 4.5, which reportedly reaches nearly five hours of human-equivalent work at the 50% success mark (about 4 hours 45 minutes). At the 80% mark, the figure is given as 27–28 minutes for Opus 4.5. The comparison point is the speed of change: earlier systems were completing tasks in minutes, then tens of minutes, and now approaching multi-hour performance. The claim is that this implies a doubling cadence roughly every four to four-and-a-half months.

If that trajectory holds, the practical timeline becomes stark: around 10 hours by the end of Q1, 20 hours by Q2 into Q3, and potentially 40 hours by year’s end. The argument then links the measurement trend to a feedback loop. Super exponential gains are interpreted as signs of a self-reinforcing flywheel: AI systems help train and improve other AI systems, automating parts of the pipeline and accelerating progress on tasks that don’t have an upper limit.

The workforce implications are framed around a single operational question for 2026: can individuals delegate a week’s worth of meaningful work to agents? In a super exponential world, the “skill to assign agents” must also evolve quickly—people who build the workflow now (early in the year) will have an advantage as agents become capable of harder, longer-horizon tasks. Waiting to “catch up” later is portrayed as a losing strategy because the gap compounds.

The transcript also predicts a power-law labor market: only a few people will be able to leverage agent teams to produce outsized outcomes, not necessarily because they have more money, but because they develop the relevant skills. That shifts career strategy away from traditional job-family checklists toward outcome obsession, ownership, and “good taste” in defining what excellent looks like. It also spreads technical and non-technical requirements across roles—engineers may need business and customer fluency to set quality targets, while non-technical workers must learn to direct and hold agentic systems accountable.

Finally, the argument insists domain expertise still matters. Agents can expand scope and transform workflows, but they won’t replace deep experience in every context—especially where quality depends on years of judgment. The central takeaway is that the METR-style agentic-work graph is treated as a leading indicator: as models sustain longer work cycles, the entire structure of delegation, verification, and career progression is expected to change for everyone.

Cornell Notes

Agent capability is being measured in terms of how long useful work can be sustained, not just whether a model can solve a task once. METR’s evaluation framework ties success likelihood (50% and 80%) to the time horizon, and—unlike capped benchmarks such as SWE-bench—its agentic-work graph has no obvious top end. Using Opus 4.5 as an example, the transcript cites nearly five hours of human-equivalent work at 50% success, with 27–28 minutes at 80%, and argues this reflects super exponential progress (doubling roughly every 4–4.5 months). If that continues, delegating a week’s worth of work to agents becomes the key 2026 skill, driving a power-law distribution of productivity and forcing new “good taste” and accountability practices across job families.

What does METR measure, and why does the “no top end” feature matter for interpreting progress?

METR’s framework measures how long an AI can perform an agentic task relative to human time, while also tracking success-likelihood thresholds. It starts from the human completion time for a task and then asks whether the AI can achieve at least a 50% likelihood of success (and separately an 80% threshold). The crucial interpretive point is that some benchmarks cap out (SWE-bench is cited as topping out at 100%), so improvements near the ceiling can look flat even when real capability changes. METR’s agentic-work graph is described as not topping out, so rising performance continues to reflect meaningful capability gains rather than “ceiling effects.”
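To make the two thresholds concrete: one plausible way such horizons could be estimated (a sketch only, not METR's actual pipeline; the function name and the toy data below are invented for illustration) is to fit a logistic curve of success probability against log task length, then read off the task lengths where the fitted curve crosses 50% and 80%:

```python
import numpy as np

def fit_horizons(task_minutes, successes):
    """Fit p(success) vs. log2(task length) with a logistic curve via
    plain gradient descent, then return the task lengths (in minutes)
    at which the fitted probability crosses 0.5 and 0.8."""
    x = np.log2(np.asarray(task_minutes, dtype=float))
    y = np.asarray(successes, dtype=float)
    mu = x.mean()
    xc = x - mu                      # centering keeps the descent stable
    a, b = 0.0, 0.0                  # slope and intercept (centered space)
    for _ in range(5000):
        p = 1.0 / (1.0 + np.exp(-(a * xc + b)))
        a -= 0.5 * np.mean((p - y) * xc)   # gradient of the log-loss
        b -= 0.5 * np.mean(p - y)
    def horizon(prob):
        # Solve a*(x - mu) + b = logit(prob) for x, convert back to minutes.
        logit = np.log(prob / (1.0 - prob))
        return 2.0 ** (mu + (logit - b) / a)
    return horizon(0.5), horizon(0.8)

# Hypothetical results: short tasks mostly succeed, long tasks mostly fail.
h50, h80 = fit_horizons(
    task_minutes=[5, 15, 30, 60, 120, 240, 480],
    successes=[1, 1, 1, 0, 1, 0, 0],
)
```

Because the fitted success probability falls as tasks get longer, the 80% horizon is always shorter than the 50% horizon, which is exactly the gap the transcript highlights for Opus 4.5.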

How do the Opus 4.5 numbers illustrate the difference between 50% and 80% success thresholds?

Opus 4.5 is cited as reaching nearly five hours (about 4 hours 45 minutes) of human-equivalent work at the 50% likelihood of success mark. At the stricter 80% threshold, the cited figure is about 27–28 minutes. The transcript uses this gap to show that higher reliability shortens the time horizon, but the overall trend still signals expanding sustained capability as the 50% time horizon grows from minutes to hours.

What does “super exponential” mean in this context, and what timeline does it imply?

“Super exponential” is used to mean growth faster than exponential—improvements that accelerate rather than just compound. The transcript links this to a roughly doubling cadence every 4 to 4.5 months. It then projects a practical time horizon expansion: if the 50% success horizon is around 4–5 hours, it could reach about 10 hours by the end of Q1, 20 hours by Q2 into Q3, and 40 hours by year’s end (or beyond), assuming the trend continues.
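The projection above is simple doubling arithmetic. A minimal sketch (the function name is mine; the transcript's cadence is 4 to 4.5 months, and a truly super exponential trend would shorten the cadence over time rather than hold it fixed):

```python
def projected_horizon(start_hours, months_elapsed, doubling_months=4.5):
    """Time horizon after `months_elapsed`, assuming a fixed doubling
    cadence (plain exponential growth; super exponential would be faster)."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

# Starting from the cited ~4.75-hour 50%-success horizon:
for months in (0, 4.5, 9, 13.5):
    print(f"+{months} months: ~{projected_horizon(4.75, months):.1f} hours")
```

With a 4.5-month cadence, 4.75 hours doubles to 9.5, 19, and 38 hours over three cadences, which is roughly the transcript's 10/20/40-hour timeline.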

Why does the transcript connect super exponential gains to a self-reinforcing AI flywheel?

The argument is that AI progress may be feeding back into itself. As models improve, they can help train and refine other models, and more of that pipeline becomes automated over time. That creates a reinforcing loop: better AI systems accelerate the creation of even better systems, especially for hard tasks that don’t have an upper limit—so progress can keep accelerating rather than leveling off.

What is the “delegating a week’s worth of work” claim, and why is it treated as a 2026 inflection point?

The transcript frames 2026 around whether individuals can delegate a week’s worth of meaningful work to agents. As agentic systems can sustain longer task execution, productivity advantage shifts toward people who can define high-quality goals (“good taste”), assign work precisely, set accountability, and intervene when needed. Because agent capability is improving quickly, the skill to assign and manage agent work must also be learned early; waiting to “catch up” later is portrayed as increasingly difficult as the gap widens.

How does the transcript describe the likely labor-market outcome—power laws and skill-based rewards?

It predicts a power-law distribution: not everyone benefits equally, and a small number of people can produce disproportionately large outcomes. The mechanism is skill—especially skills tied to directing AI agents toward useful ends—rather than simply having more resources. In this view, traditional career requirements may matter less than the ability to leverage agent teams effectively, leading to productivity stratification across roles.

Review Questions

  1. How does METR’s success-likelihood/time-horizon approach differ from capped benchmarks like SWE-bench, and what interpretive advantage does that provide?
  2. What do the transcript’s Opus 4.5 figures imply about the tradeoff between 50% and 80% success thresholds over time?
  3. According to the transcript, what set of skills determines who can delegate a week’s worth of work to agents in 2026?

Key Points

  1. METR-style evaluations measure how long an AI can sustain useful agentic work, using success-likelihood thresholds such as 50% and 80%.

  2. Benchmarks that cap out (like SWE-bench at 100%) can hide real progress, while uncapped agentic-work graphs keep signaling capability growth.

  3. Opus 4.5 is cited as reaching nearly five hours of human-equivalent work at 50% success, while the 80% threshold corresponds to roughly 27–28 minutes.

  4. The transcript links rising agentic work time to super exponential growth, projecting doubling on a ~4 to 4.5 month cadence and multi-hour to multi-day horizons within a year.

  5. If AI progress is self-reinforcing (AI helping train AI), workflows and workforce expectations are likely to shift rapidly in 2026 and beyond.

  6. Career advantage is framed as outcome and ownership focused: defining “good taste,” assigning agent tasks well, holding agents accountable, and intervening to maintain quality.

  7. Deep domain expertise still matters, but the ability to direct agents toward useful ends is expected to spread across job families.

Highlights

METR’s uncapped agentic-work graph is treated as a clearer signal of real capability growth than benchmarks that flatten near 100%.
Opus 4.5 is cited at nearly five hours of human-equivalent work at 50% success, contrasted with about 27–28 minutes at 80%.
The transcript’s central workforce question for 2026 is whether individuals can delegate a week’s worth of meaningful work to agents.
Super exponential gains are framed as evidence of a self-reinforcing flywheel where AI increasingly helps train and improve AI systems.
A power-law labor market is predicted, where skill at directing agents drives outsized productivity for a minority.

Topics

  • Agentic Benchmarks
  • METR Evaluation
  • Super Exponential Growth
  • Power-Law Productivity
  • Workforce Strategy
