Anthropic's New Benchmark Changes Everything—Most People Will Miss Why
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
METR-style evaluations measure how long an AI can sustain useful agentic work, using success-likelihood thresholds such as 50% and 80%.
Briefing
The most consequential shift in AI agent progress is moving from “benchmarks that cap out” to evaluations that keep climbing—meaning models can sustain useful work for longer and longer stretches, with no obvious ceiling. That matters because it turns AI from a tool that helps in short bursts into a system that can execute multi-hour, multi-day tasks with measurable success rates, reshaping how work is delegated, verified, and rewarded.
A nonprofit called METR (Model Evaluation and Threat Research) publishes a time-horizon metric that quantifies how long an AI can perform agentic work before it stops being reliably useful. The method is simple: report, in human completion time, the length of task the AI can finish with at least a 50% likelihood of success (an 80% threshold is also tracked). The key difference from many popular benchmarks, such as SWE-bench, which caps at 100%, is that METR's agentic-work graph doesn't flatten. When there's no top end, incremental improvements still signal real capability growth rather than "stuck at the ceiling" noise.
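For intuition, here is a minimal sketch of the time-horizon idea: fit a logistic curve of success probability against the log of human completion time, then read off where the curve crosses 50% and 80%. The data and the fitting choices below are hypothetical illustrations of the concept, not METR's actual code or dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical task data: human completion time in minutes, and the model's
# observed success rate on tasks of that length. Illustrative values only.
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480, 960], dtype=float)
success_rate = np.array([0.98, 0.95, 0.90, 0.80, 0.65, 0.55, 0.45, 0.30, 0.15])

def logistic(log_t, log_h50, slope):
    # Success probability as a logistic function of log task length;
    # log_h50 is the log of the 50% time horizon.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(logistic, np.log(human_minutes), success_rate,
                      p0=[np.log(120.0), 1.0])
log_h50, slope = params

# The 50% horizon is a fitted parameter; solve the logistic for p = 0.8
# to get the (necessarily shorter) 80% horizon.
h50 = np.exp(log_h50)
h80 = np.exp(log_h50 - np.log(0.8 / 0.2) / slope)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Because the 80% threshold sits higher on the same curve, it always lands at a shorter task length than the 50% threshold, which is why the two figures for any one model differ so sharply.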
That framing is used to argue that AI progress is not just exponential but super-exponential, improving faster than exponential growth. The latest cited evidence is Opus 4.5, which reportedly reaches nearly five hours of human-equivalent work at the 50% success threshold (about 4 hours 45 minutes); at the 80% threshold, the figure given is 27–28 minutes. The comparison point is the speed of change: earlier systems completed tasks measured in minutes, then tens of minutes, and current models are approaching multi-hour performance. The claim is that this implies a doubling cadence of roughly every four to four-and-a-half months.
If that trajectory holds, the practical timeline becomes stark: around 10 hours by the end of Q1, 20 hours by Q2 into Q3, and potentially 40 hours by year's end. The argument then links the measurement trend to a feedback loop: super-exponential gains are interpreted as signs of a self-reinforcing flywheel in which AI systems help train and improve other AI systems, automating parts of the pipeline and accelerating progress on tasks that have no upper limit.
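To see the arithmetic behind that timeline, the snippet below compounds the cited ~4.75-hour horizon at a fixed 4.5-month doubling cadence. A super-exponential trend, where the doubling time itself shrinks, would run ahead of these figures, so this is a conservative floor. The inputs are the transcript's claims, not independent measurements.

```python
# Back-of-envelope projection: horizon(t) = h0 * 2^(t / doubling_time).
h0_hours = 4.75        # cited Opus 4.5 horizon at 50% success (~4h45m)
doubling_months = 4.5  # claimed doubling cadence

for months in (4.5, 9.0, 13.5):
    projected = h0_hours * 2 ** (months / doubling_months)
    print(f"+{months:4.1f} months: ~{projected:.0f} hours")

# Output: ~10 hours after one doubling, ~19 after two, ~38 after three,
# roughly matching the "10 by Q1, 20 by Q2-Q3, 40 by year's end" trajectory.
```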
The workforce implications are framed around a single operational question for 2026: can individuals delegate a week's worth of meaningful work to agents? In a super-exponential world, the "skill to assign agents" must also evolve quickly: people who build the workflows now, early in the year, will have an advantage as agents become capable of harder, longer-horizon tasks. Waiting to "catch up" later is portrayed as a losing strategy because the gap compounds.
The transcript also predicts a power-law labor market: only a few people will be able to leverage agent teams to produce outsized outcomes, not necessarily because they have more money but because they develop the relevant skills. That shifts career strategy away from traditional job-family checklists toward outcome obsession, ownership, and "good taste" in defining what excellent looks like. It also spreads technical and non-technical requirements across roles: engineers may need business and customer fluency to set quality targets, while non-technical workers must learn to direct agentic systems and hold them accountable.
Finally, the argument insists domain expertise still matters. Agents can expand scope and transform workflows, but they won't replace deep experience in every context, especially where quality depends on years of judgment. The central takeaway is that the METR-style agentic-work graph is treated as a leading indicator: as models sustain longer work cycles, the entire structure of delegation, verification, and career progression is expected to change for everyone.
Cornell Notes
Agent capability is being measured in terms of how long useful work can be sustained, not just whether a model can solve a task once. METR's evaluation framework ties success likelihood (50% and 80%) to the time horizon, and, unlike capped benchmarks such as SWE-bench, its agentic-work graph has no obvious top end. Using Opus 4.5 as an example, the transcript cites nearly five hours of human-equivalent work at 50% success, with 27–28 minutes at 80%, and argues this reflects super-exponential progress (doubling roughly every 4–4.5 months). If that continues, delegating a week's worth of work to agents becomes the key 2026 skill, driving a power-law distribution of productivity and forcing new "good taste" and accountability practices across job families.
What does METR measure, and why does the "no top end" feature matter for interpreting progress?
How do the Opus 4.5 numbers illustrate the difference between 50% and 80% success thresholds?
What does "super-exponential" mean in this context, and what timeline does it imply?
Why does the transcript connect super-exponential gains to a self-reinforcing AI flywheel?
What is the “delegating a week’s worth of work” claim, and why is it treated as a 2026 inflection point?
How does the transcript describe the likely labor-market outcome—power laws and skill-based rewards?
Review Questions
- How does METR's success-likelihood/time-horizon approach differ from capped benchmarks like SWE-bench, and what interpretive advantage does that provide?
- What do the transcript’s Opus 4.5 figures imply about the tradeoff between 50% and 80% success thresholds over time?
- According to the transcript, what set of skills determines who can delegate a week’s worth of work to agents in 2026?
Key Points
1. METR-style evaluations measure how long an AI can sustain useful agentic work, using success-likelihood thresholds such as 50% and 80%.
2. Benchmarks that cap out (like SWE-bench at 100%) can hide real progress, while uncapped agentic-work graphs keep signaling capability growth.
3. Opus 4.5 is cited as reaching nearly five hours of human-equivalent work at 50% success, while the 80% threshold corresponds to roughly 27–28 minutes.
4. The transcript links rising agentic work time to super-exponential growth, projecting doubling on a roughly 4-to-4.5-month cadence and multi-hour to multi-day horizons within a year.
5. If AI progress is self-reinforcing, with AI helping to train AI, workflows and workforce expectations are likely to shift rapidly in 2026 and beyond.
6. Career advantage is framed as outcome- and ownership-focused: defining "good taste," assigning agent tasks well, holding agents accountable, and intervening to maintain quality.
7. Deep domain expertise still matters, but the ability to direct agents toward useful ends is expected to spread across job families.