AI Makes You Dumb and Slow
Based on ThePrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
A randomized controlled trial found that allowing AI assistance increased observed implementation time by about 19% for experienced open-source developers on real repo issues.
Briefing
A randomized controlled trial of early-2025 AI coding tools found a counterintuitive result: experienced open-source developers took about 19% longer to complete real, repo-based tasks when AI assistance was allowed—an outcome that clashes with both developer expectations and expert forecasts. The study frames this as a snapshot of tool capability in a realistic setting, not a claim about AI’s ultimate potential. Still, the magnitude is striking enough to force a rethink of how “productivity” should be measured when AI changes the workflow.
Researchers recruited 16 experienced contributors from large open-source repositories, averaging roughly five years of prior experience per developer. Across 246 issues (bug fixes, features, and refactors), each issue was randomly assigned to one of two conditions: AI use allowed or AI use disallowed. When AI was allowed, developers could use the tools of their choice, primarily Cursor Pro with Claude 3.5 Sonnet and other frontier models available at the time of the study. The core metric was implementation time for each issue, with developers recording their screens and self-reporting total implementation time.
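To make the measurement concrete, the sketch below shows one way randomized per-issue timings could be aggregated into a single slowdown estimate. The briefing does not specify the study's actual estimator, so the ratio-of-geometric-means summary, the helper function, and the timing values here are illustrative assumptions rather than the paper's analysis.

```python
# Illustrative sketch only: assumes the slowdown is summarized as a ratio of
# geometric means of per-issue implementation times (a common choice for
# right-skewed timing data); the numbers below are placeholders, not study data.
import math

times_ai_allowed = [2.5, 4.0, 1.8, 6.2, 3.1]     # hypothetical hours per issue, AI allowed
times_ai_disallowed = [2.1, 3.4, 1.6, 5.0, 2.8]  # hypothetical hours per issue, AI disallowed

def geometric_mean(xs):
    # exp of the mean log time; less sensitive to a few very long issues
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

ratio = geometric_mean(times_ai_allowed) / geometric_mean(times_ai_disallowed)
print(f"estimated slowdown factor: {ratio:.2f}")  # values above 1.0 mean AI-allowed issues took longer
```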
The headline finding: with AI enabled, realized completion times increased by about 19% (the transcript also mentions a “99% longer” figure, but the concrete number referenced repeatedly is 19%). Developers’ own forecasts were far more optimistic. Before starting, participants predicted AI would cut their completion time by about 24%; afterward, they still believed it had sped them up by roughly 20%, despite the observed slowdown. That gap between perception and measured performance is one of the study’s most important signals: self-reports and intuition may not track actual throughput when AI is involved.
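To see how large that perception gap is, the percentages above can be restated as multipliers on baseline completion time; this is plain arithmetic on the figures already quoted, not additional data from the study.

```python
# Convert the quoted percentages into time multipliers relative to the no-AI baseline.
predicted_before = 1 - 0.24  # pre-task forecast: 24% faster -> 0.76x baseline time
believed_after = 1 - 0.20    # post-task belief: ~20% faster -> 0.80x baseline time
observed = 1 + 0.19          # measured outcome: 19% slower -> 1.19x baseline time

print(f"forecast multiplier:          {predicted_before:.2f}x")
print(f"post-task belief multiplier:  {believed_after:.2f}x")
print(f"observed multiplier:          {observed:.2f}x")
# How much more time tasks actually took than developers expected going in (~1.57x).
print(f"observed vs. forecast:        {observed / predicted_before:.2f}x")
```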
The trial also digs into why benchmark-style success might not translate into real coding speed. The researchers argue that common software benchmarks often trade realism for scale: tasks can be self-contained, algorithmically scored, and missing the messy context of live repositories, human review standards, and iterative debugging. In real development, AI can produce code that is functionally correct but verbose, or it can fragment work into smaller subtasks without reducing total effort. Even when AI helps, downstream costs—review time, cleanup, and integration—may erase speed gains.
To address concerns about experimental validity, the study reports that the slowdown persisted across multiple analyses and outcome measures, and it attempted to rule out artifacts such as differential dropout or quality differences in submitted pull requests. It also lists plausible contributing factors, including over-optimistic forecasts, repository familiarity effects, reduced AI reliability in large complex codebases, low acceptance rates for AI generations (under 44%), and missing implicit repository context.
The transcript emphasizes that the result is not a universal verdict on AI. The authors explicitly avoid claiming that AI never speeds up developers or that these developers and repositories represent all software work. Instead, the study is positioned as evidence that early-2025 AI tools can slow experienced contributors in a specific, high-standard environment—and that reconciling benchmark scores, anecdotes, and field performance likely requires multiple evaluation methods. The takeaway for the broader AI debate: measuring “capability” isn’t the same as measuring “usefulness,” and workflow friction can dominate outcomes even when models look strong on tests.
Cornell Notes
A randomized controlled trial of early-2025 AI coding assistance found that experienced open-source developers completed real repository tasks about 19% more slowly when AI use was allowed. The slowdown conflicted with both developer self-forecasts (predicting ~24% faster) and post-task beliefs (still estimating ~20% faster), highlighting a large perception gap. The study used 16 developers and 246 issues from large, complex repos, with AI access primarily via Cursor Pro using Claude 3.5 Sonnet and other frontier models. Researchers argue that benchmark success may not translate to real productivity because real coding includes context, review standards, integration, and cleanup costs. The results are framed as setting-specific evidence rather than a universal claim about AI’s future impact.
- How did the study measure “productivity” and why does that matter for interpreting the results?
- What was the experimental design, and how did randomization support causal claims?
- Why might developers believe AI speeds them up even when measured time increases?
- How do benchmark results and real-world coding outcomes diverge in this framing?
- What specific factors were proposed to explain the slowdown?
- What does the study avoid claiming, and why is that boundary important?
Review Questions
- What does the study treat as a proxy for task difficulty, and how does that proxy help isolate the effect of AI on completion time?
- List at least three mechanisms that could turn “AI-generated progress” into longer real completion times in a repository workflow.
- Why might benchmark scores and anecdotal reports both be compatible with a field trial showing a slowdown?
Key Points
1. A randomized controlled trial found that allowing AI assistance increased observed implementation time by about 19% for experienced open-source developers on real repo issues.
2. Developer forecasts and post-task beliefs were substantially more optimistic than the measured outcome, indicating a major perception gap about AI’s speed impact.
3. Measuring time-to-completion helps avoid misleading productivity proxies like lines of code or number of subtasks, which AI can inflate without reducing effort.
4. Benchmark-style success may not translate to real productivity because real coding includes implicit repository context, integration, and human review/cleanup overhead.
5. The study reports the slowdown persisted across multiple analyses and attempted to rule out common experimental artifacts such as differential dropout or quality differences in submissions.
6. Low acceptance of AI generations and frequent cleanup of AI-produced code are central to the proposed explanation for why AI can slow experienced contributors.
7. The findings are framed as setting-specific evidence rather than a universal claim about AI’s future ability to accelerate software work.