I just tried o3-mini

The PrimeTime
5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

o3-mini showed faster response times in this test, often landing in about 5–10 seconds versus ~30 seconds for o1 mini.

Briefing

o3-mini delivers a noticeable speed boost for coding tasks—often returning responses in roughly 5 to 10 seconds versus around 30 seconds with o1 mini—but it doesn’t amount to a fundamental leap in programming capability. After roughly an hour and a half building a small Go-based plugin that counts “ones” and “twos” in chat and displays results on a web page via WebSockets, the model produced working code quickly. The output wasn’t “good code” by the builder’s standards, yet it was impressive that it could generate an end-to-end feature set from scratch.

Where the experience felt less revolutionary was in usability and correctness under real interaction constraints. The voting mechanism—where one user could cast multiple votes—required careful handling to land updates in the right place. The builder found that the model missed some basic usability fundamentals, and the resulting implementation needed steering. The project also highlighted how much of the work still depends on solid engineering judgment: the system used Go for the backend, inline HTML and JavaScript for the frontend, and relied on global variables and stringified JavaScript. Seeing the generated code made it clear that programming knowledge is still the difference between “getting something working” and building something maintainable.

The builder’s broader takeaway is that hype cycles are likely overselling the practical impact. While o3-mini felt faster, there wasn’t a clear “new iteration” in day-to-day programming effectiveness. Comparisons to Claude (including a Claude experience that felt “pretty much the same”) reinforced the sense that improvements are incremental rather than transformative—especially for small, isolated “feature islands” like a 300-line plugin. Scaling to much larger systems (features that can run 10,000–20,000 lines within a broader ecosystem) remains an open question, since the testbed was too small to judge how well the model handles complex, extensible architecture.

Despite the lack of a step-change, the builder argues that experience matters more, not less. Faster code generation increases the need to understand how changes affect the whole codebase, where stress points emerge, and how to refactor toward extensibility. Without that background, it’s easy to introduce expensive technical debt that only becomes obvious later. The practical conclusion: don’t assume the world is ending or that job loss is imminent purely because models generate code faster.

Looking ahead, the builder plans more testing with additional hardware and access to GPU resources, including SSH access to a small cluster and an invitation in Florida to evaluate more extensive setups. There’s also mention of R1 being easier to use and that even a 32B variant feels better than o1 mini, suggesting the next round of comparisons will focus on real-world performance beyond a single small prototype.

Cornell Notes

o3-mini is faster for coding than o1 mini, with many responses landing in about 5–10 seconds instead of ~30 seconds. In a small Go + WebSockets + inline HTML/JavaScript plugin (about 300 lines), it generated working code quickly, but the output wasn’t “good code,” and it missed some usability fundamentals in a multi-vote mechanism. The experience suggests improvements are more incremental than transformative for practical, day-to-day programming—especially in small, isolated features. The builder argues that programming fundamentals and experience remain crucial because generated code still needs architectural judgment, refactoring, and risk awareness as projects scale. Future tests are planned on more extensive GPU setups and with other models like R1.

What concrete speed difference did o3-mini show compared with o1 mini in this hands-on test?

The builder reported that o3-mini responses often arrived in roughly 5 to 10 seconds, while o1 mini frequently took about 30 seconds. The perceived improvement was significant enough to call o3-mini “like a good three times faster,” at least for the kinds of coding prompts used during the session.

What kind of project did the builder use to evaluate o3-mini, and what stack did it involve?

The test was a small plugin that counts the number of “ones” and “twos” in chat and displays the results on a web page. The implementation used Go for the backend, WebSockets for real-time updates, and inline HTML plus JavaScript on the frontend. The generated code also included global variables and stringified JavaScript, reflecting a fairly direct approach to wiring UI updates to backend events.
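The video doesn’t show the generated source, but the core of such a plugin is a small shared tally. A minimal Go sketch of that idea — with hypothetical names, the WebSocket transport omitted, and a mutex instead of the globals the model reportedly used — might look like:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// VoteCounter tallies "one"/"two" votes parsed from chat messages.
// All names here are illustrative; the original plugin's identifiers
// aren't shown in the video.
type VoteCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewVoteCounter() *VoteCounter {
	return &VoteCounter{counts: map[string]int{"one": 0, "two": 0}}
}

// Record increments the tally matching a chat message. Note that nothing
// stops the same user from voting repeatedly, mirroring the multi-vote
// behavior the builder had to handle.
func (v *VoteCounter) Record(msg string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	switch strings.TrimSpace(msg) {
	case "1", "one":
		v.counts["one"]++
	case "2", "two":
		v.counts["two"]++
	}
}

// Snapshot returns a copy of the tallies, e.g. to serialize and push to
// the web page over a WebSocket connection.
func (v *VoteCounter) Snapshot() map[string]int {
	v.mu.Lock()
	defer v.mu.Unlock()
	out := make(map[string]int, len(v.counts))
	for k, n := range v.counts {
		out[k] = n
	}
	return out
}

func main() {
	vc := NewVoteCounter()
	for _, msg := range []string{"1", "2", "1", "hello", "one"} {
		vc.Record(msg)
	}
	fmt.Println(vc.Snapshot())
}
```

In a real version, each incoming chat message would call `Record`, and each change would push a fresh `Snapshot` to connected browsers — exactly the “land updates in the right place” wiring the builder said needed steering.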

Where did o3-mini fall short in the builder’s view?

The builder said the model missed some basic usability and correctness details in a voting mechanism where one person could vote multiple times. Getting the voting updates into the correct location required steering, implying the model didn’t automatically handle the interaction logic cleanly.

Why does the builder think the results don’t prove a “fundamental shift” in programming ability?

The prototype was only about 300 lines, described as an isolated “feature island.” The builder argued that this size can’t reliably predict performance when features expand into much larger components—on the order of 10,000 to 20,000 lines within a bigger ecosystem—where architecture, extensibility, and long-range code impacts matter more.

What’s the central argument about why programming experience still matters?

Generated code can be produced faster, but that increases the need for experienced judgment. The builder emphasized that without programming experience, it’s easy to create code that works initially but becomes costly later—especially when refactoring is needed to address stress points and make software extensible. In their view, more powerful models raise the bar for controlling and evaluating the system-wide impact of changes.

What future comparisons or testing plans were mentioned?

The builder planned to test more extensive setups, including building out a local R1 setup with help from a friend, aiming for multiple GPUs (mentioned as “3 390s”), and using SSH access to a small GPU cluster. They also mentioned an invitation in Florida with many GPUs to run broader tests and assess how models behave at larger scale.

Review Questions

  1. How did the builder’s test design (a ~300-line feature island) limit conclusions about performance at larger scale?
  2. What specific usability issue in the voting mechanism required steering, and why does that matter for real applications?
  3. According to the builder, why does faster code generation increase the value of programming experience rather than reduce it?

Key Points

  1. o3-mini showed faster response times in this test, often landing in about 5–10 seconds versus ~30 seconds for o1 mini.
  2. A small Go + WebSockets + inline HTML/JavaScript plugin was built successfully, demonstrating end-to-end code generation capability.
  3. The generated code wasn’t considered “good code,” and the voting mechanism missed some basic usability/correctness fundamentals, requiring steering.
  4. Improvements felt incremental rather than a fundamental jump in practical programming effectiveness, especially for small isolated features.
  5. The test size (~300 lines) wasn’t enough to predict behavior for much larger components (10,000–20,000 lines) where extensibility and architecture dominate.
  6. Programming experience remains essential because generated code still needs refactoring, risk assessment, and system-wide impact control.
  7. More extensive hardware and model comparisons were planned, including additional GPU resources and testing with R1.

Highlights

o3-mini’s standout advantage in this hands-on run was speed: many responses arrived in 5–10 seconds instead of the ~30 seconds typical of o1 mini.
Even with working output, the builder found gaps in usability—especially in a multi-vote mechanism where correct placement of updates required extra steering.
The builder’s main caution: small prototypes can’t prove large-scale architectural gains, so hype should be tempered until bigger systems are tested.
