I just tried o3-mini
Based on The PrimeTime's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
o3-mini delivers a noticeable speed boost for coding tasks, often returning responses in roughly 5 to 10 seconds versus around 30 seconds with o1 mini, but it doesn't amount to a fundamental leap in programming capability. Over roughly an hour and a half, the builder used it to put together a small Go-based plugin that counts “ones” and “twos” in chat and displays the results on a web page via WebSockets, and the model produced working code quickly. The output wasn’t “good code” by the builder’s standards, yet it was impressive that the model could generate an end-to-end feature set from scratch.
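To ground the description, here is a minimal sketch of what such a plugin could look like. This is not the builder's code: the gorilla/websocket dependency, the /ws endpoint, and the plain-text message format are all assumptions, and the global tally deliberately mirrors the global-state style the briefing mentions.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"sync"

	"github.com/gorilla/websocket"
)

// Global tally, mirroring the generated code's reliance on global state.
var (
	mu    sync.Mutex
	tally = map[string]int{"one": 0, "two": 0}
)

var upgrader = websocket.Upgrader{
	// Allow all origins; acceptable only for a local prototype.
	CheckOrigin: func(r *http.Request) bool { return true },
}

// handleWS upgrades the HTTP connection to a WebSocket, counts each
// "one"/"two" message, and echoes the running tally back to the client.
func handleWS(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			return
		}
		vote := strings.TrimSpace(strings.ToLower(string(msg)))
		mu.Lock()
		if _, ok := tally[vote]; ok {
			tally[vote]++
		}
		out := fmt.Sprintf("one=%d two=%d", tally["one"], tally["two"])
		mu.Unlock()
		if err := conn.WriteMessage(websocket.TextMessage, []byte(out)); err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/ws", handleWS)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Even at this size, the sketch shows why the briefing calls the generated style questionable: shared mutable globals work for a 300-line prototype but become a liability as soon as the feature grows.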
Where the experience felt less revolutionary was in usability and correctness under real interaction constraints. The voting mechanism allowed a single user to cast multiple votes, and getting updates to land in the right place required careful handling. The builder found that the model missed some basic usability fundamentals, and the resulting implementation needed steering. The project also highlighted how much of the work still depends on solid engineering judgment: the system used Go for the backend, inline HTML and JavaScript for the frontend, and leaned on global variables and stringified JavaScript. Seeing the generated code made it clear that programming knowledge is still the difference between “getting something working” and building something maintainable.
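One common way to handle repeat votes is to record each user's latest vote and move it, rather than blindly incrementing a counter. The sketch below is an assumption-laden illustration, not the video's actual fix: the userID key and the castVote helper are hypothetical, since the original plugin's message format isn't shown.

```go
package main

import (
	"fmt"
	"sync"
)

// lastVote tracks each user's most recent vote so a repeated vote moves
// their tally instead of inflating it.
var (
	voteMu   sync.Mutex
	lastVote = map[string]string{} // userID -> "one" or "two"
	counts   = map[string]int{"one": 0, "two": 0}
)

func castVote(userID, vote string) {
	voteMu.Lock()
	defer voteMu.Unlock()
	if _, ok := counts[vote]; !ok {
		return // ignore anything that isn't a recognized option
	}
	if prev, voted := lastVote[userID]; voted {
		if prev == vote {
			return // same vote repeated: no change
		}
		counts[prev]-- // move the vote rather than double-counting it
	}
	lastVote[userID] = vote
	counts[vote]++
}

func main() {
	castVote("alice", "one")
	castVote("alice", "one") // repeat: ignored
	castVote("alice", "two") // change: moves the vote
	fmt.Println(counts)      // map[one:0 two:1]
}
```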
The builder’s broader takeaway is that hype cycles are likely overselling the practical impact. While o3-mini felt faster, there wasn’t a clear “new iteration” in day-to-day programming effectiveness. Comparisons to Claude (an experience the builder described as “pretty much the same”) reinforced the sense that improvements are incremental rather than transformative, especially for small, isolated “feature islands” like a 300-line plugin. Scaling to much larger systems, with features that run to 10,000–20,000 lines within a broader ecosystem, remains an open question, since the testbed was too small to judge how well the model handles complex, extensible architecture.
Despite the lack of a step-change, the builder argues that experience matters more, not less. Faster code generation increases the need to understand how changes affect the whole codebase, where stress points emerge, and how to refactor toward extensibility. Without that background, it’s easy to introduce expensive technical debt that only becomes obvious later. The practical conclusion: don’t assume the world is ending or that job loss is imminent purely because models generate code faster.
Looking ahead, the builder plans more testing with additional hardware and access to GPU resources, including SSH access to a small cluster and an invitation to Florida to evaluate more extensive setups. The builder also mentions that R1 is easier to use and that even a 32B variant feels better than o1 mini, suggesting the next round of comparisons will focus on real-world performance beyond a single small prototype.
Cornell Notes
o3-mini is faster for coding than o1 mini, with many responses landing in about 5–10 seconds instead of ~30 seconds. In a small Go + WebSockets + inline HTML/JavaScript plugin (about 300 lines), it generated working code quickly, but the output wasn’t “good code,” and it missed some usability fundamentals in a multi-vote mechanism. The experience suggests improvements are more incremental than transformative for practical, day-to-day programming—especially in small, isolated features. The builder argues that programming fundamentals and experience remain crucial because generated code still needs architectural judgment, refactoring, and risk awareness as projects scale. Future tests are planned on more extensive GPU setups and with other models like R1.
- What concrete speed difference did o3-mini show compared with o1 mini in this hands-on test?
- What kind of project did the builder use to evaluate o3-mini, and what stack did it involve?
- Where did o3-mini fall short in the builder’s view?
- Why does the builder think the results don’t prove a “fundamental shift” in programming ability?
- What’s the central argument about why programming experience still matters?
- What future comparisons or testing plans were mentioned?
Review Questions
- How did the builder’s test design (a ~300-line feature island) limit conclusions about performance at larger scale?
- What specific usability issue in the voting mechanism required steering, and why does that matter for real applications?
- According to the builder, why does faster code generation increase the value of programming experience rather than reduce it?
Key Points
1. o3-mini showed faster response times in this test, often landing in about 5–10 seconds versus ~30 seconds for o1 mini.
2. A small Go + WebSockets + inline HTML/JavaScript plugin was built successfully, demonstrating end-to-end code generation capability.
3. The generated code wasn’t considered “good code,” and the voting mechanism missed some basic usability/correctness fundamentals, requiring steering.
4. Improvements felt incremental rather than a fundamental jump in practical programming effectiveness, especially for small isolated features.
5. The test size (~300 lines) wasn’t enough to predict behavior for much larger components (10,000–20,000 lines) where extensibility and architecture dominate.
6. Programming experience remains essential because generated code still needs refactoring, risk assessment, and system-wide impact control.
7. More extensive hardware and model comparisons were planned, including additional GPU resources and testing with R1.