
A realistic comparison of Opus and Codex

Theo - t3.gg · 6 min read

Based on Theo - t3.gg's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

Codex 5.3 is the default choice for correctness-heavy work like migrations, PR reviews, and security audits, because it tends to avoid missing key details and handles blockers more directly.

Briefing

Codex 5.3 comes out ahead for day-to-day software work—especially when tasks involve real-world complexity like migrations, PR reviews, and “make it correct” engineering. The tradeoff is speed and vibe: Opus 4.6 often gets to a working UI or a usable first draft faster, but it more frequently leaves behind sloppy details, misses edge cases, or introduces security and correctness issues that later require cleanup.

Pricing and access set the stage for why this comparison is messy. Opus 4.6 has published API rates ($5/M tokens in, $25/M tokens out, with a fast mode the creator pegs at a 2–3x price multiplier and up to 6x the overall expense). Codex 5.3 isn’t broadly available over the API yet, which limits direct benchmarking; the best guess is that its pricing will resemble earlier Codex tiers (e.g., Codex 5.2’s $1.75/M in and $14/M out). Even so, the creator’s practical takeaway is that Codex tends to be cheaper per token, while Opus can look better per “run” when Codex burns more tokens reasoning through correctness. On subscriptions, Codex also appears more generous with usage quotas: the creator reports heavy Codex usage while staying far from the limits.
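
The per-token vs. per-run distinction is easy to see with a small cost calculation. This sketch uses the rates quoted above; the Codex 5.3 rates are the video's guess (Codex 5.2's tiers), and the token counts are hypothetical numbers chosen only to illustrate the effect.

```typescript
// USD per million tokens; Codex rates are an assumption borrowed from Codex 5.2.
type Rates = { inPerM: number; outPerM: number };

const opus: Rates = { inPerM: 5, outPerM: 25 };
const codexGuess: Rates = { inPerM: 1.75, outPerM: 14 };

function runCost(r: Rates, tokensIn: number, tokensOut: number): number {
  return (tokensIn / 1e6) * r.inPerM + (tokensOut / 1e6) * r.outPerM;
}

// Hypothetical run: Codex emits 4x the output tokens reasoning through
// correctness, so its cheaper per-token rate can still lose per run.
const opusCost = runCost(opus, 50_000, 10_000);        // 0.25 + 0.25  = 0.50
const codexCost = runCost(codexGuess, 50_000, 40_000); // 0.0875 + 0.56 = 0.6475
console.log({ opusCost, codexCost });
```

With equal token counts Codex would be cheaper on both legs; the crossover only happens when its reasoning output grows enough to outweigh the lower rates.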

On “intelligence” and hard problem solving, Codex repeatedly wins—sometimes by a lot. In a difficult migration of an old codebase (Round/ping.gg, built on an early T3 stack), Codex 5.3 was the first model to succeed. Its approach: bump what it needs, patch temporarily to unblock, then remove patches once the dependency chain is stabilized—an iterative strategy that avoids the cascade failure pattern where a model upgrades everything and breaks the rest.

Opus’s strengths show up when the goal is to unblock quickly or produce front-end work that looks good. The creator describes Opus as a faster “get it working” partner, particularly for UI design and for “computer-adjacent” tasks like editing dotfiles, SSH-ing to machines, and configuring systems. Opus also sometimes succeeds where Codex fails—such as getting a migration to complete without triggering the kind of deep “fix everything” loop that can trap thorough models.

The biggest fault line is diligence versus shortcuts. Codex is portrayed as “measure twice, cut once”: it tends to notice missing details, handle blockers directly, and avoid leaving insecure or inconsistent code behind. Opus is portrayed as “measure less, ship sooner”: it may ignore blockers by trimming scope, and it can produce working code that later turns out to be wrong or insecure. The creator cites examples involving environment variable handling, database schema/type safety gaps, and even a security-relevant bug where Opus made user association nullable in image generation.

Finally, the comparison isn’t just about model IQ—it’s about platform behavior and trust. The creator prefers Codex for codebase safety and security work, but prefers Opus for the day-to-day experience: faster iteration, more pleasant interaction, and better front-end polish. They also criticize harness quirks (especially Claude Code) and note that Codex’s thoroughness can sometimes become counterproductive in long-running tasks.

Bottom line: if forced to pick one model for serious engineering, Codex is the safer default. If the priority is speed, UI aesthetics, and a more enjoyable workflow, Opus remains compelling—often best used as a complementary tool rather than a replacement.

Cornell Notes

Codex 5.3 is favored for solving difficult, real engineering problems—especially migrations, code reviews, and “make it correct” tasks—because it handles blockers more directly and tends to avoid missing important details. Opus 4.6 is often faster at getting something working and is particularly strong for front-end design and system-adjacent tasks like configuring machines and editing local files. The creator’s key pattern is diligence vs. shortcuts: Codex “measures twice, cuts once,” while Opus may ship sooner but leave cleanup work (including occasional security or correctness issues). Pricing is complicated by Codex 5.3’s limited API availability, but subscription usage and token behavior suggest Codex can be more cost-effective per token and more generous in quotas. The practical recommendation is to default to Codex for codebase-critical work and use Opus when speed and UI polish matter most.

Why does Codex 5.3 repeatedly outperform Opus 4.6 on “hard problems” like migrations?

Codex’s blocker-handling strategy is described as iterative and surgical: it bumps the first dependencies it needs, applies temporary patches to keep the build moving when breakage occurs, and then removes those patches once the chain is stabilized. In the Round/ping.gg migration (an old T3 stack repo with tightly coupled dependencies), Codex 5.3 was the first model to succeed, patching multiple packages along the way instead of triggering a cascade where upgrading React forces TRPC, which forces React Query, and so on. The result is a PR the creator expects to merge after review, despite the migration being ~12,000 lines of work that “nobody wants to do.”
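
One way to picture the “patch temporarily, then remove” move is pinning a transitive dependency via npm’s `overrides` field during the upgrade, then dropping the pin once the chain is consistent. This is a sketch of the idea, not the actual migration: the function names and package versions are illustrative.

```typescript
// Pure functions over a parsed package.json; names and versions are hypothetical.
type PackageJson = {
  dependencies?: Record<string, string>;
  overrides?: Record<string, string>;
};

function addTemporaryOverride(pkg: PackageJson, name: string, version: string): PackageJson {
  // Pin one package so the rest of the dependency bump can proceed.
  return { ...pkg, overrides: { ...(pkg.overrides ?? {}), [name]: version } };
}

function removeTemporaryOverride(pkg: PackageJson, name: string): PackageJson {
  // Drop the pin once the surrounding upgrades have stabilized.
  const { [name]: _removed, ...rest } = pkg.overrides ?? {};
  const next: PackageJson = { ...pkg, overrides: rest };
  if (Object.keys(rest).length === 0) delete next.overrides;
  return next;
}

// Unblock a React bump by pinning react-query, then clean up afterwards.
let pkg: PackageJson = { dependencies: { react: "^19.0.0" } };
pkg = addTemporaryOverride(pkg, "@tanstack/react-query", "4.36.1");
// ...build, fix callers, bump the real dependency...
pkg = removeTemporaryOverride(pkg, "@tanstack/react-query");
console.log(pkg); // manifest is back to having no overrides entry
```

The point of the pattern is that the patch is scaffolding: it exists only to break the upgrade cascade, and the final PR should not contain it.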

What specific kinds of mistakes does Opus 4.6 make that create extra cleanup later?

Opus is portrayed as more likely to miss details and to treat blockers by trimming scope. The creator gives examples around environment variables and type safety: Opus can claim everything is fine while failing to set the required environment variables in the correct place (e.g., Convex), and it can introduce security-relevant schema issues such as making user association nullable in image generation. Even when Opus produces a working system faster, the creator often has to run audits or follow-up fixes to correct those gaps.
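
The nullable-user-association bug can be illustrated with a hypothetical schema type (not the actual codebase): once the owner field is nullable, orphaned rows can exist, and a naive equality-based ownership check behaves unexpectedly on them.

```typescript
// Hypothetical types for illustration only.
type ImageRowLoose = { id: string; userId: string | null }; // the shipped shape
type ImageRowStrict = { id: string; userId: string };       // the intended shape

function canAccess(row: ImageRowLoose, requesterId: string | null): boolean {
  // Bug: when both sides are null, the "ownership" check passes.
  return row.userId === requesterId;
}

const orphaned: ImageRowLoose = { id: "img_1", userId: null };
console.log(canAccess(orphaned, null)); // true: an anonymous caller "owns" the orphan
```

With the strict shape (`userId: string`), the orphaned row cannot be constructed at all, which is why the creator frames this as a type-safety gap with security consequences rather than a mere style issue.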

How do the models differ in front-end work and UI iteration?

Opus is described as stronger at front-end design and UI code. A common workflow is to have Codex implement the core functionality first, then have Opus “fix the UI,” or to have Opus mock up a UI and then have Codex implement the full behavior. The creator also notes that Opus tends to produce visually polished interfaces quickly, while Codex focuses more on correctness and completeness, sometimes at the cost of speed or getting stuck in over-fixing loops.

What does the creator mean by “diligence can hurt” with Codex?

Codex’s thoroughness can become counterproductive in long-running tasks. In a long-running Cursor session migrating T3 Chat to AISDK v6, Codex 5.3 reportedly ran for 20+ hours and added ~85,000 lines, most of it tests; the creator suspected it had gotten trapped in a “write every test” loop. Opus, by contrast, completed a similar task in minutes (8 minutes 16 seconds) and was “mostly working,” illustrating that Codex’s carefulness can delay shipping when the task’s success criteria aren’t aligned with exhaustive correctness.

How does the creator decide which model to use in practice?

The creator’s rule-of-thumb is role-based. Codex is the default for codebase-critical work: PR reviews, security audits, and large overhauls where missing details is costly. Opus is preferred for speed and interaction quality—especially for configuring local machines and doing terminal/network tasks. They also describe a practical “handoff” pattern: use Opus to unblock Codex when Codex gets trapped, then let Codex clean up and finish correctly.

Why does the comparison include platform and harness issues, not just model quality?

The creator argues that harness behavior affects real productivity and trust. They criticize Claude Code for issues like submitting messages before images finish uploading, paste reliability problems, and inconsistent compaction behavior. They contrast this with the Codex CLI/desktop experience, which is easier to steer and interrupt, making it more reliable to correct course mid-task. This matters because even a strong model can feel unusable if the tooling loses context or fails to apply changes predictably.

Review Questions

  1. In the Round/ping.gg migration example, what specific technique did Codex use to avoid a dependency upgrade cascade?
  2. Give one example of an Opus failure mode that required follow-up cleanup, and explain why it mattered (correctness vs. security vs. type safety).
  3. When does Codex’s thoroughness become a liability, and what symptom did the creator observe in the long-running AISDK v6 migration?

Key Points

  1. Codex 5.3 is the default choice for correctness-heavy work like migrations, PR reviews, and security audits, because it tends to avoid missing key details and handles blockers more directly.
  2. Opus 4.6 often reaches a usable result faster and is especially strong for front-end design and UI polish, but it more frequently leaves behind issues that require cleanup.
  3. Pricing comparisons are complicated by Codex 5.3’s limited API availability; practical subscription usage and token behavior become the main evidence for cost and quota differences.
  4. Codex’s diligence can sometimes turn into over-fixing or runaway thoroughness in long-running tasks, while Opus may “ship sooner” by trimming scope or skipping certain checks.
  5. A productive workflow is often complementary: use Opus to unblock when Codex gets stuck, then use Codex to finish correctly and comprehensively.
  6. Harness/tooling reliability (e.g., Claude Code vs. Codex CLI/desktop) materially affects trust and day-to-day productivity, not just model capability.
  7. If forced to pick one model for serious engineering, Codex is recommended; if the priority is speed and a more pleasant interaction for iterative work, Opus is a strong alternative.

Highlights

Codex 5.3 was the first model to successfully complete a difficult ~12,000-line migration of an old Round/ping.gg codebase by using temporary patches to unblock dependency chains, then removing them as progress stabilized.
Opus 4.6 can produce working code quickly but may introduce security-relevant or type-safety gaps—such as making user association nullable in image generation—requiring later audits and fixes.
Codex’s thoroughness can backfire in long-running tasks: a long-running Cursor AISDK v6 migration reportedly produced ~85,000 lines, mostly tests, suggesting a “write every test” loop.
The creator’s practical split: Codex for codebase-critical engineering and security; Opus for front-end UI work and system/terminal-adjacent tasks where speed and iteration matter.

Topics

Mentioned

  • Theo - t3.gg
  • API
  • UI
  • PII
  • SQL
  • TRPC
  • PR
  • CLI
  • SSH
  • AI
  • AISDK
  • T3
  • T3 Chat
  • T3 Canvas
  • Convex