Shipping with Codex
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
Codex has shifted from a “write code” assistant into a full agentic software engineer that can plan, act across tools, and verify its own work—everywhere developers build. OpenAI says the biggest change is a complete overhaul of Codex’s underlying agent: a new GPT-5 Codex reasoning model optimized for coding inside Codex, paired with a rewritten tool harness that adds planning support, MCP integration, and features like auto context compaction for longer, more complex work sessions. The result is an agent that behaves more like a senior engineer—following code style more closely, spending time thinking when needed, pushing back on bad ideas, and producing fewer “nice-sounding” but wrong suggestions.
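The talk doesn't show how auto context compaction is implemented, but the general technique is straightforward: when a session's history exceeds a token budget, fold the oldest turns into a summary turn so recent context survives intact. A minimal sketch of that idea (everything here is an assumption, not Codex's actual harness: tokens are approximated by word counts and the summarizer is a naive stand-in for a model call):

```python
# Hypothetical sketch of auto context compaction for long agent sessions.
# Word counts stand in for real token counting, and summarize() stands in
# for an actual model-generated summary.

def estimate_tokens(messages):
    """Rough token estimate: one word ~ one token."""
    return sum(len(m.split()) for m in messages)

def summarize(messages):
    """Stand-in summarizer; a real harness would call a model here."""
    return "SUMMARY: " + " | ".join(m[:20] for m in messages)

def compact(messages, budget, keep_recent=2):
    """If over budget, fold everything but the most recent turns into a summary."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(head)] + tail
```

The design choice worth noting is `keep_recent`: the latest turns are preserved verbatim because they carry the active task state, while older turns are safe to lossy-compress.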
That engine is now available across environments: IDEs, terminals, GitHub, web, and mobile, with the same agent “under the hood” regardless of where it’s invoked. The CLI also got a major usability reset after early feedback—simplified approval modes, a clearer UI, and safer-by-default behavior via sandboxing while still keeping user control. OpenAI then addressed a key workflow gap: developers wanted to collaborate with the agent while simultaneously viewing and editing code. Codex moved into the IDE as a native extension (including VS Code and Cursor forks), bundling the same open-source harness that powers the CLI.
On the infrastructure side, Codex Cloud was upgraded to run many more tasks in parallel and to make longer workflows practical. Cloud tasks can automatically set up dependencies and verify outputs by taking screenshots—an approach OpenAI describes as “magical” when it works, because it gives the agent a way to prove what it changed. The agent’s reach is also expanding into collaboration tools like GitHub and Slack, where it can ingest context from threads and return solutions with summaries.
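OpenAI doesn't detail how the screenshot check works, but the underlying verification idea is simple: capture an image of the result, compare it against a known-good baseline, and fail if too many pixels differ. A minimal sketch, with images modeled as 2-D lists of RGB tuples (the data model and threshold are illustrative assumptions; a real pipeline would capture actual screenshots):

```python
# Hypothetical sketch of screenshot-based verification: compare a captured
# frame against a known-good baseline and flag the change if the pixel
# difference exceeds a small tolerance. Frames are 2-D lists of (R, G, B)
# tuples for illustration.

def diff_ratio(baseline, current):
    """Fraction of pixels that differ between two equally sized frames."""
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, current)
        for px_a, px_b in zip(row_a, row_b)
        if px_a != px_b
    )
    return changed / total

def screenshots_match(baseline, current, max_diff_ratio=0.001):
    """Verification passes when dimensions match and the pixel diff is tiny."""
    sizes_equal = [len(r) for r in baseline] == [len(r) for r in current]
    return sizes_equal and diff_ratio(baseline, current) <= max_diff_ratio
```

A small tolerance rather than exact equality is the usual choice, since anti-aliasing and font rendering can shift a handful of pixels between otherwise identical renders.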
A major bottleneck now sits after code generation: review. OpenAI says review has become the limiting step as teams ship faster, and earlier automated-review attempts were too noisy. The fix is a dedicated, ultra-thorough code review capability: GPT-5 Codex trained to inspect code and its dependencies deeply inside a container, exploring how intent maps to implementation. OpenAI claims many teams enable it by default and even consider making it mandatory, with options to trigger it during pairing or to run it automatically on every GitHub pull request.
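A review gate of this kind boils down to a control loop: run the reviewer, apply fixes, and re-review until no findings remain or a retry budget runs out. A minimal sketch of that loop, where `review` and `apply_fixes` are toy stand-ins (not the Codex API) so the control flow is runnable:

```python
# Hypothetical sketch of an automated review gate: iterate
# review -> fix -> re-review until the reviewer reports no findings,
# or escalate after a fixed number of rounds. The reviewer and fixer
# below are toy stand-ins for agent calls.

def review(issues):
    """Stand-in reviewer: returns the findings it can still see."""
    return list(issues)

def apply_fixes(issues):
    """Stand-in fixer: resolves one finding per pass."""
    return issues[1:]

def review_until_clean(issues, max_rounds=10):
    """Return True once the PR comes back with no findings (final approval)."""
    for _ in range(max_rounds):
        findings = review(issues)
        if not findings:
            return True
        issues = apply_fixes(findings)
    return False  # budget exhausted: hand off to a human reviewer
```

The bounded `max_rounds` matters in practice: without it, a noisy reviewer and an imperfect fixer can ping-pong indefinitely.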
The talk also tied these capabilities to measurable internal impact: OpenAI reports that 92% of technical staff use Codex daily (up from about 50% last July), engineers using Codex submit 70% more PRs per week, and “pretty much all” PRs get reviewed by Codex. Bugs are reportedly caught earlier, and teams respond positively when issues are surfaced.
Real workflows illustrated how verification loops scale. On iOS, Nacho Sto described using Codex to implement UI from a mockup, then verify correctness with test-driven development plus multimodal checks—generating SwiftUI preview snapshots and using screenshots to confirm pixel-level UI behavior. Fel showed how long-running sessions can be managed with structured planning: Codex produces a living plans.md design document, iterates through spikes and implementation, runs extensive property tests and fuzzing, and ultimately produces a pull request with thousands of lines of code after sustained work. Daniel then demonstrated local and GitHub code review loops using slash commands, including a separate review thread to reduce bias and a workflow that iterates review → fix → re-review until the PR earns final approval.
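The talk doesn't show the contents of the living plans.md, but the pattern implies a document the agent rereads and updates each session: goal, phased status, verification strategy, and open questions. A hypothetical sketch of that structure (all headings and details are illustrative, not taken from the demo):

```markdown
# Refactor: storage layer (living plan, hypothetical structure)

## Goal
Replace the ad-hoc cache with a transactional store without changing public APIs.

## Status
- [x] Spike: benchmark candidate stores
- [x] Phase 1: introduce a storage interface behind a feature flag
- [ ] Phase 2: migrate call sites
- [ ] Phase 3: remove the legacy cache

## Verification
- Property tests over read/write round-trips
- Fuzzing on the serialization boundary

## Open questions
- Migration order for call sites with mixed read/write patterns
```

Keeping state in a checked-in document rather than in conversation history is what lets a multi-session refactor survive context resets: the plan, not the transcript, is the source of truth.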
Overall, Codex’s direction is clear: faster shipping with higher confidence, achieved by combining agentic planning, tool execution, and rigorous self-verification—then wrapping it in developer-friendly interfaces and automation.
Cornell Notes
Codex is being positioned as an AI “software engineer” that can plan, modify code across tools, and verify results—turning it from a coding helper into an agentic workflow. OpenAI credits the shift to a revamped agent stack: a GPT-5 Codex model optimized for Codex work plus a rewritten tool harness with planning, MCP support, and context management for long sessions. Verification is central: GPT-5 Codex is trained for ultra-thorough code review inside containers, and teams can run it during pairing or automatically on every GitHub PR. Internal metrics claim broad adoption and faster throughput, while demos show practical loops for UI correctness (snapshot screenshots) and long refactors (living plans.md with extensive tests).
What changed in Codex’s “agent” design, and why does it matter for real engineering work?
How does Codex’s availability across environments change day-to-day usage?
What does “verification” look like beyond unit tests in the iOS workflow?
How can Codex handle long, complex refactors without losing coherence?
Why is code review treated as a first-class feature, and how does the review loop work?
What adoption and productivity signals did OpenAI report internally?
Review Questions
- How do the reasoning model and tool harness changes work together to enable longer, more reliable agent sessions?
- What mechanisms in the iOS workflow provide visual verification, and how are they integrated into the test loop?
- Describe the role of plans.md in managing long refactors—what information does it contain and how does it keep the agent aligned?
Key Points
1. Codex’s agent stack was overhauled with GPT-5 Codex (optimized for Codex work) and a rewritten tool harness supporting planning, MCP, and auto context compaction.
2. Codex is now available across IDEs, terminals, GitHub, web, and mobile using the same underlying agent, with sandboxing enabled by default in the CLI.
3. IDE integration matters: Codex ships as a native extension (VS Code and Cursor forks) so developers can collaborate with the agent while viewing code in place.
4. Codex Cloud scales execution by running many tasks in parallel and enabling automated verification via dependency setup and screenshot-based checks.
5. Ultra-thorough code review is treated as a core capability: GPT-5 Codex reviews deeply inside containers and can run during pairing or automatically on every GitHub PR.
6. Internal adoption is reported as broad (92% daily usage) with productivity gains (70% more PRs per week) and near-universal Codex review coverage.
7. Verification loops can be extended from unit tests to multimodal UI checks (screenshots) and to long-running refactors using living plans.md plus extensive testing (including fuzzing).