Shipping with Codex

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Codex’s agent stack was overhauled with GPT-5 Codex (optimized for Codex work) and a rewritten tool harness supporting planning, MCP, and auto context compaction.

Briefing

Codex has shifted from a “write code” assistant into a full agentic software engineer that can plan, act across tools, and verify its own work—everywhere developers build. OpenAI says the biggest change is a complete overhaul of Codex’s underlying agent: a new GPT-5 Codex reasoning model optimized for coding inside Codex, paired with a rewritten tool harness that adds planning support, MCP integration, and features like auto context compaction for longer, more complex work sessions. The result is an agent that behaves more like a senior engineer—following code style more closely, spending time thinking when needed, pushing back on bad ideas, and producing fewer “nice-sounding” but wrong suggestions.

That engine is now available across environments: IDEs, terminals, GitHub, web, and mobile, with the same agent “under the hood” regardless of where it’s invoked. The CLI also got a major usability reset after early feedback—simplified approval modes, a clearer UI, and safer-by-default behavior via sandboxing while still keeping user control. OpenAI then addressed a key workflow gap: developers wanted to collaborate with the agent while simultaneously viewing and editing code. Codex moved into the IDE as a native extension (including VS Code and Cursor forks), bundling the same open-source harness that powers the CLI.

On the infrastructure side, Codex Cloud was upgraded to run many more tasks in parallel and to make longer workflows practical. Cloud tasks can automatically set up dependencies and verify outputs by taking screenshots—an approach OpenAI describes as “magical” when it works, because it gives the agent a way to prove what it changed. The agent’s reach is also expanding into collaboration tools like GitHub and Slack, where it can ingest context from threads and return solutions with summaries.
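
The talk doesn’t show how those screenshots are captured. As a rough illustration of screenshot-based verification, the sketch below uses the Python playwright package (pip install playwright, then playwright install chromium) to load a locally running app and save a full-page screenshot an agent could inspect; the URL and output path are placeholder assumptions, not details from the talk.

```python
# Illustrative sketch: capture a screenshot of a local app so an agent can
# verify what a change actually rendered. The URL and file name are
# placeholders, not values from the talk.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # app under test (placeholder)
    page.screenshot(path="after_change.png", full_page=True)
    browser.close()
```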

A major bottleneck now sits after code generation: review. As teams ship faster, there is more code to validate, and OpenAI says earlier review attempts were too noisy. The fix is a dedicated, ultra-thorough code review capability: GPT-5 Codex trained to inspect dependencies and code deeply inside a container, exploring how intent maps to implementation. OpenAI claims many teams enable it by default and even consider making it mandatory, with options to trigger a review during pairing or to automate it on every GitHub pull request.

The talk also tied these capabilities to measurable internal impact: OpenAI reports that 92% of technical staff use Codex daily (up from about 50% last July), engineers using Codex submit 70% more PRs per week, and “pretty much all” PRs get reviewed by Codex. Bugs are reportedly caught earlier, and teams respond positively when issues are surfaced.

Real workflows illustrated how verification loops scale. On iOS, Nacho Sto described using Codex to implement UI from a mockup, then verify correctness with test-driven development plus multimodal checks—generating SwiftUI preview snapshots and using screenshots to confirm pixel-level UI behavior. Fel showed how long-running sessions can be managed with structured planning: Codex produces a living plans.md design document, iterates through spikes and implementation, runs extensive property tests and fuzzing, and ultimately produces a pull request with thousands of lines of code after sustained work. Daniel then demonstrated local and GitHub code review loops using slash commands, including a separate review thread to reduce bias and a workflow that iterates review → fix → re-review until the PR earns final approval.

Overall, Codex’s direction is clear: faster shipping with higher confidence, achieved by combining agentic planning, tool execution, and rigorous self-verification—then wrapping it in developer-friendly interfaces and automation.

Cornell Notes

Codex is being positioned as an AI “software engineer” that can plan, modify code across tools, and verify results, turning it from a coding helper into an agentic workflow. OpenAI credits the shift to a revamped agent stack: a GPT-5 Codex model optimized for Codex work plus a rewritten tool harness with planning, MCP support, and context management for long sessions. Verification is central: GPT-5 Codex is trained for ultra-thorough code review inside containers, and teams can run it during pairing or automatically on every GitHub PR. Internal metrics claim broad adoption and faster throughput, while demos show practical loops for UI correctness (SwiftUI preview snapshots) and long refactors (a living plans.md with extensive tests).

What changed in Codex’s “agent” design, and why does it matter for real engineering work?

OpenAI describes the agent as two parts: a reasoning model and a tool harness. The reasoning side moved from the GPT-5 model shipped in August to GPT-5 Codex, optimized for work inside Codex, aiming at better code-style adherence, more appropriate thinking time, and behavior that feels closer to a senior engineer (fewer compliments, more pushback on bad ideas). The tool harness was also rewritten to support planning, MCP, and features like auto context compaction, enabling longer multi-step interactions where the agent can coordinate actions and keep track of what it is doing.
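
OpenAI doesn’t detail how auto context compaction works. A conceptual sketch, assuming the common pattern of collapsing older turns into a summary once a token budget is exceeded, might look like this; the summarize function is a hypothetical stand-in for a model call:

```python
# Conceptual sketch of auto context compaction: once the conversation history
# exceeds a token budget, older turns are collapsed into a single summary turn
# so long sessions keep fitting in the model's context window.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str

def estimate_tokens(turns: list[Turn]) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return sum(len(t.content) for t in turns) // 4

def summarize(turns: list[Turn]) -> str:
    # Hypothetical: a real harness would make a model call here to compress
    # the old turns into a short natural-language summary.
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: list[Turn], budget: int, keep_recent: int = 8) -> list[Turn]:
    # Keep the most recent turns verbatim; fold everything older into a summary.
    if estimate_tokens(history) <= budget or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [Turn("assistant", summarize(old))] + recent
```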

How does Codex’s availability across environments change day-to-day usage?

Codex is described as working everywhere developers build: IDEs, terminal, GitHub, web, and mobile, using the same agent under the hood. The CLI was revamped after early feedback (simplified approval modes, more legible UI, and sandboxing by default). For collaboration workflows, Codex is also delivered as a native IDE extension (works with VS Code and Cursor forks), bundling the same open-source harness that powers the CLI so users can see and control changes in their editor.

What does “verification” look like beyond unit tests in the iOS workflow?

Nacho Sto emphasized that Codex can verify visually, not just logically. The workflow uses SwiftUI preview extraction: a Makefile target runs unit tests that render SwiftUI previews, a small Python script (written by Codex) collects the resulting images into a folder, and Codex is configured to use those snapshots to validate UI changes. The loop runs until tests pass and the UI is “pixel perfect,” and the approach is positioned as scalable to larger projects and adaptable to web tooling like Storybook or Playwright.
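
The extraction script itself isn’t shown in the talk. A minimal sketch, assuming the snapshot images land as PNGs under a build directory, might look like this (both paths are hypothetical):

```python
# Hypothetical sketch of the "extract preview images into a folder" step: the
# talk mentions a small Python script written by Codex but doesn't show it.
import shutil
from pathlib import Path

SNAPSHOT_SRC = Path("DerivedData")  # where the test run writes images (assumed)
SNAPSHOT_DST = Path("snapshots")    # folder Codex is pointed at (assumed)

SNAPSHOT_DST.mkdir(exist_ok=True)
for png in SNAPSHOT_SRC.rglob("*.png"):
    shutil.copy(png, SNAPSHOT_DST / png.name)
    print(f"extracted {png.name}")
```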

How can Codex handle long, complex refactors without losing coherence?

Fel described a sustained workflow built on structured planning artifacts. Codex is prompted to write a spec and then a living plans.md, a “design document for design documents” that includes big-picture goals, a to-do list, progress updates, surprises and discoveries, and a decision log. The model is anchored to a unique term (“exec plan” / plans.md) so it knows to reflect back and update the plan as work proceeds. During execution, the user monitors tests; if the test signal stays red too long, they can intervene. The demo referenced a JSON parser PR with over 15,000 lines of change produced over many hours of agent work, validated with property tests, exhaustive property tests, and fuzzing.
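
The test suite itself isn’t shown. As a sketch of the kind of property test such a workflow leans on, the example below uses the third-party hypothesis library to round-trip randomly generated JSON values, with Python’s standard json module standing in for the parser under development:

```python
# Minimal property test sketch (pip install hypothesis pytest): serializing
# and re-parsing any JSON-compatible value must reproduce the original value.
# The stdlib json module stands in for the parser being built in the demo.
import json
from hypothesis import given, strategies as st

# Recursively generate arbitrary JSON-compatible values.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False) | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=25,
)

@given(json_values)
def test_round_trip(value):
    assert json.loads(json.dumps(value)) == value
```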

Why is code review treated as a first-class feature, and how does the review loop work?

OpenAI frames code review as a new bottleneck: faster code generation creates more code to validate, and earlier code-review attempts were noisy enough that some were disabled due to low signal. The solution is training GPT-5 Codex for ultra-thorough code review: it inspects dependencies and code deeply in a container, exploring how intent maps to implementation, then returns high-quality findings aimed at critical issues rather than dozens of minor notes. Teams can enable it by default, trigger it during pairing, or automate it on every GitHub PR. Daniel demonstrated local review via slash commands (e.g., /review), including reviewing against a base branch and running a separate review thread to reduce implementation bias, followed by iterating fix → re-review until approval.
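
Scripted non-interactively, that loop might look like the sketch below. It assumes the Codex CLI’s exec subcommand for one-shot runs; the prompts and the “APPROVED” convention are illustrative assumptions, not the workflow shown in the demo:

```python
# Hypothetical automation of the review -> fix -> re-review loop. Assumes the
# Codex CLI's non-interactive `codex exec` subcommand; prompts and the
# "APPROVED" convention are illustrative, not taken from the demo.
import subprocess

def codex(prompt: str) -> str:
    # Run one non-interactive Codex turn and capture its output.
    result = subprocess.run(
        ["codex", "exec", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout

for attempt in range(5):  # cap iterations so the loop always terminates
    review = codex("Review the diff against the base branch. "
                   "Reply APPROVED if there are no critical issues.")
    if "APPROVED" in review:
        print("Review passed.")
        break
    codex("Fix the issues raised in this review:\n" + review)
```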

What adoption and productivity signals did OpenAI report internally?

OpenAI reported that 92% of technical staff use Codex daily, up from about 50% around last July. Engineers using Codex submit 70% more PRs per week, and “pretty much all” PRs are reviewed by Codex. The claimed effect is earlier bug detection, more confidence at release time, and positive reactions when Codex finds issues.

Review Questions

  1. How do the reasoning model and tool harness changes work together to enable longer, more reliable agent sessions?
  2. What mechanisms in the iOS workflow provide visual verification, and how are they integrated into the test loop?
  3. Describe the role of plans.md in managing long refactors—what information does it contain and how does it keep the agent aligned?

Key Points

  1. Codex’s agent stack was overhauled with GPT-5 Codex (optimized for Codex work) and a rewritten tool harness supporting planning, MCP, and auto context compaction.

  2. Codex is now available across IDEs, terminals, GitHub, web, and mobile using the same underlying agent, with sandboxing enabled by default in the CLI.

  3. IDE integration matters: Codex ships as a native extension (VS Code and Cursor forks) so developers can collaborate with the agent while viewing code in place.

  4. Codex Cloud scales execution by running many tasks in parallel and enabling automated verification via dependency setup and screenshot-based checks.

  5. Ultra-thorough code review is treated as a core capability: GPT-5 Codex reviews deeply inside containers and can run during pairing or automatically on every GitHub PR.

  6. Internal adoption is reported as broad (92% daily usage) with productivity gains (70% more PRs per week) and near-universal Codex review coverage.

  7. Verification loops can be extended from unit tests to multimodal UI checks (screenshots) and to long-running refactors using a living plans.md plus extensive testing (including fuzzing).

Highlights

GPT-5 Codex is positioned as a senior-engineer-like coder: better code-style adherence, more appropriate thinking time, and pushback on bad ideas.
Codex’s verification expands beyond tests—iOS UI work can be validated through screenshot snapshots derived from SwiftUI previews.
Long refactors are managed with a living plans.md “exec plan” that tracks progress, decisions, and surprises while the agent iterates and tests for hours.
Code review is automated and deep: GPT-5 Codex is trained to find high-signal, critical issues by exploring dependencies and implementation details inside containers.

Topics

  • Codex Agent
  • GPT-5 Codex
  • Tool Harness
  • Code Review
  • UI Verification

Mentioned

  • Nacho Sto
  • Fel
  • Daniel
  • IDE
  • CLI
  • MCP
  • TDD
  • PR
  • VS Code
  • GPT-5
  • UI
  • JSON