OpenAI just destroyed all coding apps - Codex
Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Codex is a cloud-based coding agent that can answer code questions, execute code, run linting/tests, and draft GitHub pull requests asynchronously.
Briefing
OpenAI’s Codex is positioned as a cloud-based coding agent that can take on real engineering tasks—answering code questions, running tests, and drafting pull requests—while working asynchronously in the background. The practical punchline from hands-on use is speed through delegation: instead of waiting on one change at a time, Codex can juggle many small-to-medium tasks concurrently, leaving developers to review and steer rather than manually implement every fix.
Access starts inside ChatGPT (via a Codex button that appears after upgrading a plan). Codex runs in OpenAI’s cloud, not locally, and it can execute code, run linting and tests, and draft GitHub pull requests. After enabling 2FA, users connect a GitHub account, choose either a personal account or an organization, and then select a repository. A “create environment” step provisions a virtual environment on OpenAI’s cloud so Codex can use the repo, run tests, and debug remotely—even from a phone.
Codex also supports workflow control: users can decide whether OpenAI can use their code to train and improve models, and then kick off tasks. OpenAI’s suggested starting tasks include explaining a codebase structure to a newcomer, finding and fixing a bug in a chosen area, and scanning a codebase for issues. In practice, Codex can run multiple tasks at once, and the interface shows how many tasks are actively being worked on.
Benchmark comparisons are modest on raw accuracy, but the claimed advantage is operational: codex-1 is reported as slightly more accurate than o3-high (75% vs. 70% on internal software engineering tasks). The bigger shift comes from the interface and autonomy—users can launch dozens of tasks and let Codex work while they do other things.
The transcript’s production-style tests focus on a real startup codebase used by over 50,000 users. Codex first generates a quick repository overview (fast enough to be useful for onboarding). Then it tackles a concrete search-ranking issue where results sometimes appear inverted. Codex proposes a client-side sorting change based on an assumed numeric rank field, adds a test, and runs linting—yet the test fails due to an environment/setup problem (a missing command in the OpenAI environment). A second iteration improves the prompt and, after adding an agents.md system prompt at the repo root (modeled after Cursor rules), Codex correctly identifies that the SQL function returns a computed rank field and produces a pull request clarifying how the rank is used.
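The transcript does not show the final diff, but the shape of the fix is easy to sketch. The snippet below is a minimal TypeScript illustration, assuming a hypothetical SearchResult type whose rank field is the score the SQL function computes; the type, field, and function names are invented for illustration and are not taken from the actual repository.

```typescript
// Hypothetical sketch (names invented): sort search results by the rank
// that the database-side SQL function already computed, best match first.
interface SearchResult {
  id: string;
  title: string;
  rank: number; // computed rank returned by the SQL function
}

export function sortByRank(results: SearchResult[]): SearchResult[] {
  // Copy before sorting so the caller's array is not mutated,
  // and sort descending so the highest-ranked result comes first.
  return [...results].sort((a, b) => b.rank - a.rank);
}
```

Sorting on the database-computed rank, rather than re-deriving relevance on the client, is the kind of small, reviewable change the video shows Codex proposing once the repo-level guidance made the rank field's origin explicit.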
Other tasks demonstrate dependency-aware edits: updating a chat input outline based on “agent mode” vs “chat mode” leads Codex to modify multiple TypeScript/TSX files so the new mode signal flows through the app. Even when linting fails, the changes are presented as structurally consistent, with the failure attributed again to environment limitations.
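The video does not show exactly which files were touched, but the pattern is a familiar one: a mode flag lives in shared state and is threaded down into the input component so its styling can react to it. Below is a minimal React/TSX sketch of that pattern; the ChatMode type, component, and prop names are assumptions for illustration, not identifiers from the real codebase.

```tsx
// Hypothetical sketch (names invented): thread a chat/agent mode flag
// into the chat input so its outline changes with the active mode.
import React from "react";

type ChatMode = "chat" | "agent";

interface ChatInputProps {
  mode: ChatMode;
  value: string;
  onChange: (next: string) => void;
}

export function ChatInput({ mode, value, onChange }: ChatInputProps) {
  // The outline color signals which mode the input is currently in.
  const outline = mode === "agent" ? "2px solid #7c3aed" : "1px solid #d1d5db";
  return (
    <textarea
      style={{ outline }}
      value={value}
      onChange={(e) => onChange(e.target.value)}
      placeholder={mode === "agent" ? "Describe a task for the agent..." : "Send a message..."}
    />
  );
}
```

Threading the flag through typed props keeps every consumer type-checked, which helps explain why the change spans several TypeScript/TSX files rather than just one.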
Codex’s “push” options let users create a new PR, create a draft PR, or copy a patch. The overall message is that coding is shifting toward task delegation: feed clear prompts, let Codex generate candidate changes asynchronously, and rely on human review for correctness—especially for anything beyond small-to-medium scope. Pricing is also part of the operational story: Codex is available for Pro ($200/month), Team (noted as $60/month per additional invite), and Enterprise, while Free and Plus users lack access as of the time of testing.
Cornell Notes
Codex is a cloud-based AI coding agent that can answer code questions, run tests, and draft GitHub pull requests asynchronously. In hands-on use, it handled multiple parallel tasks—like explaining a repo, investigating a search-ranking issue, and updating UI behavior—then produced PRs for human review. Reported benchmark accuracy for codex-1 is slightly higher than o3-high (75% vs. 70%), but the bigger advantage is workflow: launching many tasks at once and letting them run while developers focus elsewhere. Results depend heavily on prompt clarity and on providing a repo-level agents.md system prompt to guide agent behavior. Environment/setup issues can still cause lint/test failures, so review and iteration remain essential.
How does Codex connect to a real GitHub repo and create an execution environment?
What does “asynchronous” mean in Codex’s workflow, and why does it matter?
What role does agents.md play, and how did it change outcomes?
Why did some tests fail even when the code changes looked reasonable?
How does Codex handle multi-file, dependency-aware UI changes?
What mechanisms let users turn Codex output into GitHub changes?
Review Questions
- What steps are required to connect Codex to GitHub and provision a cloud environment for running tests?
- How does adding an agents.md file at the repo root affect Codex’s behavior on future tasks?
- Give one example from the transcript where Codex’s code changes were plausible but lint/test execution failed—what was the stated cause?
Key Points
1. Codex is a cloud-based coding agent that can answer code questions, execute code, run linting/tests, and draft GitHub pull requests asynchronously.
2. Access to Codex is tied to plan level in the transcript: Pro and Team (and Enterprise) have it, while Free and Plus do not.
3. A typical setup flow includes enabling 2FA, connecting GitHub, selecting a repository, and creating an OpenAI cloud environment for remote execution.
4. Codex’s practical advantage is parallel task execution—launch many small-to-medium engineering tasks and review completed work as it finishes.
5. Prompt clarity and repo-level guidance (agents.md) materially affect results; vague instructions led to an incorrect rank-field assumption, while agents.md improved accuracy.
6. Execution failures can stem from the OpenAI cloud environment/tooling setup (e.g., missing commands), so review includes checking both code logic and test/lint context.
7. Codex outputs can be pushed as PRs, draft PRs, or patches, making human review the final gate for correctness.