
GPT 5.4 Pro Is the STRONGEST AI Model I’ve Tested (But Costs a TON)

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

GPT 5.4 Pro is strongest in agentic, code-driven tasks that produce working software artifacts, not just text.

Briefing

GPT 5.4 Pro is being positioned as the strongest “agentic” AI model tested so far—capable of building and modifying real, playable software artifacts end-to-end—but it comes with a steep price tag that makes it a tool for only the hardest jobs. In hands-on demos, GPT 5.4 Pro repeatedly generated complex 3D/interactive outputs by writing code, running multi-step tasks, and iterating until the result works—most notably when it autonomously edited a Pokémon Red ROM to swap in AI “starter” characters and even produced a functioning Minecraft-like world from scratch.
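
To make "iterating until the result works" concrete, that loop can be pictured as a small harness: ask the model for code, run it, and feed any failure back for another attempt. The sketch below is illustrative only and assumes a hypothetical `ask_model` helper standing in for whichever provider API is used; it is not how OpenAI's own tooling orchestrates these runs.

```python
# Minimal sketch of a generate -> run -> inspect -> retry loop, the shape of the
# "agentic" execution described above. `ask_model` is a hypothetical stand-in,
# not a real OpenAI/Google/Anthropic function.
import subprocess
import sys
import tempfile

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model provider here")

def agentic_build(task: str, max_attempts: int = 5) -> str:
    prompt = f"Write a Python script that does the following:\n{task}"
    for _ in range(max_attempts):
        code = ask_model(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=300)
        if result.returncode == 0:
            return code  # the artifact runs cleanly; stop here
        # Otherwise feed the traceback back and ask for a corrected version.
        prompt = ("The script below failed. Fix it and return the full corrected "
                  f"script.\n\nERROR:\n{result.stderr}\n\nSCRIPT:\n{code}")
    raise RuntimeError("no working script after max_attempts")
```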

The most striking proof of autonomy came from a Pokémon Red ROM modification performed by GPT 5.4. The game booted normally, followed the usual start flow, and presented modded content directly in gameplay—down to selecting starters and battling with moves that reflected the chosen AI persona. The workflow wasn’t just text generation; it involved editing a real ROM and producing a playable artifact that launched like a standard game.
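
For a sense of what byte-level ROM editing involves, here is a minimal sketch of a patch script. The offsets and replacement IDs are placeholders, not real Pokémon Red addresses; only the header-checksum recalculation follows the actual Game Boy cartridge format (checksum byte at 0x014D computed over 0x0134–0x014C), which is what keeps a patched ROM booting normally.

```python
# Toy ROM patcher: apply single-byte patches, then fix the Game Boy header
# checksum so the modified cartridge image still boots. Offsets below are
# placeholders for illustration, not the real Pokémon Red starter addresses.

def patch_rom(path_in: str, path_out: str, patches: dict[int, int]) -> None:
    rom = bytearray(open(path_in, "rb").read())

    for offset, value in patches.items():  # offset -> new byte value
        rom[offset] = value

    # Header checksum per the Game Boy spec: x = x - byte - 1 over 0x0134..0x014C,
    # stored at 0x014D.
    checksum = 0
    for addr in range(0x0134, 0x014D):
        checksum = (checksum - rom[addr] - 1) & 0xFF
    rom[0x014D] = checksum

    open(path_out, "wb").write(bytes(rom))

# Hypothetical usage: swap three starter species IDs at made-up offsets.
# patch_rom("red.gb", "red_modded.gb", {0x1D10E: 0x54, 0x1D115: 0x55, 0x1D11C: 0x56})
```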

Other demos reinforced the same theme: long, careful “thinking” paired with coding and agentic execution. A Minecraft clone attempt produced a procedural world with multiple block types, a house, and environmental cues like a setting sun, nighttime, and clouds. A separate front-end comparison showed that enabling “skills” changed the look and usability tradeoffs: without skills, the interface was more visually bold, while with skills it became more structured and easier to parse at a glance.
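
As an illustration of what a "procedural world with multiple block types" typically reduces to, the toy sketch below derives terrain height and block type from a deterministic function of position and adds a repeating day/night light level. It assumes a simple heightmap approach; the video does not show how the model's clone actually generates its world.

```python
# Toy heightmap terrain with block types and a day/night cycle (illustrative
# assumption only; not the generated clone's actual code).
import math

def height(x: int, z: int) -> int:
    """Deterministic terrain height from overlapping sine waves (a stand-in for noise)."""
    return int(8 + 4 * math.sin(x * 0.15) + 3 * math.sin(z * 0.11) + 2 * math.sin((x + z) * 0.07))

def block_at(x: int, y: int, z: int) -> str:
    """Pick a block type by depth below the surface column at (x, z)."""
    surface = height(x, z)
    if y > surface:
        return "air"
    if y == surface:
        return "grass"
    if y >= surface - 3:
        return "dirt"
    return "stone"

def sun_intensity(t: float, day_length: float = 600.0) -> float:
    """Light level in [0, 1] over a repeating day/night cycle of day_length seconds."""
    return max(0.0, math.sin(2 * math.pi * t / day_length))

# Example: sample the surface block across a 4x4 patch (all "grass" by construction).
print([[block_at(x, height(x, z), z) for x in range(4)] for z in range(4)])
```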

Where GPT 5.4 Pro’s strengths showed most clearly was in difficult technical tasks that required both accuracy and interactive behavior. In a 3D engine visualization prompt (including X-ray views of pistons and the ability to start/rev/stop), GPT 5.4 initially failed to render the engine in one standard configuration, while Gemini 3.1 Pro succeeded on the first try. But after “anti-gravity” bug-fixing, GPT 5.4 Pro produced a far more detailed engine model—complete with moving pistons, throttle behavior, exhaust heat/glow, and an educational mode showing gas flow—though it still had some positioning errors (e.g., incorrect valve placement).
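
The moving-piston behavior such a visualization has to animate comes down to standard crank-slider kinematics: piston travel is a function of crank angle, crank radius, and connecting-rod length. The sketch below uses illustrative dimensions and is not the code either model produced.

```python
# Crank-slider kinematics: distance from crank center to piston pin as a
# function of crank angle. Dimensions are illustrative, not from the demo.
import math

def piston_position(theta: float, crank_radius: float = 0.04, rod_length: float = 0.14) -> float:
    """Piston pin distance (m) from the crank center for crank angle theta (rad)."""
    return crank_radius * math.cos(theta) + math.sqrt(
        rod_length**2 - (crank_radius * math.sin(theta)) ** 2
    )

# One revolution sampled every 30 degrees: the piston oscillates between
# rod_length - crank_radius (bottom dead center) and rod_length + crank_radius (top dead center).
for deg in range(0, 360, 30):
    print(deg, round(piston_position(math.radians(deg)), 4))
```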

GPT 5.4 Pro also delivered a “production-ready” instrument pack: it recreated 18 classic band instruments using code and agentic research, then packaged the output into a ~60 MB zip with spectrogram visualizations and playable audio. The build time was about 65 minutes and 18 seconds, and the model was described as consistently willing to spend over an hour on complex tasks.
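
To make that deliverable concrete, one plausible way to produce a note and a spectrogram image like those in the pack is additive synthesis followed by matplotlib's `specgram`. The harmonic weights, decay rate, and filename below are assumptions, not details from the video.

```python
# One "instrument" as additive synthesis of a decaying tone, plus a spectrogram
# image of it. Synthesis parameters here are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

SR = 44100  # sample rate in Hz

def synth_note(freq: float, seconds: float = 2.0) -> np.ndarray:
    t = np.linspace(0, seconds, int(SR * seconds), endpoint=False)
    envelope = np.exp(-3.0 * t)            # simple exponential decay
    harmonics = [1.0, 0.5, 0.3, 0.15]      # assumed harmonic weights
    tone = sum(a * np.sin(2 * np.pi * freq * (i + 1) * t)
               for i, a in enumerate(harmonics))
    return (envelope * tone).astype(np.float32)

note = synth_note(220.0)  # A3
plt.specgram(note, NFFT=1024, Fs=SR, noverlap=512)
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.savefig("instrument_spectrogram.png")  # hypothetical output filename
```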

In broader comparisons, Gemini 3.1 Pro and Claude Opus 4.6 remained strong competitors—especially for efficiency and certain multimodal tasks. A multimodality test using a distorted Family Guy image showed GPT 5.4 “thinking” misidentifying characters (Batman/Snoopy instead of Peter Griffin/Brian), while Gemini handled it better. For 3D water simulation on a rotatable globe with lemon-tree spawning, all three models worked on first try, but Claude’s variant was judged to have the most convincing physics detail, with GPT 5.4 Pro close behind.

The overall takeaway: GPT 5.4 Pro is “undefeated” for the most demanding, code-heavy, agentic work, but its API cost is described as astronomically higher than other options. The practical recommendation was to use GPT 5.4 Pro when the task is too complex for faster or cheaper models, while leaning on Gemini or Claude for multimodal speed/quality and on open-source alternatives (notably Qwen 3.5) for local use and cost control.

Cornell Notes

GPT 5.4 Pro is presented as the strongest model tested for difficult, code-heavy, agentic tasks—especially when the goal is to generate working software artifacts rather than just text. Demos include autonomously modifying a Pokémon Red ROM into a playable game, building a Minecraft-like procedural world, and creating a detailed 3D engine simulator with working controls after bug-fixing. It also generated a “production-ready” pack of 18 instruments using code and research, taking about 65 minutes to complete. The tradeoff is cost: the Pro checkpoint is described as far more expensive and less efficient than Gemini 3.1 Pro and Claude Opus 4.6, and it can still struggle on certain multimodal identification tests where Gemini performs better.

What makes GPT 5.4 Pro feel “agentic” in these tests, beyond generating text?

It repeatedly produces functioning, interactive outputs by writing code and performing multi-step execution. The clearest example is a Pokémon Red ROM modification: the game boots normally, runs the standard start flow, and includes modded AI “starter” characters and battle behavior. Similar agentic behavior appears in the Minecraft-like attempt (procedural world generation plus environmental states like day/night) and in the 3D engine simulator, where the model not only renders parts but also supports starting/revving/stopping and an X-ray educational mode. When initial output is incomplete, an autonomous bug-fixing pass (“anti-gravity”) is used to repair the result.

How did GPT 5.4 Pro perform on the 3D engine visualization task compared with Gemini 3.1 Pro and Claude Opus 4.6?

In one standard configuration, GPT 5.4 failed to show the engine on the first shot (the engine seemed to exist in code but wasn’t rendered). Gemini 3.1 Pro succeeded immediately, showing piston motion and allowing engine start/rev. Claude Opus 4.6 (with no extended thinking) also had a similar issue where the engine wasn’t visible. After GPT 5.4’s bug-fixing step, it produced a much more detailed engine: moving pistons, throttle behavior, exhaust heat glow, and additional educational visualization—though it still had errors like incorrect valve positioning and imperfect piping alignment.

What does the instrument-pack demo suggest about GPT 5.4 Pro’s ability to do long, structured work?

It suggests the model can sustain long “thinking” and produce packaged deliverables. The prompt asked for 18 classic band instruments; GPT 5.4 Pro used code and agentic abilities to generate a production-ready pack, taking 65 minutes and 18 seconds. The output included a ~60 MB zip and a spectrogram grid image for the instruments, plus a readme-style explanation intended for a language model. The demo also included listening to generated audio for multiple instruments, with some sounding pleasant and at least one failing (the cymbal).

Why did the front-end “skills enabled vs disabled” comparison matter?

It highlighted a usability/appearance tradeoff tied to whether the model uses “skills.” Without skills, the interface was described as basic but visually bold and easier to glance at for some styles. With skills enabled, the interface became more structured and logically organized, with a side menu and clearer controls, but the demo suggested the more playful visual style diminished. The key point was that enabling skills improved pattern adherence and clarity, even if it changed the aesthetic.

How did multimodality tests affect the ranking among GPT 5.4, Gemini 3.1 Pro, and Claude Opus 4.6?

Multimodality was a deciding factor in at least one image-identification test. A distorted Family Guy screenshot was misread by GPT 5.4 “thinking,” identifying characters as Batman and Snoopy instead of Peter Griffin and Brian. Gemini 3.1 Pro was described as handling such distorted multimodal inputs better, while Claude Opus 4.6 was not treated as a top multimodal contender in this comparison.

What were the main tradeoffs when comparing GPT 5.4 Pro to Gemini 3.1 Pro and Claude Opus 4.6?

GPT 5.4 Pro was judged strongest for the most demanding, agentic coding tasks, but it was also described as astronomically more expensive and less efficient than Gemini 3.1 Pro and Claude Opus 4.6. For efficiency and certain categories—especially multimodal tasks—Gemini and Claude were treated as better fits. For 3D water simulation on a globe with lemon-tree spawning, all three worked on first try, but Claude’s variant was judged to have the most convincing physics detail, with GPT 5.4 Pro close behind.
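
As a rough illustration of the kind of surface detail being judged, a cheap and common way to fake animated water is to sum a few travelling sine waves of different wavelengths and directions, as sketched below; the actual simulations the three models produced are not shown as code in the video.

```python
# Water surface height as a sum of travelling sine waves (illustrative only).
import math

WAVES = [  # (amplitude, wavelength, speed, direction angle in radians)
    (0.20, 8.0, 1.2, 0.0),
    (0.10, 3.5, 2.0, 1.1),
    (0.05, 1.4, 3.1, 2.3),
]

def water_height(x: float, z: float, t: float) -> float:
    """Surface height at (x, z) and time t from summed travelling sine waves."""
    h = 0.0
    for amp, wavelength, speed, angle in WAVES:
        k = 2 * math.pi / wavelength                   # wavenumber
        d = x * math.cos(angle) + z * math.sin(angle)  # distance along wave direction
        h += amp * math.sin(k * (d - speed * t))
    return h

print(round(water_height(1.0, 2.0, 0.5), 3))
```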

Review Questions

  1. In which demos did GPT 5.4 Pro produce a working artifact on the first attempt versus requiring bug-fixing to reach a functional result?
  2. What specific multimodal failure was highlighted for GPT 5.4 “thinking,” and which model performed better on that same task?
  3. How do the instrument-pack and driving-game demos illustrate the relationship between long “thinking” time, coding, and output quality?

Key Points

  1. GPT 5.4 Pro is strongest in agentic, code-driven tasks that produce working software artifacts, not just text.

  2. Autonomous ROM editing can yield a playable Pokémon game with modded starters and battle behavior.

  3. Long “thinking” plus coding enables complex deliverables like a ~60 MB instrument pack generated from scratch in about 65 minutes.

  4. GPT 5.4 Pro can fail on first render in technical 3D tasks, but bug-fixing passes can unlock much more detailed results.

  5. Gemini 3.1 Pro showed better performance on at least one distorted-image multimodality identification test.

  6. Claude Opus 4.6 was judged to lead in certain physics-heavy simulations (e.g., water wave realism) despite GPT 5.4 Pro being close.

  7. The biggest practical limitation is cost: GPT 5.4 Pro’s API pricing is described as far higher than competing checkpoints, making it best reserved for the hardest jobs.

Highlights

GPT 5.4 Pro modified a real Pokémon Red ROM autonomously—booting and running like a standard game with modded content.
After bug-fixing, GPT 5.4 Pro produced a highly detailed 3D engine simulator with moving pistons, throttle behavior, and an X-ray educational mode.
A 65-minute 18-instrument “production-ready” pack was generated using code and research, delivered as a ~60 MB zip with spectrogram visuals.
Multimodal identification on a distorted Family Guy image tripped GPT 5.4 “thinking,” while Gemini handled it better.
GPT 5.4 Pro’s advantage is capability; its drawback is “astronomically high” cost and lower efficiency than Gemini 3.1 Pro and Claude Opus 4.6.

Topics