
GPT 5.2 and Image-gen-2 from Open AI - A final swing at Google?

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT Image 2 (appearing in tests as “Hazel Gen 2”) can label many human-cell components correctly, but a few organelle details (notably Golgi and lysosome placement) still show scientific inaccuracies.

Briefing

OpenAI’s latest push, GPT 5.2 plus an image model billed as “Image-gen-2,” is landing as a serious, if uneven, challenge to Google’s top generators. Early community tests suggest the image system comes surprisingly close to Google’s “Nano Banana Pro” on scientifically accurate labeled diagrams, while GPT 5.2’s coding and 3D demos look competitive with Google’s best models, especially in how much complete functionality can be generated from code-only prompts.

On the image side, a human-cell diagram prompt (“Create a fully labeled diagram of a human cell with at least 10 elements”) becomes the main battleground. The diagram produced by “Hazel Gen 2” (widely assumed to be OpenAI’s GPT Image 2) gets several core labels right: the spelling of “human cell,” the placement and detail of the plasma membrane, and labels for multiple organelles such as the cytoskeleton, centrosome, lysosome, mitochondria, Golgi apparatus, vesicles, ribosomes, and the smooth and rough endoplasmic reticulum. But accuracy breaks down in specific spots: the Golgi apparatus appears scrambled or mislocated, ribosomes are rendered as tiny dots, and some organelle associations look swapped or misplaced, including confusion between vesicles and mitochondria and a lysosome that points toward the nucleus rather than its proper location. The tester’s bottom line: Nano Banana Pro still wins on “scientific accuracy,” but by a smaller margin than expected.

A second prompt—a chai recipe with step-by-step instructions—shows fewer obvious errors and highlights a different kind of strength: GPT Image 2’s output includes more structured, step-aligned visuals and a more “textbook” feel, while Nano Banana Pro’s visuals lean more artistic and ingredient-focused. The comparison ultimately comes down to preference: one system is more semantically tidy and step-by-step, the other more visually stylized.

Then the focus shifts to GPT 5.2, whose rollout lands as the video is being made. In Element Arena demos, GPT 5.2 generates substantial 3D scenes and interactive projects from code prompts alone (no image inputs), ranging from a Golden Gate Bridge with adjustable weather, time of day, and traffic density to voxel foliage, animated fish, and even a rocket launch with particle effects. Benchmarks cited from LM Arena and AIM-style leaderboards place GPT 5.2 near the top: it is described as improving on GPT 5.1, with “high thinking” variants trading off cost and accuracy against Claude 4.5 Opus and Gemini 3 Pro.

But practical tests also reveal rough edges. In a water-physics HTML challenge (interactive 3D with reflections and wave simulation), GPT 5.2 initially produces broken or incomplete code, then iterates toward a working version—though it introduces issues like “ghost lemons” and glitchy physics. In a physics-based jelly platformer, GPT 5.2 generates very large codebases (over a thousand lines) and playable-looking graphics, yet physics can become unbalanced or unplayable due to assumptions about frame rate and tuning. The overall verdict: GPT 5.2 is surprisingly competitive and strong as a starting point for real projects, while Google’s Gemini 3 Pro remains formidable—sometimes even better at producing tighter, more immediately functional results in specific physics tasks.

Cornell Notes

OpenAI’s GPT 5.2 and “Image-gen-2” are being tested against Google’s Nano Banana Pro and Gemini 3 Pro on two fronts: labeled scientific diagrams and code-generated interactive projects. In the human-cell diagram task, GPT Image 2 (appearing as “Hazel Gen 2”) gets many major labels right but still misplaces or scrambles some organelles, leaving Nano Banana Pro slightly ahead on strict scientific accuracy. For chai recipe generation, both systems perform well, with differences mainly in how visuals are organized and how “textbook” versus “artistic” the presentation feels. GPT 5.2’s code-only 3D demos look highly competitive, but hands-on physics tests show it can produce broken code, “ghost” glitches, and physics-tuning issues that require iteration.

How did GPT Image 2 perform on the human-cell diagram compared with Nano Banana Pro?

The human-cell prompt produced a diagram with correct spelling (“human cell”) and strong placement/detail for the plasma membrane. Several organelles were labeled plausibly (cytoskeleton, centrosome, mitochondria, Golgi apparatus, vesicles, ribosomes, and smooth/rough endoplasmic reticulum). Accuracy failed in specific scientific areas: the Golgi apparatus looked scrambled or misdirected, ribosomes were rendered as near-dot artifacts, and there were swaps/misassociations such as vesicles being mistaken for mitochondria and lysosome placement drifting toward the nucleus. The tester’s conclusion was that Nano Banana Pro still edges out GPT Image 2 on scientific correctness, but the gap was smaller than expected.

What did the chai recipe comparison reveal about each image model’s strengths?

Both models handled the chai recipe without glaring mistakes, but they differed in presentation. GPT Image 2’s output was more step-by-step and structured, with visuals tied to the process and ingredients, while Nano Banana Pro’s visuals were more artistic and ingredient-forward. The tester also noted minor semantic quibbles (for example, whether the instructions should call for boiling longer or simmering), but overall both recipes were judged sound. The final call comes down to user preference: structured instructional visuals or a more artistic layout.

What makes GPT 5.2 stand out in the code-and-3D demos?

Element Arena demos emphasize code-only generation: GPT 5.2 is prompted to create 3D scenes (e.g., Golden Gate Bridge) without being given images, then adds parameters like weather, time of day, and traffic density. The demos also show particle effects and reflections/waves in water-like settings, plus cohesive multi-part projects (voxels, HTML, animations). The tester interprets this as a meaningful step up from earlier versions, describing it as more “graduate level” coding because it produces larger, integrated projects rather than isolated snippets.
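
As a rough illustration of what “adjustable parameters” means in a code-only project, the sketch below shows one way such controls might be wired up. The interface, value ranges, and lighting formula are assumptions for illustration, not code taken from the Element Arena demos.

```ts
// Hypothetical sketch of scene controls like those described in the demos.
// Everything here (names, ranges, the lighting formula) is an assumption.
interface SceneControls {
  weather: "clear" | "fog" | "rain";
  timeOfDay: number;      // hour of day, 0-24
  trafficDensity: number; // 0 (empty bridge) to 1 (gridlock)
}

// Derive a simple light intensity from the controls so the scene can be
// re-rendered whenever a slider changes; a real demo would also reposition
// the sun, recolor the sky, and spawn or remove traffic.
function lightIntensity(c: SceneControls): number {
  const hoursFromNoon = Math.abs(c.timeOfDay - 12);
  const daylight = Math.max(0, 1 - hoursFromNoon / 8); // brightest at noon
  return c.weather === "clear" ? daylight : daylight * 0.6;
}

// Example: a foggy morning with moderate traffic.
const morning: SceneControls = { weather: "fog", timeOfDay: 8, trafficDensity: 0.4 };
console.log(lightIntensity(morning)); // ≈ 0.3
```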

Why did GPT 5.2 struggle in the water-physics HTML test?

In the water-physics prompt, GPT 5.2 initially generated code that didn’t work correctly in CodePen, then tried to rectify the issues by making assumptions, at one point blaming the user’s CodePen setup rather than fixing its own code. Later iterations improved functionality (lemons could be dropped and reflections appeared) but introduced new problems such as “ghost lemons,” floating or glitchy behavior, and water rendering that felt “mystical” or incomplete. The tester also noted that Gemini 3 Pro sometimes produced a more reliable result with fewer lines of code and better two-shot completion.
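
The video doesn’t show what caused the “ghost lemons,” but a common source of ghost objects in browser 3D scenes is a mesh that gets dropped from the simulation without being removed from the scene graph. A minimal sketch of the fix, assuming a three.js-style setup (the lemon specifics are invented for illustration):

```ts
import * as THREE from "three";

const scene = new THREE.Scene();
const lemons: THREE.Mesh[] = [];

// Spawn a lemon: create the mesh, add it to the scene, and track it.
function spawnLemon(): THREE.Mesh {
  const mesh = new THREE.Mesh(
    new THREE.SphereGeometry(0.5, 16, 16),
    new THREE.MeshStandardMaterial({ color: 0xffe135 })
  );
  scene.add(mesh);
  lemons.push(mesh);
  return mesh;
}

// Removing a lemon only from the tracking array would leave a "ghost" that
// keeps rendering at its last position; it must also be detached from the
// scene and have its GPU resources released.
function despawnLemon(mesh: THREE.Mesh): void {
  scene.remove(mesh);
  mesh.geometry.dispose();
  (mesh.material as THREE.Material).dispose();
  const i = lemons.indexOf(mesh);
  if (i !== -1) lemons.splice(i, 1);
}
```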

How did GPT 5.2 perform on the physics jelly platformer, and what went wrong?

GPT 5.2 produced a large codebase (over a thousand lines) with graphics, music, and a working-looking physics-platformer prototype. However, the physics tuning was unstable: jumping and flinging strength was far too high, causing the jelly character to teleport or deform and making levels difficult or unplayable. The tester suspects the model assumed a particular frame rate (e.g., 60 FPS), so the physics ran too fast on the tester’s hardware. Even though the game looked impressive, finishing it would require fixing the physics parameters and making the behavior robust across hardware.
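
The frame-rate failure mode the tester describes is a classic one: physics written as fixed per-frame increments runs faster on high-refresh displays. Below is a minimal sketch of the usual remedy, stepping the simulation at a fixed rate independent of render rate; the gravity and starting values are placeholders, not numbers from the generated game.

```ts
// Fixed-timestep physics: the simulation advances in constant 1/60 s steps no
// matter how fast frames arrive, so a 144 Hz display produces the same motion
// per second as a 60 Hz one. Values below are placeholders for illustration.
const FIXED_STEP = 1 / 60;
const GRAVITY = -30;     // units per second squared

let y = 5;               // jelly character height
let velocityY = 0;       // vertical velocity
let accumulator = 0;
let lastTime = performance.now();

function stepPhysics(dt: number): void {
  velocityY += GRAVITY * dt;
  y += velocityY * dt;
  if (y < 0) { y = 0; velocityY = 0; } // land on the floor
}

function frame(now: number): void {
  accumulator += (now - lastTime) / 1000; // elapsed real time in seconds
  lastTime = now;
  while (accumulator >= FIXED_STEP) {     // catch up in fixed-size steps
    stepPhysics(FIXED_STEP);
    accumulator -= FIXED_STEP;
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```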

Review Questions

  1. In the human-cell diagram test, which organelles were most likely to be mislocated or misidentified, and why does that matter for “scientific accuracy”?
  2. What differences in output format (step-aligned visuals vs more artistic ingredient/process visuals) influenced the chai recipe comparison?
  3. In the water-physics and jelly-game tests, what kinds of failures appeared (broken code, physics instability, frame-rate assumptions), and how would you design prompts or validation steps to reduce them?

Key Points

  1. GPT Image 2 (appearing in tests as “Hazel Gen 2”) can label many human-cell components correctly, but a few organelle details (notably Golgi and lysosome placement) still show scientific inaccuracies.
  2. Nano Banana Pro edges GPT Image 2 on strict scientific correctness in the human-cell diagram, though the gap appears smaller than expected.
  3. Chai recipe generation works well for both systems; the main differences are how visuals map to steps and ingredients and how “textbook” versus “artistic” the presentation feels.
  4. GPT 5.2’s strongest early signal is code-only generation of substantial 3D projects with adjustable parameters (weather, time of day, traffic) and effects like reflections and particles.
  5. Hands-on physics coding reveals reliability gaps: GPT 5.2 can output broken or incomplete CodePen code and may introduce glitches such as “ghost” objects.
  6. Physics games can fail due to tuning assumptions (like frame rate) and overly strong mechanics, even when graphics and overall structure look impressive.
  7. Benchmark chatter places GPT 5.2 near the top against Gemini 3 Pro and Claude 4.5 Opus, with tradeoffs between accuracy (“high thinking”) and cost per task.

Highlights

In labeled cell diagrams, GPT Image 2 gets many major structures right but still misplaces or scrambles certain organelles, keeping Nano Banana Pro slightly ahead on scientific accuracy.
GPT 5.2’s code-only 3D demos show unusually cohesive projects—Golden Gate Bridge scenes with weather/time/traffic controls and effects like reflections and waves.
Physics tests are where the reliability gaps show: GPT 5.2 can produce “ghost lemons” and unstable jelly-game physics, suggesting frame-rate and tuning assumptions that need correction.
The emerging pattern: GPT 5.2 is highly competitive as a starting point for building interactive projects, but it still needs iteration to reach “works everywhere” behavior.

Topics

Mentioned

  • API
  • LM Arena
  • ELO
  • FPS
  • HTML