
ChatGPT o3: Model Breakdown vs. Gemini 2.5 Pro, o3 Work Skills, Plus AI Landscape Review post-o3

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

o3 is reported to produce more complete results across multiple job-skill-style tasks, especially when outputs must remain internally consistent and self-critical.

Briefing

OpenAI’s o3 is emerging as the more reliable “everyday” model after hands-on tests that target real job skills—especially tasks where models must stay internally consistent, critique their own work, and handle multimodal details. In side-by-side challenges against Gemini 2.5 Pro, o3 produced richer, more self-aware outputs and held up better when the tasks required readable text inside generated images and careful review/rebuttal across perspectives. The practical takeaway is that benchmark scores can miss the kinds of failure modes that matter at work: confident reasoning that quietly drifts, or multimodal outputs that look plausible while containing factual gaps.

The creator’s starting point wasn’t a new leaderboard—it was a personal sense that priors about what LLMs can do were shifting again. Within the first day of using o3, they noticed it could detect subtle patterns in meeting outcomes that resisted human pattern-finding, and it could help articulate value-proposition development more fluently—an area where traditional guidance often assumes expensive experimentation and doesn’t map cleanly onto today’s cheaper prototyping. That “sparring partner” feel became the backdrop for a more structured comparison.

To avoid overfitting to familiar benchmark styles, they designed three tests meant to be harder to game and easier to measure in a job-relevant way. The first, a "civilization simulator," asked the model to build a fictional society from the stone age to space flight across 12 logical epochs, generate primary artifacts and laws, and then critique its own narrative for plausibility issues like population size and resource distribution. Both models performed well here, but o3's output was described as more layered and historically resonant, with a more honest self-critique.
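For readers who want to try a similar self-critique test, here is a minimal sketch of the two-pass prompt chain using the OpenAI Python SDK. The model name, prompts, and structure are illustrative assumptions, not the creator's actual setup.

```python
# Minimal sketch of the civilization-simulator test as a two-pass chain:
# build the narrative, then force a self-critique in the same conversation.
# Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BUILD_PROMPT = (
    "Invent a fictional society and narrate its development from the stone "
    "age to space flight across 12 logical epochs. For each epoch, include "
    "at least one primary artifact and one law."
)
CRITIQUE_PROMPT = (
    "Now critique your own narrative for plausibility problems: population "
    "sizes, resource distribution, and technological pacing. List every "
    "inconsistency you find."
)

# Pass 1: generate the civilization narrative.
history = [{"role": "user", "content": BUILD_PROMPT}]
draft = client.chat.completions.create(model="o3", messages=history)
history.append({"role": "assistant", "content": draft.choices[0].message.content})

# Pass 2: self-critique against the same conversation state, so the model
# audits its own output rather than a paraphrase of it.
history.append({"role": "user", "content": CRITIQUE_PROMPT})
critique = client.chat.completions.create(model="o3", messages=history)
print(critique.choices[0].message.content)
```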

The second test—the “multimodal mystery box”—was where the gap widened. The models had to write a mystery story, embed clues in the narrative, generate a custom image containing those clues, and allow a solver to infer the answer without an answer key. o3 succeeded by producing images with readable text and specific, checkable clue details (including a map with San Francisco circled in red grease pencil). Gemini 2.5 Pro, by contrast, was reported to generate images that looked fine at first glance but failed on verifiable specifics: it claimed readable text existed when it did not, and it asserted a clue about a clock that wasn’t actually drawn (the transcript notes the common “10:10” AI-clock trope as a tell).
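The "looks fine at first glance" failure suggests an obvious verification harness: generate the clue image, then ask a vision model to read it back without telling it what should be there. The sketch below assumes the OpenAI Python SDK, the model names "gpt-image-1" and "o3", and a simple keyword check; none of this is taken from the video.

```python
# Sketch of a read-back check for the mystery-box image: render the clue
# image, then have a vision model report what it can actually see, with no
# hints about the expected contents. Models and keywords are assumptions.
from openai import OpenAI

client = OpenAI()

# Expected clues, each paired with a keyword the read-back must contain.
CLUES = {
    "map with San Francisco circled": "san francisco",
    "red grease pencil marking": "red",
}

# Step 1: render the clue image (gpt-image-1 returns base64-encoded PNG data).
image = client.images.generate(
    model="gpt-image-1",
    prompt="A detective's desk holding: " + "; ".join(CLUES),
)
image_b64 = image.data[0].b64_json

# Step 2: independent read-back by a vision model.
readback = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List every legible piece of text and every distinct "
                     "object you can see in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# Step 3: a clue only counts if the read-back confirms it.
seen = readback.choices[0].message.content.lower()
for label, keyword in CLUES.items():
    print(f"{label}: {'verified' if keyword in seen else 'NOT verified'}")
```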

The third challenge, a "peer-review gauntlet," pushed meta-awareness: write a paper, review it from a different perspective, then rebut the reviewer as the author. Here, o3 was described as sharper and more data-obsessed, producing a plausible dataset, then tearing it apart under review, and rebutting effectively. The broader concern raised is that these strengths—persuasiveness, logic, and confidence—can make hallucinations harder to detect. As models get smarter, misalignment and fabrication can shift into forms that humans are less able to spot, especially when the model's outputs are structured like real scholarship or analysis.
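The three-role flow maps naturally onto a single rolling conversation. A minimal sketch, again assuming the OpenAI Python SDK and a chat-capable "o3" model; the stage prompts are invented for illustration.

```python
# Sketch of the peer-review gauntlet as three sequential roles over one
# rolling conversation: author, skeptical reviewer, then author rebuttal.
# Model name and stage prompts are invented for illustration.
from openai import OpenAI

client = OpenAI()

STAGES = [
    "As a researcher, write a short paper with a plausible dataset and analysis.",
    "Switch roles: as a skeptical peer reviewer, attack the paper's methods, "
    "data, and conclusions.",
    "Switch back to the original author and write a point-by-point rebuttal "
    "of that review.",
]

messages = []
for stage in STAGES:
    messages.append({"role": "user", "content": stage})
    reply = client.chat.completions.create(model="o3", messages=messages)
    answer = reply.choices[0].message.content
    # Keep each role's output in context so later stages can attack or
    # defend the exact text, not a summary of it.
    messages.append({"role": "assistant", "content": answer})
    print(f"=== {stage}\n{answer[:400]}\n")
```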

Still, the work-skill mapping is strong: long-horizon planning and narrative coherence from the civilization task, artifact-rich communication from the multimodal mystery, and debate/iteration from the peer-review format. The conclusion is straightforward: when available, o3 should be the first choice for everyday use, even if it isn’t perfect and may face constraints like token limits and server strain.

Cornell Notes

o3 outperformed Gemini 2.5 Pro in three custom, job-skill-aligned tests designed to reduce benchmark overfitting. In a civilization-building simulation with self-critique, both models did well, but o3 produced richer narratives and more candid plausibility checks. In a multimodal mystery challenge that required readable clue text inside generated images, o3 delivered verifiable details, while Gemini 2.5 Pro was reported to hallucinate image content—claiming text or objects existed when they did not. In a peer-review gauntlet (author → reviewer → rebuttal), o3 generated and stress-tested a plausible dataset more effectively. The stakes: stronger reasoning and confidence can make fabricated or incorrect details harder for humans to catch, even when outputs look rigorous.

Why did the comparison avoid standard benchmark scores?

The tests were designed to feel less like “known question types” that models can overfit to. The goal was to use prompts that would fail in ways that are measurable and comparable—so differences show up in concrete outputs rather than in leaderboard-style metrics.

What made the civilization simulator a meaningful work-skills test?

It required long-horizon planning (stone age to space flight across 12 epochs), structured artifact creation (laws and primary artifacts), and then an explicit self-critique for plausibility. The self-review forced the model to check internal consistency—population size and resource distribution—rather than just generate a compelling story.

What exactly broke for Gemini in the multimodal mystery box?

Gemini was reported to produce images that looked convincing but contained false claims. Examples include asserting readable text existed when it didn’t, and claiming a clue about a clock with a particular setting even though no clock was drawn. o3, in contrast, generated an image with readable clue text (e.g., a map with San Francisco circled in red grease pencil) that matched the narrative.

How did the peer-review gauntlet demonstrate o3’s strengths?

The task required three roles in one flow: write a paper, review it from a different perspective, then rebut as the author. o3 produced a plausible dataset, then reviewed and tore it apart, and finally rebutted the reviewer—described as sharper and more mathematically/data-driven than Gemini’s output.

What risk does stronger performance create for real-world use?

As models become more persuasive and logically structured, hallucinations can become harder to detect. The concern is that fabrication may shift into forms that look aligned and rigorous—like peer-reviewed analysis or data-centric critique—where non-specialist human reviewers may not notice the underlying make-believe.

Review Questions

  1. Which of the three tests most directly measured multimodal factuality, and what specific failure mode was reported for Gemini?
  2. How did the civilization simulator’s self-critique requirement change what the model had to do beyond storytelling?
  3. What makes hallucination risk potentially harder to spot as models become more confident and data-structured?

Key Points

  1. o3 is reported to produce more complete results across multiple job-skill-style tasks, especially when outputs must remain internally consistent and self-critical.

  2. A civilization-building simulation with self-critique showed both models could perform, but o3 produced richer narratives and more honest plausibility checks.

  3. In a multimodal mystery challenge requiring readable clue text inside generated images, o3 produced verifiable details while Gemini 2.5 Pro was reported to hallucinate image content (e.g., claiming text or a clock existed when it did not).

  4. A peer-review gauntlet (author → reviewer → rebuttal) highlighted o3's data- and math-oriented rigor, including generating and stress-testing a plausible dataset.

  5. Stronger reasoning and confidence can increase alignment/hallucination risk by making fabricated details harder for humans to detect.

  6. The strongest practical mapping to work skills comes from long-horizon planning, artifact-rich communication, and iterative debate/review cycles.

Highlights

o3’s multimodal outputs included readable clue text that matched the narrative, while Gemini 2.5 Pro was reported to claim readable elements that weren’t actually present.
The multimodal mystery box served as a stress test for “looks right at first glance” failure—Gemini’s image claims didn’t survive verification.
o3 handled the peer-review gauntlet with a dataset-first approach: generate plausible data, critique it, then rebut the critique as the author.