OpenAI SHIPS! o1 FULL & ChatGPT Pro - First Impressions

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI reports major gains for o1 in competition math (56.7 → 83.3) and competition code (62 → 89) after moving from preview to full release.

Briefing

OpenAI’s o1 model is out of preview, and the release is positioned as a clear step up in reasoning benchmarks, especially for math and coding, while also finally delivering the multimodal input support that was missing from the earlier preview. In a short launch livestream, OpenAI reported that o1’s performance jumps from 56.7 (o1 preview) to 83.3 on competition math, and from 62 (o1 preview) to 89 on competition code. For PhD-level science questions, accuracy lands around the same level as the preview (roughly 78–79), but the error rate and performance with fewer attempts are presented as improvements, suggesting the model reaches correct answers more reliably even when the evaluation is stricter.

Alongside the base o1 release, OpenAI introduced “o1 Pro mode” inside a new ChatGPT Pro subscription priced at $200 per month. Pro mode is described as using more compute for harder questions, with reported gains in competition math (about 85) and PhD-level science (about 79). OpenAI also highlighted a “worst of four” benchmark approach: the model attempts each question four times, and a problem counts as solved only if all four attempts are correct—an attempt to reduce cherry-picking and show consistency rather than one-off wins.
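As a rough illustration of that scoring rule (a minimal TypeScript sketch, not OpenAI's actual evaluation harness), a "worst of four" scorer only credits a question when every one of its four attempts passes:

    // Minimal "worst of four" scorer (illustrative; not OpenAI's evaluation code).
    // attempts[q] holds pass/fail outcomes for four independent tries at question q.
    function worstOfFourScore(attempts: boolean[][]): number {
      const solved = attempts.filter(
        (tries) => tries.length === 4 && tries.every((ok) => ok)
      );
      return solved.length / attempts.length; // fraction solved on ALL four tries
    }

    // Question 0 passes all four tries; question 1 misses once, so it doesn't count.
    console.log(
      worstOfFourScore([
        [true, true, true, true],
        [true, true, false, true],
      ])
    ); // 0.5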

The demos leaned on live, human-verifiable tasks. One example involved an astrophysics problem written on paper with a deliberately omitted parameter; o1 not only solved the math but also inferred the missing value, with researchers claiming the inferred parameter was accurate. Another demo tested biological reasoning by asking the model to guess a protein matching constraints like length (210–230 amino acid residues) and chromosome context; the model took about a minute and produced a correct guess.

The transcript then shifts from launch claims to hands-on impressions. Multimodal capability is treated as the biggest practical upgrade: the tester ran a “Where’s Waldo” image search and reported that o1 located Waldo precisely and described nearby visual structures (tents, umbrellas, shoreline crowding) in a way that regular GPT-4o was said to lack. The tester also pushed o1 into coding workflows: recreating a Spotify-like interface in HTML/CSS/JS from a screenshot, generating interactive hover/selection behavior, and producing an infinite-runner game with graphics and scoring. A further step converted the runner into a 3D scene using three.js, with the caveat that the game had minor issues after refresh and lacked a restart button.
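For context on what a three.js port involves, the sketch below shows the generic scene/camera/render-loop boilerplate such a conversion builds on (a stand-in, not the code generated in the video):

    import * as THREE from "three";

    // Generic three.js boilerplate of the kind an HTML/JS runner gets ported into.
    const scene = new THREE.Scene();
    const camera = new THREE.PerspectiveCamera(
      75, window.innerWidth / window.innerHeight, 0.1, 1000
    );
    const renderer = new THREE.WebGLRenderer();
    renderer.setSize(window.innerWidth, window.innerHeight);
    document.body.appendChild(renderer.domElement);

    // Stand-in for the runner's player character.
    const player = new THREE.Mesh(
      new THREE.BoxGeometry(1, 1, 1),
      new THREE.MeshNormalMaterial()
    );
    scene.add(player);
    camera.position.z = 5;

    // The animation loop is where obstacles, scoring, and a restart button would live.
    renderer.setAnimationLoop(() => {
      player.rotation.y += 0.02; // placeholder motion
      renderer.render(scene, camera);
    });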

Logic and real-world reasoning tests were mixed but generally positive. A classic “marble in an upside-down glass” puzzle was answered correctly with step-by-step reasoning, while a North Pole navigation puzzle—previously failing on o1 preview—was reported as solved, with the answer argued to be 2π kilometers. Other tests included identifying an engine bay as a mid-2000s Volkswagen TDI-family engine and a GeoGuessr-style location guess that landed close (Uruguay rather than the initially inferred Argentina).

The central takeaway is that o1’s release combines stronger benchmark performance with multimodal input and more capable coding-from-images, while ChatGPT Pro’s $200 price tag makes the compute-heavy “o1 Pro mode” a premium product that may only be worth it for heavy users. The transcript ends with an expectation of continued testing during OpenAI’s “12 days of shipping” run-up to Christmas.

Cornell Notes

OpenAI’s o1 model has moved from preview to full release, with reported gains in competition math (83.3) and competition code (89), plus multimodal input support that earlier o1 preview lacked. For PhD-level science questions, accuracy is roughly similar to preview (around 78–79), but the evaluation framing suggests improved consistency and fewer-shot behavior. OpenAI also introduced o1 Pro mode inside a $200/month ChatGPT Pro plan, using more compute for harder tasks and emphasizing “worst of four” consistency benchmarks. Hands-on testing in the transcript highlights strong image understanding (e.g., locating Waldo) and strong coding-from-screenshots (HTML/CSS/JS recreation, interactive UI, and game generation).

What benchmark changes are most emphasized for o1’s full release versus o1 preview?

Competition math rises from 56.7 (o1 preview) to 83.3 on release. Competition code increases from 62 (o1 preview) to 89. For PhD-level science questions, accuracy is reported around 78–79 for o1, roughly matching o1 preview, but with discussion of error-rate/fewer-shot differences.

Why does “worst of four” matter in the Pro mode discussion?

The “worst of four” method attempts each question four times and counts it as solved only if all four attempts are correct. That makes the metric harder to “luck into” and is meant to demonstrate consistency rather than one successful run.

What does multimodality change in practical use, based on the transcript’s tests?

Multimodality enables the model to interpret images directly. In the tester’s “Where’s Waldo” experiment, o1 reportedly identified Waldo’s exact location and described nearby visual cues (beach tents, umbrellas, crowding near the shoreline). The transcript contrasts this with regular GPT-4o, which is said to be less precise about exact positioning.
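For readers who want to reproduce an image test like this, the general shape of a multimodal request through the OpenAI Node SDK is sketched below; the model ID and image URL are placeholders, and o1's API image support at the time of the video lagged what ChatGPT exposed:

    import OpenAI from "openai";

    // Generic image-input request via the OpenAI Node SDK (illustrative sketch).
    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    const response = await client.chat.completions.create({
      model: "o1", // placeholder model ID; availability varies by account
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Find Waldo and describe the landmarks around him." },
            { type: "image_url", image_url: { url: "https://example.com/waldo.jpg" } },
          ],
        },
      ],
    });

    console.log(response.choices[0].message.content);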

How did o1 perform on coding tasks generated from screenshots?

The tester sent a screenshot of a Spotify-like interface and asked for an HTML recreation. o1 produced working HTML/CSS/JS code on the first try, including interactive hover/selection behavior. It also generated an infinite-runner game and then converted it to a three.js 3D version, with minor issues such as a missing restart button and some breakage after refresh.
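Hover and selection behavior of the kind described typically comes down to a few event listeners; a generic sketch follows, with the ".track" class name invented for illustration:

    // Generic hover/selection wiring for a list-style UI (class names are invented).
    document.querySelectorAll<HTMLElement>(".track").forEach((row) => {
      row.addEventListener("mouseenter", () => row.classList.add("hovered"));
      row.addEventListener("mouseleave", () => row.classList.remove("hovered"));
      row.addEventListener("click", () => {
        // Single selection: clear the previous pick, then mark this row.
        document.querySelector(".track.selected")?.classList.remove("selected");
        row.classList.add("selected");
      });
    });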

What logic puzzles were used to probe reasoning reliability?

A marble-and-upside-down-glass puzzle was answered correctly with step-by-step reasoning: the marble stays on the table because only the glass is moved. A North Pole navigation puzzle—previously failing on o1 preview—was reported as solved, with the transcript arguing the answer is 2π kilometers.
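If the puzzle is the common variant of walking 1 kilometer south from the pole and then tracing the full circle of latitude back around (an assumption; the transcript does not restate the setup), the 2π figure falls out of the circumference formula:

    // Circumference of a circle roughly 1 km from the North Pole (assumed setup).
    const radiusKm = 1; // distance walked south from the pole
    const loopKm = 2 * Math.PI * radiusKm; // 2π km ≈ 6.28 km
    console.log(loopKm.toFixed(2)); // "6.28"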

Review Questions

  1. Which reported benchmark improvements are largest for o1 release over o1 preview, and which area stays roughly flat?
  2. How does the “worst of four” evaluation design reduce the chance of misleading results?
  3. In the transcript’s hands-on tests, what kinds of tasks most clearly benefit from multimodal input?

Key Points

  1. OpenAI reports major gains for o1 in competition math (56.7 → 83.3) and competition code (62 → 89) after moving from preview to full release.

  2. o1’s full release adds multimodal input support, which the transcript treats as a key missing capability in o1 preview.

  3. PhD-level science accuracy for o1 is presented as roughly similar to preview (about 78–79), with emphasis on consistency/error-rate behavior rather than a dramatic accuracy jump.

  4. ChatGPT Pro is priced at $200/month and includes unlimited access to o1, o1 mini, voice mode, and o1 Pro mode, with o1 Pro mode using more compute.

  5. OpenAI’s “worst of four” benchmark counts a question as solved only if all four attempts are correct, aiming to measure reliability.

  6. Hands-on tests highlight strong image understanding (precise Waldo localization) and strong coding-from-images (HTML/CSS/JS UI recreation and game generation).

  7. Premium pricing raises practical questions about value, especially since o1 Pro mode is not available via the API in the transcript’s account.

Highlights

o1’s competition math score jumps from 56.7 (preview) to 83.3 on release, while competition code rises from 62 to 89.
o1 finally supports multimodal inputs, and the transcript’s Waldo test claims precise location identification from an image.
The $200/month ChatGPT Pro plan bundles unlimited access to o1 Pro mode, framed around higher-compute accuracy and “worst of four” consistency.
Coding demos include generating interactive HTML/CSS/JS from a Spotify-like screenshot and converting an infinite-runner game into three.js 3D.

Topics

  • OpenAI o1 Release
  • o1 Pro Mode
  • ChatGPT Pro Pricing
  • Multimodal Image Understanding
  • Coding From Screenshots

Mentioned

  • Matt Berman