OpenAI SHIPS! o1 FULL & ChatGPT Pro - First Impressions
Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
OpenAI’s o1 model is out of preview—and the release is positioned as a clear step up in reasoning benchmarks, especially for math and coding—while also finally delivering multimodal input support that was missing from the earlier preview. In a short launch livestream, OpenAI reported that o1’s performance jumps from 56.7 (o1 preview) to 83.3 on competition math, and from 62 (o1 preview) to 89 on competition code. For PhD-level science questions, accuracy lands around the same level as the preview (roughly 78–79), but the error-rate and “fewer-shot” behavior are presented as improvements, suggesting the model is getting to correct answers more reliably even when the evaluation is stricter.
Alongside the base o1 release, OpenAI introduced “o1 Pro mode” inside a new ChatGPT Pro subscription priced at $200 per month. Pro mode is described as using more compute for harder questions, with reported gains in competition math (about 85) and PhD-level science (about 79). OpenAI also highlighted a “worst of four” benchmark approach: the model attempts each question four times, and a problem counts as solved only if all four attempts are correct—an attempt to reduce cherry-picking and show consistency rather than one-off wins.
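The "worst of four" rule described above is simple to express as a scoring function. The sketch below is illustrative only: the function names and data layout are assumptions, not OpenAI's actual evaluation harness, and it exists just to show how this rule differs from averaging over attempts.

```python
# Hedged sketch of a "worst of four" scoring rule: a problem counts as
# solved only if all n attempts are correct. Names are illustrative.

def worst_of_n_solved(attempt_results, n=4):
    """Return True only if all n attempts on a problem were correct."""
    assert len(attempt_results) == n
    return all(attempt_results)

def worst_of_n_accuracy(problems, n=4):
    """Fraction of problems solved under the worst-of-n rule."""
    solved = sum(worst_of_n_solved(r, n) for r in problems)
    return solved / len(problems)

# Example: three problems, four attempts each (True = correct attempt).
results = [
    [True, True, True, True],   # solved under worst-of-4
    [True, True, False, True],  # one miss, so not solved
    [True, True, True, True],   # solved
]
print(worst_of_n_accuracy(results))  # 2/3; averaging per attempt would give 11/12
```

The point of the stricter rule is visible in the example: a model that is right most of the time but inconsistent scores much lower under worst-of-n than under per-attempt averaging, which is why OpenAI presents it as a consistency measure.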
The demos leaned on live, human-verifiable tasks. One example involved an AST physics problem written on paper with a deliberately omitted parameter; o1 not only solved the math but also inferred the missing value, with researchers claiming the inferred parameter was accurate. Another demo tested biological reasoning by asking the model to guess a protein matching constraints like length (210–230 amino acid residues) and chromosome context; the model took about a minute and produced a correct guess.
The transcript then shifts from launch claims to hands-on impressions. Multimodal capability is treated as the biggest practical upgrade: the tester ran a “Where’s Waldo” image search and reported that o1 located Waldo precisely and described nearby visual structures (tents, umbrellas, shoreline crowding) in a way that regular GPT-4o was said to lack. The tester also pushed o1 into coding workflows: recreating a Spotify-like interface in HTML/CSS/JS from a screenshot, generating interactive hover/selection behavior, and producing an infinite-runner game with graphics and scoring. A further step converted the runner into a 3D scene using three.js, with the caveat that the game had minor issues after refresh and lacked a restart button.
Logic and real-world reasoning tests were mixed but generally positive. A classic "marble in an upside-down glass" puzzle was answered correctly with step-by-step reasoning, and a North Pole navigation puzzle that o1 preview had previously failed was reported as solved, with the answer argued to be 2π kilometers. Other tests included identifying an engine bay as a mid-2000s Volkswagen TDI-family engine and a GeoGuessr-style location guess that landed close (Uruguay rather than the initially inferred Argentina).
The central takeaway is that o1’s release combines stronger benchmark performance with multimodal input and more capable coding-from-images, while ChatGPT Pro’s $200 price tag makes the compute-heavy “o1 Pro mode” a premium product that may only be worth it for heavy users. The transcript ends with an expectation of continued testing during OpenAI’s “12 days of shipping” run-up to Christmas.
Cornell Notes
OpenAI’s o1 model has moved from preview to full release, with reported gains in competition math (83.3) and competition code (89), plus multimodal input support that earlier o1 preview lacked. For PhD-level science questions, accuracy is roughly similar to preview (around 78–79), but the evaluation framing suggests improved consistency and fewer-shot behavior. OpenAI also introduced o1 Pro mode inside a $200/month ChatGPT Pro plan, using more compute for harder tasks and emphasizing “worst of four” consistency benchmarks. Hands-on testing in the transcript highlights strong image understanding (e.g., locating Waldo) and strong coding-from-screenshots (HTML/CSS/JS recreation, interactive UI, and game generation).
What benchmark changes are most emphasized for o1’s full release versus o1 preview?
Why does “worst of four” matter in the Pro mode discussion?
What does multimodality change in practical use, based on the transcript’s tests?
How did o1 perform on coding tasks generated from screenshots?
What logic puzzles were used to probe reasoning reliability?
Review Questions
- Which reported benchmark improvements are largest for o1 release over o1 preview, and which area stays roughly flat?
- How does the “worst of four” evaluation design reduce the chance of misleading results?
- In the transcript’s hands-on tests, what kinds of tasks most clearly benefit from multimodal input?
Key Points
1. OpenAI reports major gains for o1 in competition math (56.7 → 83.3) and competition code (62 → 89) after moving from preview to full release.
2. o1’s full release adds multimodal input support, which the transcript treats as a key missing capability in o1 preview.
3. PhD-level science accuracy for o1 is presented as roughly similar to preview (about 78–79), with emphasis on consistency/error-rate behavior rather than a dramatic accuracy jump.
4. ChatGPT Pro is priced at $200/month and includes unlimited access to o1, o1 mini, voice mode, and o1 Pro mode, with o1 Pro mode using more compute.
5. OpenAI’s “worst of four” benchmark counts a question as solved only if all four attempts are correct, aiming to measure reliability.
6. Hands-on tests highlight strong image understanding (precise Waldo localization) and strong coding-from-images (HTML/CSS/JS UI recreation and game generation).
7. Premium pricing raises practical questions about value, especially since o1 Pro mode is not available via the API in the transcript’s account.