Open AI SHIPS: "GPT o1" First Look! ("Strawberry" Chain of Thought Reasoning)

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

OpenAI’s o1 models (“o1-preview” and “o1-mini”) are positioned as reasoning-first systems trained with reinforcement learning to produce a longer internal chain of thought before answering.

Briefing

OpenAI has released a new reasoning-focused model family, “o1,” built around the rumored “Strawberry” chain-of-thought approach. For ChatGPT Plus users, the models appear as “o1-preview” and “o1-mini,” with o1-preview positioned for the most advanced reasoning. Early hands-on testing suggests the model can solve certain multi-step logic and reasoning tasks more reliably than GPT-4o, but it still stumbles when details are easy to misread or when the user doesn’t force it to verify its assumptions.

A key demonstration involved the classic “how many R’s are in ‘strawberry’” counting prompt. GPT-4o sometimes got it wrong, while o1-preview handled it correctly. The more revealing test came from a physics-like scenario in which water is altered over a day: ice cubes are formed, moved into an inverted glass along with a silver bead, the glass is flipped onto a table, and then the glass is placed in a microwave. The expected outcome is that the ice and bead end up on the counter rather than trapped under the inverted glass in the microwave, while the remaining teaspoon of water evaporates. GPT-4o produced an incorrect or inconsistent interpretation at first, and o1-preview initially also missed the crucial placement detail. With additional prompting, however, explicitly asking the model to clarify whether the ice is actually in the microwave and to reason through the sequence, o1-preview eventually converged on the intended answer. The takeaway from this test is less about raw intelligence and more about workflow: o1 seems “prompt heavy,” rewarding users who demand step-by-step verification of physical states.

That behavior aligns with OpenAI’s own positioning in the accompanying blog post: o1 is trained with reinforcement learning to improve complex reasoning, including longer internal chain-of-thought before responding. The model’s performance is reported as strong across competitive programming, math, and physics benchmarks, with results improving as training time and “thinking” time increase. The transcript also notes that “full” o1 scores higher than “preview,” and that GPT-4o still trails behind on several reasoning benchmarks—though some categories show smaller gains.

Beyond benchmarks, the hands-on segment highlights practical prompting strategies. Users are encouraged to be specific, request detailed explanations, impose organizational structure, and demand verification, essentially turning the model into a self-checking problem solver (see the sketch below). In a bedroom reorganization task, o1-preview produced a structured, stepwise plan: laundry first, then bed, surfaces, electronics, floor items, and a final inspection. In contrast, o1-mini responded more briefly and sometimes failed to deliver a complete answer.
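
As a concrete sketch of that workflow, the snippet below bundles the advice (specificity, structure, explicit verification) into a single request sent through the standard openai-python chat-completions call. The model name matches what the video shows in ChatGPT; the prompt wording is illustrative, not taken from the video.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative "self-checking" prompt: a specific task, a requested structure,
# and an explicit verification pass before the final answer.
prompt = (
    "Plan how to reorganize a cluttered bedroom. "
    "List the steps in order, explain why each step comes where it does, "
    "and finish with a verification pass that checks nothing was skipped."
)

response = client.chat.completions.create(
    model="o1-preview",  # reasoning-focused model discussed in the video
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```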

Community reactions in the transcript range from excitement about “shipped” reasoning capability to caution about hype. Several commenters emphasize that o1 is not a miracle model that beats everything at once; it shines on hard reasoning tasks and improves with better prompting. Others predict that open-source and competing models will catch up quickly, especially as “reasoning” architectures become more common.

Cornell Notes

OpenAI’s o1 models (“o1-preview” and “o1-mini”) bring a reinforcement-learning approach aimed at complex reasoning, often using a longer internal chain of thought before answering. Early testing suggests o1 can outperform GPT-4o on certain logic and reasoning problems, but it may still miss key physical details unless users prompt it to verify assumptions. A microwave-and-inverted-glass physics scenario showed that o1-preview could reach the correct outcome after targeted follow-ups, such as confirming where the ice actually is during the microwave step. The linked blog post ties performance gains to more reinforcement training and more “thinking” time, and community reactions stress that results depend heavily on prompting quality.

What does “o1” claim to do differently from earlier LLMs?

o1 is described as a new large language model trained with reinforcement learning to perform complex reasoning. The model is positioned as “thinking before answering,” with the ability to produce a longer internal chain-of-thought. Reported performance improves with more reinforcement learning time and with more time spent thinking at inference.
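
OpenAI has not published how o1 spends that thinking time. As a rough, self-contained illustration of why extra inference-time compute can buy accuracy, the toy sketch below uses self-consistency (sample several answers and take the majority vote), which is a different technique from o1's hidden chain-of-thought but trades compute for reliability in the same spirit.

```python
import random
from collections import Counter

def noisy_answer(prompt: str) -> str:
    """Toy stand-in for a single model call: right 60% of the time."""
    return "3" if random.random() < 0.6 else random.choice(["2", "4"])

def majority_vote(prompt: str, samples: int = 9) -> str:
    """Sample several answers and return the most common one.
    More samples means more inference compute and a more reliable consensus;
    this is self-consistency, not OpenAI's published o1 mechanism."""
    votes = Counter(noisy_answer(prompt) for _ in range(samples))
    return votes.most_common(1)[0][0]

print(majority_vote("How many R's are in 'strawberry'?"))  # usually "3"
```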

Why did the microwave-and-ice logic test require extra prompting?

The scenario hinged on a subtle physical detail: whether the ice cubes and silver bead remain trapped under the inverted glass when the glass is placed into the microwave. Initial answers treated the setup as if the ice stayed in the microwave area, but follow-up prompts that forced clarification—explicitly asking whether the ice is in the microwave—pushed the model to re-evaluate the sequence and converge on the expected outcome.
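
A minimal sketch of that two-turn workflow, assuming the standard openai-python chat format (the riddle wording below paraphrases the video's scenario rather than quoting it):

```python
from openai import OpenAI

client = OpenAI()

riddle = (
    "Ice cubes are frozen from water, moved into an inverted glass along with "
    "a silver bead, the glass is flipped onto the counter, and the glass is "
    "then placed in a microwave. Where are the ice and the bead at the end?"
)

# First pass: the model may assume the ice went into the microwave too.
first = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": riddle}],
)

# Verification turn: force a re-check of the crucial placement detail.
second = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": riddle},
        {"role": "assistant", "content": first.choices[0].message.content},
        {"role": "user", "content": "Is the ice actually inside the microwave? "
                                    "Re-trace where every object is after each "
                                    "step, then answer again."},
    ],
)
print(second.choices[0].message.content)
```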

How did o1 perform on the “R’s in strawberry” counting task compared with GPT-4o?

In the counting test, o1-preview returned the correct result: three R’s in “strawberry.” GPT-4o was described as inconsistent on similar prompts, sometimes correct and sometimes wrong, highlighting that o1’s reasoning behavior can improve reliability on straightforward counting and constraint problems.
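
The ground truth is easy to check programmatically:

```python
# "strawberry" contains three R's: st(r)awbe(r)(r)y.
print("strawberry".count("r"))  # -> 3
```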

What prompting pattern seemed to help o1 solve harder tasks more reliably?

The transcript repeatedly points to making the model verify its own assumptions: use a pre-prompt that asks it to think like a person running through actions, request detailed step-by-step reasoning, and explicitly ask it to confirm uncertain details before concluding. The model also appears to benefit from prompts that encourage self-evaluation and verification rather than jumping to an answer.
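
One way to package that advice is a reusable pre-prompt prepended to each task; the wording below is an illustrative sketch, not a template from the video.

```python
# Illustrative verification pre-prompt reflecting the transcript's advice:
# trace actions like a person would, keep structure, and self-check.
VERIFY_PREPROMPT = """\
Work through this as if you were physically performing each action.
1. Restate the task and list every object with its starting location.
2. After each step, state where every object now is.
3. Before answering, check each assumption against the steps above.
4. Give the final answer, then a one-line verification of it.
"""

def build_prompt(task: str) -> str:
    """Prepend the verification scaffold to a user task."""
    return f"{VERIFY_PREPROMPT}\nTask: {task}"
```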

How did o1-preview and o1-mini differ in the bedroom organization and other tasks?

o1-preview produced a more structured, multi-step plan for organizing a cluttered bedroom (laundry first, then bed, surfaces and electronics, floor items, and a final inspection). o1-mini often spent fewer seconds thinking and sometimes returned a weaker or incomplete response, suggesting a tradeoff between reasoning depth and responsiveness.

What do the reported benchmarks and community reactions imply about real-world usefulness?

Benchmarks cited in the transcript show o1 improving across math, competitive programming, and physics-oriented evaluations, with stronger results for the full model than for the preview. Community commentary adds a practical caveat: even with strong benchmarks, users may not notice improvements unless they prompt well, and o1 is not expected to be universally superior on every task.

Review Questions

  1. In the microwave-and-inverted-glass scenario, what specific detail had to be clarified for o1 preview to reach the correct conclusion?
  2. How do reinforcement learning training time and inference “thinking time” relate to the performance claims made for o1?
  3. What prompting techniques in the transcript appear to increase the odds of correct reasoning on multi-step problems?

Key Points

  1. OpenAI’s o1 models (“o1-preview” and “o1-mini”) are positioned as reasoning-first systems trained with reinforcement learning to produce a longer internal chain of thought before answering.

  2. Hands-on tests suggest o1 can outperform GPT-4o on logic tasks, but it may still fail when physical or procedural details are ambiguous.

  3. The microwave-and-ice example shows o1 often needs verification prompts that force re-checking where objects end up after each step.

  4. o1’s performance is reported to improve with more reinforcement learning compute and with more time spent “thinking” during inference.

  5. Prompting quality matters: being specific, requesting structured reasoning, and asking for verification can materially change outcomes.

  6. o1-preview tends to produce more complete, structured answers than o1-mini, which may think less and sometimes underperform.

  7. Community reactions emphasize both excitement about shipped reasoning capability and caution against expecting a universal “miracle” model.

Highlights

o1-preview reached the correct physics outcome only after prompts forced it to confirm whether the ice cubes were actually in the microwave at that step.
The “prompt heavy” pattern showed up repeatedly: demanding clarification and verification improved results more than simply asking the question once.
OpenAI’s blog post ties the gains to reinforcement learning and to spending more time thinking before answering.
Community sentiment split between “finally shipped” excitement and reminders that better prompting may be required to notice improvements.
