
Shaping model behavior in GPT-5.1 — the OpenAI Podcast Ep. 11

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-5.1 makes reasoning the default for all ChatGPT models, with the system dynamically choosing how much to think based on the prompt.

Briefing

GPT-5.1 brings a major shift in how OpenAI’s chat models behave: every model available in ChatGPT is now a reasoning model by default. Instead of always “thinking” for the same amount of time, the system can decide how much internal reasoning to do based on the prompt—skipping extra work for simple greetings, then allocating more time for harder questions, using tools when needed, and returning a refined answer. The practical payoff is broad: improved instruction following, better performance across evaluations, and more reliable help for tasks that benefit from deliberate problem-solving.

The release also targets a specific user-experience complaint that surfaced around GPT-5: the model sometimes felt colder or less intuitive. OpenAI traces that perception to multiple layers, not just tone. One factor was a shorter effective context carryover—users could feel the assistant “forgetting” important personal details after a limited number of turns, which can make conversations about sensitive situations feel distant. Another factor involved an “auto switcher” that moves users between chat-style and reasoning-style responses; when that switch happens mid-conversation—such as when someone shares difficult news—the answer can suddenly sound clinical, creating a jarring emotional mismatch.

GPT-5.1 addresses these issues by tuning the aggregate experience so the assistant feels warmer even while changing underlying behavior. It also improves custom instruction retention, a key control point for users who want the assistant to follow their preferences consistently. OpenAI frames this as a steerability problem: users tolerate quirks as long as they can correct them, but quirks become frustrating when the model can’t reliably carry forward instructions or context.

Personality is treated as both a user-facing feature and an engineering challenge. OpenAI introduces “personality” controls (described as response style and tone traits) while also emphasizing that “personality” in practice includes the whole harness around the model—latency, formatting, context window behavior, rate limiting, and even which model gets selected behind the scenes. That matters because users experience “personality” as the end-to-end chat experience, not just the text the model generates.

Under the hood, OpenAI describes the system as more than one set of weights. A reasoning model, a lighter reasoning variant, an auto switcher model, and tool-backed components work together, guided by UI and evaluation-driven switching logic. Feedback at OpenAI’s scale—800 million weekly active users—is handled by inspecting conversation links to diagnose where emotional tone, factuality, and latency break down.

Finally, the conversation ties model behavior to OpenAI’s long-running safety philosophy: maximize freedom while minimizing harm. Instead of blanket refusals, newer safety mechanisms aim to resolve requests without producing harmful content, with nuance that depends on context. Looking ahead, OpenAI expects more steerability and more personalization through features like memory, while still keeping users in control of what the system infers and stores. The message to users is straightforward: keep testing hard questions, because model updates can change outcomes quickly, and ask the assistant to help craft better prompts.

Cornell Notes

GPT-5.1 makes reasoning the default across ChatGPT: the assistant can choose how much to “think” based on the prompt, then refine answers and use tools when appropriate. OpenAI says this improves instruction following and overall evaluation results, but also aims to fix user-perceived coldness by adjusting context carryover and reducing jarring tone shifts caused by automatic switching between chat and reasoning styles. “Personality” is treated as an end-to-end experience shaped by response style controls plus the surrounding system (context window, latency, rate limits, and which internal model is selected). OpenAI also links emotional intelligence to measurable “user signals” research and to practical factors like memory and context logging. The future direction emphasizes more steerability and personalization while keeping users’ freedom and safety boundaries in balance.

What does it mean that all ChatGPT models are “reasoning models” in GPT-5.1?

GPT-5.1 can decide whether to spend extra compute on internal reasoning (“chain of thought”) depending on the prompt. For simple inputs like greetings, it won’t allocate much thinking time; for harder questions, it can take more time to refine its answer and use tools if necessary before responding. OpenAI describes this as improving intelligence and instruction following because the model can think before answering when the task warrants it.
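The per-prompt allocation described above can be pictured as a simple dispatcher. The heuristics, thresholds, and effort labels in this sketch are illustrative assumptions, not OpenAI's implementation:

```python
# Hypothetical sketch of per-prompt reasoning allocation. All names,
# keywords, and thresholds are made up for illustration.

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: longer, question-heavy, task-like prompts score as harder."""
    score = min(len(prompt) / 500, 1.0)
    if "?" in prompt:
        score += 0.2
    if any(kw in prompt.lower() for kw in ("prove", "debug", "analyze", "plan")):
        score += 0.4
    return min(score, 1.0)

def choose_reasoning_effort(prompt: str) -> str:
    """Map estimated difficulty to a reasoning budget."""
    difficulty = estimate_difficulty(prompt)
    if difficulty < 0.2:
        return "none"      # e.g. greetings: answer directly
    if difficulty < 0.6:
        return "light"     # brief internal reasoning
    return "extended"      # full reasoning, tools allowed

print(choose_reasoning_effort("hi!"))  # none
print(choose_reasoning_effort("Can you debug this race condition and plan a fix?"))  # extended
```

A real router would of course be a learned model rather than keyword rules; the point is only that simple prompts skip the extra compute while hard ones get more of it.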

Why did GPT-5 sometimes feel “colder,” and what changed in GPT-5.1?

OpenAI points to multiple causes. First, context carryover issues meant the assistant could lose important earlier details after a limited number of turns, which can feel emotionally distant in sensitive conversations. Second, an auto switcher could move users from chat-style to reasoning-style responses; if that switch happened after someone shared personal bad news, the assistant could suddenly sound clinical. GPT-5.1 focuses on making the overall experience warmer even while changing internal behavior.
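One way to picture the switcher problem is a style selector that also checks for emotional context before routing to the reasoning style. The cue list and decision logic below are hypothetical, not OpenAI's switching logic:

```python
# Illustrative sketch of avoiding jarring tone shifts: hold the warm chat
# style when the user shares sensitive news, even if the task would
# otherwise route to the reasoning model. Cues and names are assumptions.

EMOTIONAL_CUES = ("passed away", "diagnosed", "lost my job", "breakup")

def pick_style(message: str, needs_deep_reasoning: bool) -> str:
    """Choose a response style for the next turn."""
    emotional = any(cue in message.lower() for cue in EMOTIONAL_CUES)
    if emotional:
        return "chat-warm"  # never switch to a clinical tone mid-disclosure
    return "reasoning" if needs_deep_reasoning else "chat"

print(pick_style("My dog passed away yesterday.", needs_deep_reasoning=True))  # chat-warm
print(pick_style("Optimize this SQL query.", needs_deep_reasoning=True))       # reasoning
```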

How does OpenAI handle user control when models have different quirks and switching behavior?

Users can tolerate differences if they can steer the assistant. OpenAI improved custom instructions so preferences persist more consistently across turns. It also provides personality-style controls (traits like response length and formatting choices) so users can guide tone and format. On top of that, product work includes UI and model-switcher learning that uses evals and user signals to decide which response style fits different contexts.

How is “personality” defined beyond just the text the model outputs?

OpenAI says “personality” is overloaded, so it breaks it into components. There’s a personality feature (described as response style and tone traits such as concise vs lengthy responses and emoji usage). But the broader “personality” users feel also comes from the harness around the model: context window behavior, rate limiting (which can route users to different capabilities), latency, and even the app’s presentation. The goal is to map community feedback about personality back to the system components that create the perceived experience.

What does OpenAI mean by measuring progress in “emotional intelligence” (EQ)?

OpenAI calls out “user signals research,” including training reward models and using signals during RL (reinforcement learning) tied to user product data. The aim is to capture intent and context—what the user wants and how the model should respond given conversation history and memory. EQ is also linked to practical behaviors like remembering context correctly, logging memory, and using style choices that resonate with users.

How does memory fit into personalization and user experience?

Memory is described as the model writing down information it learns about a user from conversations so it can refer to it later. That reduces repetition (users don’t have to restate who they are or their preferences) and helps ground future answers. OpenAI also emphasizes user control: memories can be turned on/off or deleted in settings, and the system should be transparent about what it infers.

Review Questions

  1. What mechanisms in GPT-5.1 are responsible for both better reasoning and improved “warmth,” and how do they interact (context window, auto switching, and reasoning depth)?
  2. How does OpenAI reconcile “maximize freedom, minimize harm” with the need for models to be usable rather than defaulting to refusals?
  3. Which parts of the chat experience does OpenAI treat as part of “personality,” and why does that complicate post-training and evaluation?

Key Points

  1. GPT-5.1 makes reasoning the default for all ChatGPT models, with the system dynamically choosing how much to think based on the prompt.
  2. OpenAI attributes “cold” user perceptions to context carryover limits and to tone shifts caused by an auto switcher between chat and reasoning response styles.
  3. GPT-5.1 improves custom instruction retention so user preferences persist more reliably across turns, reducing frustration from lost instructions.
  4. Personality is treated as an end-to-end experience shaped by response style controls plus system-level factors like context window behavior, latency, rate limiting, and which internal model is selected.
  5. OpenAI uses conversation-level diagnostics (conversation links) and multiple signals (factuality, latency, and user experience) to decide when and how to switch response modes.
  6. Emotional intelligence is pursued through “user signals research,” including reward models and reinforcement learning signals tied to real user outcomes.
  7. Memory is positioned as proactive personalization that reduces repetition, while user settings allow turning memory on/off and deleting stored items.

Highlights

For the first time, all ChatGPT models are reasoning models by default, with variable reasoning depth chosen per prompt.
Perceived coldness is linked to both missing context and jarring style changes from an auto switcher that can shift tone mid-conversation.
“Personality” isn’t just wording—it includes latency, formatting, context handling, and even routing decisions caused by rate limiting.
OpenAI frames EQ measurement as “user signals research,” using reward models and RL signals tied to product data rather than a single subjective metric.
Safety is handled through nuance and “safe completions,” aiming to resolve requests without defaulting to blanket refusals.

Topics

Mentioned

  • Andrew Mayne
  • Christina Kim
  • Lentia Ramen
  • Daniel Conorman
  • Kevin Weil
  • Alex Luchska
  • RL