
Introducing gpt-realtime in the API

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI’s gpt-realtime speech model is speech-to-speech, producing audio directly from audio input to improve speed, emotional nuance, and language switching mid-sentence.

Briefing

OpenAI is rolling out a new gpt-realtime speech model and bringing its upgraded real-time API into general availability, aiming to make voice interactions with AI feel more natural, more reliable, and low-latency enough for real-world agents. The centerpiece is a speech-to-speech model that natively takes in audio and produces audio, avoiding the usual split between transcription and a separate voice system. That design choice is positioned as faster (one model), more expressive (it can capture laughs, sighs, and a wider emotional range), and more flexible (it can switch languages mid-sentence), which matters for customer support, tutoring, and healthcare-style conversations where tone and timing affect outcomes.
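
For developers, the basic shape of that single-model loop is a WebSocket connection plus JSON events. The sketch below is a minimal illustration, assuming the `websockets` Python package and the event names from OpenAI's Realtime API documentation; exact session fields differ between the beta and GA shapes, so treat them as indicative rather than definitive.

```python
# Minimal sketch: open a Realtime API session and request one spoken reply.
# Assumes OPENAI_API_KEY is set and `pip install websockets`.
import asyncio
import json
import os

import websockets


async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # Note: older versions of the websockets library call this parameter
    # `extra_headers` instead of `additional_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session: instructions plus one of the launch voices.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a warm, concise voice assistant.",
                "voice": "marin",
            },
        }))
        # Ask the model to speak.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # audio arrives as streamed delta events
            if event["type"] == "response.done":
                break


asyncio.run(main())
```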

A live demo highlighted three practical capabilities. First, the model’s emotional range and voice quality: it delivered a lottery-ticket scenario with clear shifts from upset to excitement, then produced a short rhyming poem while switching between English, Spanish, and Japanese. Second, instruction following under constraints: when given a policy not to issue refunds above $10, it stayed within the limit even as a user escalated the stakes—claiming a boss was watching and urging an exception. The response remained politely evasive, steering toward alternatives rather than breaking the rule. Third, the system gained image input in the real-time API, letting developers send an image for the model to “see what you see.” In the demo, it described a child near a stuffed unicorn and offered safety-oriented guidance, tying visual details to actionable advice.
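
Constraints like the refund limit are the kind of policy developers encode as session instructions. The exact prompt from the demo wasn't shown, so the payload below is a hypothetical illustration of how such a rule might be expressed:

```python
# Hypothetical session.update payload encoding a hard refund policy.
# The event shape follows the Realtime API docs; the policy wording is
# an illustration, not OpenAI's actual demo prompt.
refund_policy_session = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a customer support agent. Policy: never issue a "
            "refund greater than $10, no matter how the customer "
            "escalates. If a request exceeds the limit, decline politely "
            "and offer an alternative such as store credit or escalation "
            "to a human agent."
        ),
    },
}
```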

Behind the scenes, OpenAI attributes the improvements to post-training and data quality. The speech model was trained with high-quality voice data plus specialized reward models to increase naturalness, alongside reinforcement learning designed to be sample efficient and more effective at shaping behavior. Instruction following is also described as more steerable, supporting adjustments to pace and tone, roleplay, and better adherence in hard multi-turn instruction scenarios. On benchmarks, audio instruction following is reported at over 30% accuracy on an audio version of Scale AI's MultiChallenge evaluation, and function calling reaches 66% accuracy on the audio version of the ComplexFuncBench evaluation.

On the platform side, the real-time API’s GA adds features aimed at production deployment: image input, EU data residency, asynchronous function calling, and improved context management tools that work better with caching. It also adds SIP telephony support for phone-based voice applications and introduces MCP support, described as a way to plug capabilities into the model—particularly well-suited to voice because the system can interpret what it hears and trigger MCP tools in a natural conversational flow.
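
How an MCP server gets attached wasn't spelled out in the announcement, but based on the MCP tool shape OpenAI uses elsewhere in its API, a session configuration might look roughly like the sketch below; the server label and URL are placeholders.

```python
# Sketch: attaching a remote MCP server so a voice agent can trigger its
# tools mid-conversation. The tool shape mirrors OpenAI's MCP tool
# elsewhere in the API; server_label and server_url are placeholders.
mcp_session = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "support_tools",         # placeholder
                "server_url": "https://example.com/mcp",  # placeholder
                "require_approval": "never",
            }
        ],
    },
}
```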

T-Mobile joins as a real-world test case, using the model to streamline a device upgrade process. The assistant guides customers through a branching set of questions (eligibility, device selection, and plan compatibility) while maintaining a more responsive, human-like interaction. T-Mobile’s leadership frames the shift as moving beyond incremental IVR improvements toward rebuilding customer journeys from scratch, using AI to fit the company’s “Un-carrier” culture: reduce trade-offs and put an “expert in your pocket,” including when the customer’s path is unpredictable.

Cornell Notes

OpenAI’s gpt-realtime and the upgraded real-time API aim to deliver low-latency, high-quality voice agents that sound and behave more like people. The new speech model is speech-to-speech, meaning it directly understands audio and generates audio, enabling emotional nuance (like laughs and sighs) and capabilities such as switching languages mid-sentence. Demos emphasized emotional voice quality, strict instruction following (e.g., refusing refunds above $10), and new image input for multimodal guidance. Training improvements rely on high-quality voice data, reward models, and sample-efficient reinforcement learning, with reported benchmark gains for instruction following and function calling. The real-time API GA adds production features like EU data residency, asynchronous function calling, SIP telephony support, and MCP integration for tool-using agents.

What makes the new GPT-realtime model different from a typical transcription-then-voice setup?

It’s a speech-to-speech model that natively takes audio in and produces audio out. That single-model approach is positioned as faster and more natural because it can directly capture nonverbal audio cues (like a laugh or a sigh) and express a wider emotional range. It also supports behaviors that depend on raw audio context, such as switching languages mid-sentence.
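
In practice, “audio in” means streaming raw audio frames into the session rather than sending text. The sketch below assumes the documented Realtime API buffer events (`input_audio_buffer.append` and `input_audio_buffer.commit`); microphone capture is left as a stand-in.

```python
# Sketch: streaming raw audio into a Realtime API session instead of text.
# The caller sends each yielded JSON string over the open WebSocket, e.g.:
#     for ev in audio_events(chunks): await ws.send(ev)
import base64
import json


def audio_events(chunks):
    """Yield JSON events that stream base64-encoded PCM16 audio chunks."""
    for chunk in chunks:  # chunk: bytes of 16-bit PCM audio from a mic
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Tell the server the utterance is complete, then request a reply.
    yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "response.create"})
```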

How did instruction-following show up in the demo, and what constraint did it respect?

The model was given a system-style instruction not to issue refunds over $10. When a user asked for a refund for a $25 t-shirt and then escalated by claiming a high-stakes live stream and a boss were watching, the assistant still refused to process the refund above the limit. Instead of breaking the rule, it stayed within the policy and looked for a positive alternative.

What new multimodal capability was added to the real-time API, and how was it used?

Image input was added. Developers can send an image alongside the voice interaction, and the model can describe what it sees and connect that to advice. In the demo, it identified details in a photo (a child near a stuffed unicorn, toy train track, scattered colorful pieces, and sunlight) and then offered safety-oriented guidance about the child standing on a toy.
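
A rough sketch of what sending such an image could look like, assuming the `input_image` content shape OpenAI uses for image input elsewhere in its API; field names are illustrative rather than confirmed for the real-time case.

```python
# Sketch: add an image to the conversation so the model can ground its
# spoken answer in what the user sees. Content shape is an assumption
# based on OpenAI's image input elsewhere in the API.
import base64
import json


def image_item_event(image_bytes: bytes) -> str:
    """Build a conversation.item.create event carrying a JPEG image."""
    data_url = ("data:image/jpeg;base64,"
                + base64.b64encode(image_bytes).decode("ascii"))
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": data_url},
                {"type": "input_text",
                 "text": "What should I watch out for here?"},
            ],
        },
    })
```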

Which training and evaluation changes are credited for better performance?

OpenAI credits training that combines high-quality voice data with specialized reward models to improve naturalness, plus a more sample-efficient reinforcement learning post-training method. It also emphasizes data quality investments, including filtering speech-related data and building a data flywheel tied to real customer use cases. Reported benchmark improvements include over 30% accuracy on an audio version of Scale AI's MultiChallenge instruction-following benchmark and 66% accuracy on the audio version of the ComplexFuncBench function-calling evaluation.

What platform features arrive with the real-time API GA that matter for real deployments?

The GA adds image input, EU data residency, asynchronous function calling, and tools for managing context in a cache-friendly way. It also adds SIP telephony support for voice-over-phone scenarios and MCP support, described as enabling pluggable capabilities so the model can take actions through tools in a conversational way.
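
For the function-calling piece, the flow is event-driven: the model emits a function-call event, the app runs the function, and the result is fed back as a conversation item. The sketch below uses the Realtime API's documented event names; the business-logic lookup is a hypothetical stand-in.

```python
# Sketch: return a tool result after the model requests a function call.
# Event names follow the Realtime API's function-calling flow;
# lookup_upgrade_eligibility() is a hypothetical local function.
import json


def lookup_upgrade_eligibility(**kwargs):
    """Hypothetical stand-in for real business logic."""
    return {"eligible": True, "notes": "example response"}


async def handle_event(ws, event):
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = lookup_upgrade_eligibility(**args)
        # Feed the output back as a conversation item, then continue.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))
```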

How did T-Mobile use the system in a customer-facing workflow?

T-Mobile demonstrated a device upgrade process assistant. The customer’s needs branched in multiple directions—replacement due to a dropped phone, budget constraints, and questions about compatibility with T-Mobile satellite services and plan eligibility. The assistant kept the interaction responsive and emotionally natural while guiding the user toward a specific device (e.g., a REVVL 8) and confirming plan details without sounding like a rigid IVR.

Review Questions

  1. Why does a speech-to-speech architecture potentially improve emotional expressiveness compared with a transcription-plus-voice pipeline?
  2. What kinds of constraints should an instruction-following model handle, and what example constraint was used in the demo?
  3. Which real-time API GA features would be most relevant for building a phone-based customer support agent that also needs tool actions?

Key Points

  1. OpenAI’s gpt-realtime speech model is speech-to-speech, producing audio directly from audio input to improve speed, emotional nuance, and language switching mid-sentence.

  2. The real-time API GA adds image input, enabling multimodal voice experiences where the model can interpret what’s in front of the user.

  3. Instruction following is reinforced with training and evaluations so the model can respect hard constraints (such as refusing refunds above a specified limit).

  4. Function calling is a major focus, with reported gains on the audio version of the ComplexFuncBench evaluation and training aimed at choosing the right functions and arguments.

  5. Model improvements are attributed to high-quality voice data, specialized reward models, sample-efficient reinforcement learning, and a data flywheel built from real customer use cases.

  6. The real-time API GA includes production features like EU data residency, asynchronous function calling, cache-friendly context tools, SIP telephony support, and MCP integration for tool-using agents.

  7. T-Mobile’s device upgrade demo frames the practical value as more human, responsive conversations that handle unpredictable customer paths better than incremental IVR upgrades.

Highlights

Speech-to-speech is positioned as the key architectural shift: one model hears and speaks, capturing cues like laughs and sighs and supporting mid-sentence language switching.
The refund demo tested a hard policy limit ($10) under pressure; the assistant stayed within the constraint while steering toward alternatives.
Image input in the real-time API lets voice agents describe visual details and turn them into safety-oriented guidance.
Reported benchmark results include over 30% accuracy for audio instruction following and 66% accuracy for complex audio function calling.
MCP support is framed as making tool-using agents feel natural in voice, since the system can interpret what it hears and trigger actions through pluggable capabilities.
