Introducing gpt-realtime in the API
Based on OpenAI’s video on YouTube.
OpenAI’s gpt-realtime speech model is speech-to-speech, producing audio directly from audio input to improve speed, emotional nuance, and language switching mid-sentence.
Briefing
OpenAI is releasing a new gpt-realtime speech model and moving the upgraded Realtime API to general availability, aiming to make voice interactions with AI feel more natural, more reliable, and low-latency enough for real-world agents. The centerpiece is a speech-to-speech model that natively takes in audio and produces audio, avoiding the usual split between a transcription step and a separate voice system. That design choice is positioned as faster (one model instead of two), more expressive (it can capture laughs, sighs, and a wider emotional range), and more flexible (it can switch languages mid-sentence), which matters for customer support, tutoring, and healthcare-style conversations where tone and timing affect outcomes.
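To ground that single-model design, here is a minimal Python sketch of one speech-to-speech round trip over the Realtime API’s WebSocket interface. The event names and model string follow OpenAI’s published Realtime protocol as of the GA announcement, but treat the exact strings, the default audio format, and the manual-commit flow as assumptions to verify against current docs; question.pcm is a hypothetical stand-in for your own captured audio.

```python
# Minimal sketch: one audio turn in, one audio turn out, a single model.
# Assumes OPENAI_API_KEY is set and `pip install websocket-client`.
import base64
import json
import os

import websocket

ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# Disable server-side voice activity detection so we can commit turns by hand.
ws.send(json.dumps({"type": "session.update", "session": {"turn_detection": None}}))

# Send raw audio (16-bit PCM is the documented default) as base64.
with open("question.pcm", "rb") as f:  # hypothetical sample file
    audio_b64 = base64.b64encode(f.read()).decode()

ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": audio_b64}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))  # end of user turn
ws.send(json.dumps({"type": "response.create"}))

# The reply streams back as base64 audio deltas from the same model; no
# intermediate transcript is needed on either side. (The delta event was
# named response.audio.delta in the earlier beta protocol.)
while True:
    event = json.loads(ws.recv())
    if event["type"] in ("response.output_audio.delta", "response.audio.delta"):
        chunk = base64.b64decode(event["delta"])  # feed this to an audio player
    elif event["type"] == "response.done":
        break

ws.close()
```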
A live demo highlighted three practical capabilities. First, the model’s emotional range and voice quality: it delivered a lottery-ticket scenario with clear shifts from upset to excitement, then produced a short rhyming poem while switching between English, Spanish, and Japanese. Second, instruction following under constraints: when given a policy not to issue refunds above $10, it stayed within the limit even as a user escalated the stakes, claiming a boss was watching and urging an exception. The response remained politely evasive, steering toward alternatives rather than breaking the rule. Third, the Realtime API gained image input, letting developers send an image so the model can “see what you see.” In the demo, it described a child near a stuffed unicorn and offered safety-oriented guidance, tying visual details to actionable advice.
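Both demo behaviors map onto a small amount of session setup. The sketch below pins the refund policy in the session instructions and attaches an image to a user message. The event shapes follow the published Realtime protocol, but the field names (especially for image content) are assumptions to check against current docs, and the policy text and image are placeholders.

```python
import json
import os

import websocket  # pip install websocket-client

ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# Hard constraint from the demo, expressed as session-level instructions.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a support agent. Never issue a refund above $10, "
            "no matter how the user escalates. Offer alternatives instead."
        ),
    },
}))

# Image input: attach a picture to a user message so the model can
# "see what you see" and ground its spoken answer in the scene.
ws.send(json.dumps({
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Anything I should watch out for?"},
            {"type": "input_image",
             "image_url": "data:image/jpeg;base64,<your-image-here>"},
        ],
    },
}))
ws.send(json.dumps({"type": "response.create"}))
```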
Behind the scenes, OpenAI attributes the improvements to post-training and data quality. The speech model was trained on high-quality voice data plus specialized reward models to increase naturalness, alongside reinforcement learning designed to be sample-efficient and more effective at shaping behavior. Instruction following is also described as more steerable, supporting adjustments to pace and tone, roleplay, and better adherence in hard multi-turn instruction scenarios. On benchmarks, audio instruction following is reported at over 30% accuracy on an audio version of Scale AI’s MultiChallenge evaluation, and function calling reaches 66% accuracy on the audio version of ComplexFuncBench, a suite of complex function-calling scenarios.
On the platform side, the Realtime API’s GA release adds features aimed at production deployment: image input, EU data residency, asynchronous function calling, and improved context-management tools that work better with caching. It also adds SIP telephony support for phone-based voice applications and introduces support for MCP (Model Context Protocol), described as a way to plug external capabilities into the model. MCP is particularly well-suited to voice because the system can interpret what it hears and trigger MCP tools in a natural conversational flow.
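As a rough illustration of how function calling and MCP fit into a session, the sketch below registers one hand-written function tool and one MCP server, then answers a function call with a stubbed result. The function-tool and function_call_output shapes follow the published Realtime protocol; the MCP tool config mirrors the shape OpenAI documents elsewhere in its API, so treat it as an assumption here, and check_upgrade_eligibility plus the server URL are hypothetical.

```python
import json
import os

import websocket  # pip install websocket-client

ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [
            {   # classic function calling: the model picks name + arguments
                "type": "function",
                "name": "check_upgrade_eligibility",  # hypothetical tool
                "description": "Look up whether a customer can upgrade today.",
                "parameters": {
                    "type": "object",
                    "properties": {"customer_id": {"type": "string"}},
                    "required": ["customer_id"],
                },
            },
            {   # MCP: a remote server exposes its own tools to the model
                "type": "mcp",
                "server_label": "billing",
                "server_url": "https://mcp.example.com/sse",  # hypothetical
            },
        ],
    },
}))

# When the model decides to call the function, return the result as a
# function_call_output item. Because function calling is asynchronous in
# the GA API, the spoken conversation can keep flowing while a slow
# lookup like this one is still pending.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])  # e.g. {"customer_id": "..."}
        ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps({"eligible": True}),  # stubbed result
            },
        }))
        ws.send(json.dumps({"type": "response.create"}))
        break
```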
T-Mobile joins as a real-world test case, using the model to streamline its device upgrade process. The assistant guides customers through a branching set of questions covering eligibility, device selection, and plan compatibility, while maintaining a more responsive, human-like interaction. T-Mobile’s leadership frames the shift as moving beyond incremental IVR improvements toward rebuilding customer journeys from scratch, using AI in a way that fits the company’s “Un-carrier” culture: reduce trade-offs and put an “expert in your pocket,” even when the customer’s path is unpredictable.
Cornell Notes
OpenAI’s gpt-realtime and the upgraded Realtime API aim to deliver low-latency, high-quality voice agents that sound and behave more like people. The new speech model is speech-to-speech, meaning it directly understands audio and generates audio, enabling emotional nuance (like laughs and sighs) and capabilities such as switching languages mid-sentence. Demos emphasized emotional voice quality, strict instruction following (e.g., refusing refunds above $10), and new image input for multimodal guidance. Training improvements rely on high-quality voice data, reward models, and sample-efficient reinforcement learning, with reported benchmark gains for instruction following and function calling. The Realtime API GA adds production features like EU data residency, asynchronous function calling, SIP telephony support, and MCP integration for tool-using agents.
- What makes the new gpt-realtime model different from a typical transcription-then-voice setup?
- How did instruction-following show up in the demo, and what constraint did it respect?
- What new multimodal capability was added to the Realtime API, and how was it used?
- Which training and evaluation changes are credited for better performance?
- What platform features arrive with the Realtime API GA that matter for real deployments?
- How did T-Mobile use the system in a customer-facing workflow?
Review Questions
- Why does a speech-to-speech architecture potentially improve emotional expressiveness compared with a transcription-plus-voice pipeline?
- What kinds of constraints should an instruction-following model handle, and what example constraint was used in the demo?
- Which Realtime API GA features would be most relevant for building a phone-based customer support agent that also needs tool actions?
Key Points
1. OpenAI’s gpt-realtime speech model is speech-to-speech, producing audio directly from audio input to improve speed, emotional nuance, and language switching mid-sentence.
2. The Realtime API GA adds image input, enabling multimodal voice experiences where the model can interpret what’s in front of the user.
3. Instruction following is reinforced with training and evaluations so the model can respect hard constraints (such as refusing refunds above a specified limit).
4. Function calling is a major focus, with reported gains on the audio version of the ComplexFuncBench evaluation and training aimed at choosing the right functions and arguments.
5. Model improvements are attributed to high-quality voice data, specialized reward models, sample-efficient reinforcement learning, and a data flywheel built from real customer use cases.
6. The Realtime API GA includes production features like EU data residency, asynchronous function calling, cache-friendly context tools, SIP telephony support, and MCP integration for tool-using agents.
7. T-Mobile’s device upgrade demo frames the practical value as more human, responsive conversations that handle unpredictable customer paths better than incremental IVR upgrades.