Audio Models in the API
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI is rolling out new audio-focused models and API tools aimed at making voice agents as reliable and developer-friendly as today’s text agents. The centerpiece is a pair of upgraded speech-to-text systems—GPT-4o transcribe and GPT-4o mini transcribe—that outperform the prior Whisper line across every language tested, plus a new text-to-speech model, GPT-4o mini TTS, that lets developers control not only what is said but how it’s delivered. Together with an Agents SDK update, these additions are designed to turn existing text agent workflows into end-to-end voice experiences with minimal code changes.
On the transcription side, OpenAI positions GPT-4o transcribe and GPT-4o mini transcribe as state-of-the-art options measured by word error rate (WER), where a lower rate means more accurate recognition. The larger model is built on OpenAI’s large speech model foundation, trained on trillions of audio tokens and incorporating its latest architectures and training techniques. A smaller distilled variant, GPT-4o mini transcribe, targets faster, more efficient operation while retaining strong transcription quality. OpenAI also highlights pricing parity for GPT-4o transcribe at 0.6 cents per minute (the same as Whisper) and a lower cost for GPT-4o mini transcribe at 0.3 cents per minute.
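Since all three rates are flat per-minute prices, cost scales linearly with audio length. A quick sketch, using only the per-minute figures quoted above:

```python
# Per-minute prices quoted above, in US dollars.
PRICE_PER_MINUTE = {
    "gpt-4o-transcribe": 0.006,       # 0.6 cents/min, same as Whisper
    "gpt-4o-mini-transcribe": 0.003,  # 0.3 cents/min
    "whisper-1": 0.006,
}

def transcription_cost(model: str, minutes: float) -> float:
    """Estimated cost in dollars for transcribing `minutes` of audio."""
    return round(PRICE_PER_MINUTE[model] * minutes, 4)

# Transcribing a 90-minute recording:
print(transcription_cost("gpt-4o-transcribe", 90))       # 0.54
print(transcription_cost("gpt-4o-mini-transcribe", 90))  # 0.27
```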
To support real-time voice interactions, the speech-to-text APIs add streaming: developers can send continuous audio and receive continuous text output, enabling faster conversational loops. The APIs also bundle “hard problems” that typically complicate voice apps—noise cancellation to reduce background interference and a semantic voice activity detector that segments audio based on when the system believes the user has finished speaking. OpenAI says these capabilities are available both in the standard speech-to-text APIs and in its real-time API.
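The streaming loop described above can be sketched without the real API: the client sends continuous audio and consumes incremental text events until the semantic voice activity detector signals end of utterance. The event shapes below (`delta`, `done`) and the generator are illustrative assumptions, not the actual wire format.

```python
from typing import Iterator

def fake_transcription_stream() -> Iterator[dict]:
    """Stand-in for a streaming speech-to-text response: incremental
    text deltas, then a final event once the semantic voice activity
    detector decides the user has finished speaking."""
    for piece in ["Where is ", "my Patagonia ", "jacket order?"]:
        yield {"type": "delta", "text": piece}
    yield {"type": "done"}

def consume(stream: Iterator[dict]) -> str:
    """Accumulate deltas into a full transcript, as a real-time UI would."""
    parts = []
    for event in stream:
        if event["type"] == "delta":
            parts.append(event["text"])  # render partial text immediately
        elif event["type"] == "done":
            break  # end of utterance: hand the transcript to the agent
    return "".join(parts)

print(consume(fake_transcription_stream()))  # Where is my Patagonia jacket order?
```

The key design point is that the client never decides when the utterance ends; it just drains events until the detector emits the final one.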
For speech output, OpenAI introduces GPT-4o mini TTS, presented through an interactive demo (openai.fm). The key capability is an “instructions” field that steers delivery characteristics—tone, pacing, and style—so developers can prompt for specific performance rather than relying on a fixed voice personality. The demo shows how the same underlying voice can be prompted for dramatically different styles, from high-energy “mad scientist” delivery to a calmer, supportive tone. GPT-4o mini TTS is priced at 1 cent per minute, framed as an economical way to generate lively audio.
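The "instructions" control can be pictured as one extra field on the speech request, separate from the text being spoken. The payload below is a sketch of that idea; treat the exact field names as assumptions to check against the API reference, and the voice name "coral" as just an example.

```python
import json

def build_tts_request(text: str, style: str) -> dict:
    """Assemble a text-to-speech request where `instructions` steers
    delivery (tone, pacing, style) independently of the words spoken."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "coral",          # example voice name
        "input": text,             # what is said
        "instructions": style,     # how it is delivered
    }

request = build_tts_request(
    "Your refund has been processed.",
    "Calm, supportive customer-service tone; unhurried pacing.",
)
print(json.dumps(request, indent=2))
```

Swapping only the `style` argument, say to a high-energy "mad scientist" delivery, changes the performance without touching the spoken text.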
The final piece is an Agents SDK update that makes it straightforward to convert text agents into voice agents. OpenAI’s Agents SDK already packages reliability features such as guardrails, tool/function calling, and structured agent execution. The new “voice pipeline” concept sits alongside an existing text workflow: audio from a UI is streamed to the backend, speech is transcribed, the text workflow runs (including tool calls), and the response is converted back into speech for playback. In the demo, a customer-support agent that can look up Patagonia jacket orders and handle refunds is upgraded to a phone-style voice interaction with only a few lines of code.
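The pipeline shape described above (transcribe, run the unchanged text workflow, synthesize) can be sketched with stubs standing in for the models. All function names here are illustrative, not the Agents SDK API:

```python
def speech_to_text(audio: bytes) -> str:
    """Stub for GPT-4o transcribe: audio bytes in, transcript out."""
    return "I'd like a refund for my Patagonia jacket."

def text_workflow(user_text: str) -> str:
    """The existing text agent, untouched: guardrails, tool calls, etc."""
    if "refund" in user_text.lower():
        return "I've started your refund; you'll get a confirmation email."
    return "How can I help with your order?"

def text_to_speech(reply: str) -> bytes:
    """Stub for GPT-4o mini TTS: reply text in, audio bytes out."""
    return reply.encode("utf-8")

def voice_pipeline(audio_in: bytes) -> bytes:
    """Wrap the text workflow with speech on both ends."""
    transcript = speech_to_text(audio_in)
    reply = text_workflow(transcript)
    return text_to_speech(reply)

audio_out = voice_pipeline(b"<caller audio>")
print(audio_out.decode("utf-8"))
```

Because the text workflow in the middle is untouched, this is why the demo's upgrade takes only a few lines: the voice layer wraps the agent rather than replacing it.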
OpenAI also updates its tracing UI to support audio, letting developers inspect timelines, latencies, and errors while listening to recorded requests. The release is paired with a short contest: developers are invited to use openai.fm to build creative applications of the text-to-speech model and share them on X/Twitter, with three winners receiving a Teenage Engineering special-edition radio.
Cornell Notes
OpenAI is launching new audio models and API features to help developers build voice agents that behave like reliable text agents. GPT-4o transcribe and GPT-4o mini transcribe deliver lower word error rates than Whisper across languages, with streaming transcription plus noise cancellation and semantic voice activity detection. GPT-4o mini TTS adds controllable speech output via an “instructions” field, letting developers shape tone and pacing. An Agents SDK update introduces a voice pipeline that wraps existing text-agent workflows with speech-to-text on input and text-to-speech on output. OpenAI also expands its tracing UI to include audio, supporting debugging with timelines, latencies, and error inspection.
What makes GPT-4o transcribe and GPT-4o mini transcribe a step up from Whisper?
How do streaming transcription and voice activity detection change how voice apps are built?
What control does GPT-4o mini TTS give developers over speech output?
How does the Agents SDK update turn a text agent into a voice agent?
What does audio-aware tracing add for debugging voice agents?
Review Questions
- How does word error rate (WER) function as a metric, and what does a lower WER imply for transcription quality?
- Describe the role of semantic voice activity detection in a streaming voice pipeline.
- What is the purpose of the “voice pipeline” in the Agents SDK, and how does it connect speech-to-text, an existing text-agent workflow, and text-to-speech?
Key Points
1. OpenAI released GPT-4o transcribe and GPT-4o mini transcribe as new speech-to-text models, reporting lower word error rates than Whisper across languages tested.
2. Speech-to-text APIs now support streaming, enabling continuous audio input and continuous text output for faster voice interactions.
3. Noise cancellation and semantic voice activity detection are bundled into speech-to-text, reducing developer burden for handling background sound and end-of-utterance timing.
4. OpenAI introduced GPT-4o mini TTS with an “instructions” field so developers can control tone, pacing, and delivery style rather than relying on a fixed voice personality.
5. An Agents SDK update adds a voice pipeline that converts existing text-agent workflows into voice agents by adding speech-to-text input and text-to-speech output.
6. The tracing UI was updated to support audio, letting developers debug with playback plus timelines, latencies, and error details.