Audio Models in the API

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI released GPT-4o transcribe and GPT-4o mini transcribe as new speech-to-text models, reporting lower word error rates than Whisper across languages tested.

Briefing

OpenAI is rolling out new audio-focused models and API tools aimed at making voice agents as reliable and developer-friendly as today’s text agents. The centerpiece is a pair of upgraded speech-to-text systems—GPT-4o transcribe and GPT-4o mini transcribe—that outperform the prior Whisper line across every language tested, plus a new text-to-speech model, GPT-4o mini TTS, that lets developers control not only what is said but how it’s delivered. Together with an Agents SDK update, these additions are designed to turn existing text agent workflows into end-to-end voice experiences with minimal code changes.

On the transcription side, OpenAI positions GPT-4o transcribe and GPT-4o mini transcribe as state-of-the-art options measured by word error rate, where lower error rates mean more accurate recognition. The larger model is built on OpenAI's large speech model foundation, trained on trillions of audio tokens and incorporating the latest model technologies and architecture. A smaller distilled variant, GPT-4o mini transcribe, targets faster, more efficient operation while retaining strong transcription quality. OpenAI also highlights pricing parity for GPT-4o transcribe at 0.6 cents per minute (same as Whisper) and a lower cost for GPT-4o mini transcribe at 0.3 cents per minute.
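
A minimal sketch of calling the new transcription models through the OpenAI Python SDK; the model names match the announcement, but the file name and surrounding code are illustrative rather than taken from the video:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the larger model; swap in
# "gpt-4o-mini-transcribe" for the cheaper, faster variant.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```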

To support real-time voice interactions, the speech-to-text APIs add streaming: developers can send continuous audio and receive continuous text output, enabling faster conversational loops. The APIs also bundle “hard problems” that typically complicate voice apps—noise cancellation to reduce background interference and a semantic voice activity detector that segments audio based on when the system believes the user has finished speaking. OpenAI says these capabilities are available both in the standard speech-to-text APIs and in its real-time API.
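
The video doesn't walk through code, but a session configuration along these lines shows where the bundled features plug in; the field names (semantic_vad, near_field) follow OpenAI's published Realtime API and should be checked against the current reference before use:

```python
# Illustrative configuration for a realtime transcription session.
session_config = {
    "input_audio_format": "pcm16",
    "input_audio_transcription": {"model": "gpt-4o-transcribe"},
    # Semantic voice activity detection: the server decides when the speaker
    # has finished based on meaning, not just a fixed silence threshold.
    "turn_detection": {"type": "semantic_vad"},
    # Built-in noise cancellation tuned for close-talking microphones.
    "input_audio_noise_reduction": {"type": "near_field"},
}
```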

For speech output, OpenAI introduces GPT-4o mini TTS, presented through an interactive demo (openai.fm). The key capability is an “instructions” field that steers delivery characteristics—tone, pacing, and style—so developers can prompt for specific performance rather than relying on a fixed voice personality. The demo shows how the same underlying voice can be prompted for dramatically different styles, from high-energy “mad scientist” delivery to a calmer, supportive tone. GPT-4o mini TTS is priced at 1 cent per minute, framed as an economical way to generate lively audio.
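
A minimal sketch of the same idea through the OpenAI Python SDK; the instructions wording is invented for illustration, and only the model name and the existence of the instructions field come from the announcement:

```python
from openai import OpenAI

client = OpenAI()

# Steer the delivery with the "instructions" field rather than a fixed persona.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="Your order has shipped and should arrive on Thursday.",
    instructions="Speak like a calm, reassuring support agent, slightly slower than normal.",
)

# Write the returned audio bytes to a file for playback.
with open("reply.mp3", "wb") as f:
    f.write(speech.content)
```

Changing only the instructions string (say, to a high-energy "mad scientist" brief) yields a very different performance of the same input text.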

The final piece is an Agents SDK update that makes it straightforward to convert text agents into voice agents. OpenAI’s Agents SDK already packages reliability features such as guardrails, tool/function calling, and structured agent execution. The new “voice pipeline” concept sits alongside an existing text workflow: audio from a UI is streamed to the backend, speech is transcribed, the text workflow runs (including tool calls), and the response is converted back into speech for playback. In the demo, a customer-support agent that can look up Patagonia jacket orders and handle refunds is upgraded to a phone-style voice interaction with only a few lines of code.
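
A rough approximation of that pattern with the Python Agents SDK; the class names (VoicePipeline, SingleAgentVoiceWorkflow) come from the SDK's voice extension and the tool is stubbed, so treat this as a sketch of the demo rather than its actual code:

```python
from agents import Agent, function_tool
from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow

@function_tool
def get_last_order(customer_id: str) -> str:
    """Look up the customer's most recent order (stubbed for this sketch)."""
    return "Order #1042: Patagonia down jacket, delivered March 12."

# The existing text agent: instructions, tools, and guardrails stay unchanged.
support_agent = Agent(
    name="Support agent",
    instructions="Help customers with orders and refunds. Be concise.",
    tools=[get_last_order],
)

# Wrapping the agent in a voice pipeline adds speech-to-text on the way in
# and text-to-speech on the way out.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(support_agent))
```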

OpenAI also updates the tracing UI to support audio, letting developers inspect timelines, latencies, and errors while listening to recorded requests. The release is paired with a short contest: developers are invited to use openai.fm to build creative applications of the text-to-speech technology and share them on X/Twitter, with three winners receiving a Teenage Engineering special-edition radio.

Cornell Notes

OpenAI is launching new audio models and API features to help developers build voice agents that behave like reliable text agents. GPT-4o transcribe and GPT-4o mini transcribe deliver lower word error rates than Whisper across languages, with streaming transcription plus noise cancellation and semantic voice activity detection. GPT-4o mini TTS adds controllable speech output via an "instructions" field, letting developers shape tone and pacing. An Agents SDK update introduces a voice pipeline that wraps existing text-agent workflows with speech-to-text on input and text-to-speech on output. OpenAI also expands the tracing UI to include audio, supporting debugging with timelines, latencies, and error inspection.

What makes GPT-4o transcribe and GPT-4o mini transcribe a step up from Whisper?

OpenAI measures transcription quality using word error rate (WER), where a lower WER means fewer incorrect words. GPT-4o transcribe and GPT-4o mini transcribe are reported to outperform the previous Whisper generation on every language tested. Architecturally, GPT-4o transcribe is built on OpenAI's large speech model and trained on trillions of audio tokens, while GPT-4o mini transcribe is a distilled, smaller model designed to be faster and more efficient while keeping strong transcription capability. Pricing is also positioned as practical: GPT-4o transcribe at 0.6 cents per minute (same as Whisper) and GPT-4o mini transcribe at 0.3 cents per minute.
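
For intuition, WER is a word-level edit distance normalized by the length of the reference transcript. A small illustrative implementation (not from the video):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> WER = 0.2
print(word_error_rate("please refund my last order", "please refund my last porter"))
```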

How do streaming transcription and voice activity detection change how voice apps are built?

Streaming lets developers send a continuous stream of audio and receive a continuous stream of text, which supports lower-latency, more natural conversations. OpenAI also bundles noise cancellation so background sounds don’t derail recognition. The semantic voice activity detector segments audio based on when the model believes the user has finished speaking, reducing the need for developers to manually detect end-of-utterance or handle half-spoken inputs.
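
A rough sketch of consuming a streamed transcription with the Python SDK; the stream flag and event type names follow OpenAI's published API reference rather than the video, so verify them against current docs:

```python
from openai import OpenAI

client = OpenAI()

# Print partial transcripts as they arrive instead of waiting for the full result.
with open("call.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)   # incremental text
        elif event.type == "transcript.text.done":
            print()                                  # final transcript complete
```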

What control does GPT-4o mini TTS give developers over speech output?

GPT-4o mini TTS introduces an “instructions” field that guides how the model speaks the text—covering tone, pacing, and style. The demo emphasizes that the personality/tone isn’t baked into the model; it’s driven by prompt instructions. By changing the instructions, the same text can be delivered in very different ways, such as a chaotic “mad scientist” style versus a supportive, calm delivery.

How does the Agents SDK update turn a text agent into a voice agent?

The update adds a voice pipeline that wraps an existing text-agent workflow. Audio chunks from the UI are accumulated, then sent through speech-to-text. The resulting transcript feeds into the existing Agents SDK runner (including tool/function calls and guardrails), and the agent’s text output is converted back into speech via text-to-speech. The demo shows a customer-support agent (with web search for styling and access to past orders/refunds) being upgraded so a user can ask, “What was my last order?” and receive a spoken response.
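
Continuing the earlier pipeline sketch, running one spoken turn might look roughly like this; AudioInput and the streamed event name are taken from the Agents SDK's voice extension and are assumptions to check against its docs:

```python
import asyncio
import numpy as np
from agents.voice import AudioInput

async def handle_turn(pipeline, pcm_buffer: np.ndarray) -> list:
    """Run one user turn through the voice pipeline; return synthesized reply audio chunks."""
    result = await pipeline.run(AudioInput(buffer=pcm_buffer))
    chunks = []
    async for event in result.stream():
        # Event type name per the Agents SDK voice docs (assumption).
        if event.type == "voice_stream_event_audio":
            chunks.append(event.data)  # one chunk of reply audio for playback
    return chunks

# Usage with the `pipeline` from the earlier sketch and a NumPy buffer of
# 16-bit PCM microphone samples:
# reply_audio = asyncio.run(handle_turn(pipeline, recorded_samples))
```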

What does audio-aware tracing add for debugging voice agents?

The tracing UI is updated to support audio, so developers can inspect traces from recent conversations and click into the events of an individual request. The system integrates playback with metadata such as timelines, latencies, and errors, making it easier to diagnose where delays or failures occur across the speech-to-text, agent reasoning/tool calls, and text-to-speech steps.

Review Questions

  1. How does word error rate (WER) function as a metric, and what does a lower WER imply for transcription quality?
  2. Describe the role of semantic voice activity detection in a streaming voice pipeline.
  3. What is the purpose of the “voice pipeline” in the Agents SDK, and how does it connect speech-to-text, an existing text-agent workflow, and text-to-speech?

Key Points

  1. OpenAI released GPT-4o transcribe and GPT-4o mini transcribe as new speech-to-text models, reporting lower word error rates than Whisper across languages tested.

  2. Speech-to-text APIs now support streaming, enabling continuous audio input and continuous text output for faster voice interactions.

  3. Noise cancellation and semantic voice activity detection are bundled into speech-to-text, reducing developer burden for handling background sound and end-of-utterance timing.

  4. OpenAI introduced GPT-4o mini TTS with an "instructions" field so developers can control tone, pacing, and delivery style rather than relying on a fixed voice personality.

  5. An Agents SDK update adds a voice pipeline that converts existing text-agent workflows into voice agents by adding speech-to-text input and text-to-speech output.

  6. The tracing UI was updated to support audio, letting developers debug with playback plus timelines, latencies, and error details.

Highlights

GPT-4o transcribe and GPT-4o mini transcribe are positioned as state-of-the-art across languages, with accuracy reported via word error rate and strong performance relative to Whisper.
Streaming transcription plus semantic voice activity detection aims to make voice conversations feel responsive without requiring custom end-of-speech logic.
GPT-4o mini TTS uses an “instructions” field to steer how speech sounds—tone and pacing—driven by prompt rather than a fixed personality.
The Agents SDK “voice pipeline” wraps existing text-agent logic so audio in becomes transcript, agent reasoning runs, and speech out is generated with minimal code changes.

Topics

Mentioned

  • Olivia Gar
  • Shen
  • Yaroslav
  • Jeff Harris
  • API
  • TTS
  • WER
  • UI
  • SDK
  • LLM