
OpenAI GPT-4o | First Impressions and Some Testing + API

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-4o is positioned as a real-time multimodal model with reported conversational latency around 320 milliseconds.

Briefing

OpenAI’s newly released GPT-4o models are positioned as a real-time, multimodal “reasoning” system that can work across text, images, and audio with notably low latency—reported around 320 milliseconds, roughly in the range of typical human conversational turn-taking. That speed matters because it makes interactive voice and vision use feel less like a chat window and more like a responsive assistant. The update also claims major cost and performance improvements, including API pricing 50% cheaper than the prior GPT-4 Turbo baseline and stronger vision and audio understanding.

Early testing described in the transcript focuses heavily on what’s available right now through the API: text and image inputs with text outputs. Audio input/output is mentioned as not yet supported in the API documentation at the time of testing, even though the live stream demonstrations included voice features such as interruption handling, real-time tone/emotion adjustments, and voice input cues (for example, responding differently when a user speaks in a “sad” tone). The tester therefore builds scripts around image analysis first, using base64 encoding and batching multiple images from a folder into a single “image analyzer” workflow.
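
The transcript doesn’t show the script itself, but a minimal sketch of that batching pattern, assuming the official openai Python SDK and a local folder of PNG slides (both assumptions, not details from the video), could look like this:

```python
import base64
from pathlib import Path

from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: Path) -> str:
    """Return a file's bytes as a base64 string for use in a data URL."""
    return base64.b64encode(path.read_bytes()).decode("utf-8")


# Hypothetical folder name; the transcript only says images are read from a folder.
content = [{"type": "text", "text": "Describe each slide, then explain how they fit together."}]
for path in sorted(Path("images").glob("*.png")):
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
    })

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```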

In one image workflow, multiple slide images—each representing different “architectures” in a mixture-of-models setup—are fed into GPT-4o for description and explanation. The output is treated as strong because it produces structured summaries for each image and then synthesizes a final explanation that ties the architectures together. The transcript highlights a “mixture of models” framing: responses generated by different model roles (described as “king,” “co-founder,” and “democracy” components) are refined, discussed, and voted on to produce a more well-rounded answer.
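
The transcript only sketches this voting idea at the slide level; the following is an illustrative toy of a draft-and-vote round built on the chat API. The role prompts, the ask helper, and the sample question are all hypothetical, and no error handling is included:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()


def ask(system: str, user: str) -> str:
    """One chat completion under a given role prompt (hypothetical helper)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content


question = "Explain the trade-offs of caching model outputs."  # placeholder question
# Stand-ins for the slides' "king"/"co-founder"/"democracy" roles.
drafts = [ask(f"You are advisor #{i + 1}. Answer concisely.", question) for i in range(3)]
ballot = "\n\n".join(f"{i + 1}. {d}" for i, d in enumerate(drafts))
# Each voter replies with just the number of the best draft; majority wins.
votes = [ask("Reply with only the number (1, 2, or 3) of the best answer.", ballot)
         for _ in range(3)]
winner = int(Counter(v.strip()[0] for v in votes).most_common(1)[0][0])
print(drafts[winner - 1])
```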

The testing then shifts to direct image-based reasoning. A drawn triangle image is used to ask for calculations, with GPT-4o reportedly performing checks like verifying the triangle inequality theorem, determining whether it’s a right triangle via the Pythagorean theorem, and computing area. The tester also compares speed against GPT-4 Turbo using a longer writing task (three paragraphs about life in Paris in the 1800s), reporting a large gap in throughput: GPT-4o around 110 tokens per second versus GPT-4 Turbo around 20 tokens per second—described as roughly five times faster—along with lower latency and a shorter, fewer-token response in the comparison.
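
The side lengths of the drawn triangle aren’t given in the transcript; as a worked illustration of the checks described, assuming sides of 3, 4, and 5:

```python
import math

# Hypothetical side lengths; the video's hand-drawn triangle doesn't state them.
a, b, c = 3.0, 4.0, 5.0

# Triangle inequality: every side must be shorter than the sum of the other two.
is_triangle = a + b > c and a + c > b and b + c > a

# Right-triangle check via the Pythagorean theorem (c is the longest side here).
is_right = math.isclose(a**2 + b**2, c**2)

# Area via Heron's formula, valid for any triangle that passes the inequality check.
s = (a + b + c) / 2
area = math.sqrt(s * (s - a) * (s - b) * (s - c))

print(is_triangle, is_right, area)  # True True 6.0
```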

Finally, a couple of logic-style prompts are used as quick sanity checks. For a marble-in-a-microwave puzzle, GPT-4o is reported to give an answer that matches the expected outcome (the marble ends up on the microwave tray/floor area rather than inside the cup as originally oriented). Another prompt—writing sentences ending with “apples”—is used to compare accuracy, with GPT-4 Turbo reportedly hitting 10/10 while GPT-4o is said to miss one.

A major theme is access: the transcript claims OpenAI plans to bring GPT-4o to free users, which would change the competitive landscape against other assistants and multimodal models. The tester closes by promising a deeper follow-up on Wednesday, after more practical evaluation—especially once audio capabilities become testable via the API.

Cornell Notes

GPT-4o is presented as a multimodal model built for real-time interaction, with reported conversational latency around 320 ms and claims of 50% cheaper API costs. The transcript’s hands-on tests emphasize image understanding and reasoning because the API at the time accepts text and images (audio support wasn’t available yet). In image tests, GPT-4o produced structured slide explanations and performed math reasoning from a drawn triangle (including Pythagorean checks and area calculation). Speed comparisons against GPT-4 Turbo show much higher throughput (about 110 tokens/sec vs 20 tokens/sec), suggesting faster responses for comparable tasks. Quick logic checks show mixed results: GPT-4o handled the marble puzzle correctly, while GPT-4 Turbo reportedly performed better on an “apples” sentence constraint task.

What makes GPT-4o feel different for interactive use, based on the transcript’s reported metrics?

The key differentiator is latency: an average around 320 milliseconds, described as similar to human response time in conversation. That low delay is paired with claims of faster API performance (the transcript later reports ~110 tokens/sec for GPT-4o vs ~20 tokens/sec for GPT-4 Turbo), which together aim to make voice/vision interactions feel immediate rather than “wait for the next message.”

Why did the tester focus on images instead of audio?

The transcript notes that the API documentation at the time accepted text and images with text output, but audio input/output wasn’t available yet. Even though the live stream showed voice features like emotion/tone changes and interruption handling, the tester couldn’t run audio tests through the API and therefore built image-first scripts.

How did the transcript’s image workflow demonstrate GPT-4o’s understanding?

A script batches multiple images (encoded with base64) from a folder and sends them to GPT-4o for analysis. In one example, the model generates per-image summaries and then a consolidated explanation of a “mixture of models” system, including how different roles contribute (refinement, discussion, and voting) to produce a final answer.

What kinds of reasoning tasks were tested using a single image?

A hand-drawn triangle image was used to ask for calculations and geometry checks. GPT-4o reportedly handled triangle inequality verification, right-triangle detection via the Pythagorean theorem, and area calculation, and the transcript describes the response as fast enough to check latency in near real time.

What did the transcript claim about speed versus GPT-4 Turbo?

For a longer writing task, the transcript reports a major throughput gap: GPT-4o at roughly 110 tokens per second versus GPT-4 Turbo at about 20 tokens per second. The tester also notes the latency difference and that GPT-4o produced fewer tokens in the comparison, describing the overall effect as about five times faster.
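
The transcript doesn’t explain how the tokens/sec figures were measured; one rough way to reproduce that kind of comparison with streaming responses, assuming the official openai SDK and using word count as a crude stand-in for tokens, is:

```python
import time

from openai import OpenAI

client = OpenAI()
PROMPT = "Write three paragraphs about life in Paris in the 1800s."


def rough_throughput(model: str) -> float:
    """Stream one completion and estimate output speed in words per second."""
    start = time.time()
    text = ""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no text delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            text += chunk.choices[0].delta.content
    return len(text.split()) / (time.time() - start)


for model in ("gpt-4o", "gpt-4-turbo"):
    print(model, round(rough_throughput(model), 1), "words/sec")
```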

How did GPT-4o perform on the logic-style prompts compared with GPT-4 Turbo?

On a marble-in-a-microwave puzzle, GPT-4o is reported to give the correct outcome (marble left on the microwave tray/floor area rather than remaining inside the upside-down cup as oriented). On a constrained writing task—writing 10 sentences ending with “apples”—GPT-4o is said to produce 9/10, while GPT-4 Turbo reportedly gets all 10 correct.

Review Questions

  1. What latency and cost claims are associated with GPT-4o, and why do they matter for real-time multimodal interaction?
  2. Which transcript tests were possible via the API at the time, and what limitation prevented audio evaluation?
  3. Based on the speed and logic tests, where does GPT-4o look strongest, and where did GPT-4 Turbo outperform it?

Key Points

  1. GPT-4o is positioned as a real-time multimodal model with reported conversational latency around 320 milliseconds.
  2. OpenAI claims GPT-4o API pricing is 50% cheaper than the prior GPT-4 Turbo baseline and that it improves vision/audio understanding.
  3. API testing in the transcript focused on text+image inputs because audio input/output wasn’t available yet in the documented interface.
  4. Image reasoning tests included structured slide explanations and geometry/math calculations from a drawn triangle (Pythagorean and area checks).
  5. Throughput comparisons reported about 110 tokens/sec for GPT-4o versus about 20 tokens/sec for GPT-4 Turbo, roughly a fivefold speed difference.
  6. Quick logic checks were mixed: GPT-4o reportedly solved the marble puzzle correctly but missed one sentence in an “apples” constraint task where GPT-4 Turbo hit 10/10.
  7. A major access claim is that GPT-4o will be brought to free users, potentially reshaping competition with other assistants.

Highlights

Reported average latency of ~320 ms is framed as the difference between “chat” and conversation-like responsiveness.
The transcript’s image tests show GPT-4o producing both per-image summaries and a synthesized explanation across multiple slides.
A speed comparison claims GPT-4o runs at ~110 tokens/sec versus ~20 tokens/sec for GPT-4 Turbo—about five times faster.
