OpenAI GPT-4o | First Impressions and Some Testing + API
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s newly released GPT-4o models are positioned as a real-time, multimodal “reasoning” system that can work across text, images, and audio with notably low latency—reported around 320 milliseconds, roughly in the range of typical human conversational turn-taking. That speed matters because it makes interactive voice and vision use feel less like a chat window and more like a responsive assistant. The update also claims major cost and performance improvements, including API pricing described as “50% cheaper” than GPT-4 Turbo, along with stronger vision and audio understanding.
Early testing described in the transcript focuses heavily on what’s available right now through the API: text and image inputs with text outputs. Audio input/output is mentioned as not yet supported in the API documentation at the time of testing, even though the live stream demonstrations included voice features such as interruption handling, real-time tone/emotion adjustments, and voice input cues (for example, responding differently when a user speaks in a “sad” tone). The tester therefore builds scripts around image analysis first, using base64 encoding and batching multiple images from a folder into a single “image analyzer” workflow.
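The batching workflow described above can be sketched in Python. This is a minimal, hypothetical reconstruction (the transcript does not show the actual script): images from a folder are base64-encoded and packed into a single request payload, modelled loosely on the data-URI style used by vision-capable chat APIs. The function and field names here are illustrative assumptions, and no network call is made.

```python
import base64
from pathlib import Path


def encode_image(path: Path) -> str:
    """Base64-encode an image file so it can be embedded in a request body."""
    return base64.b64encode(path.read_bytes()).decode("utf-8")


def build_image_request(image_paths, prompt):
    """Batch several images plus one text prompt into a single message list.

    The structure (a user message whose content mixes a text part and
    image_url parts with data URIs) follows the common vision-API pattern;
    treat it as a sketch, not the exact payload used in the video.
    """
    content = [{"type": "text", "text": prompt}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"},
        })
    return [{"role": "user", "content": content}]
```

A caller would then pass this message list to the chat-completions client of their choice, sending all slides in one request rather than one request per image.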
In one image workflow, multiple slide images—each representing different “architectures” in a mixture-of-models setup—are fed into GPT-4o for description and explanation. The output is treated as strong because it produces structured summaries for each image and then synthesizes a final explanation that ties the architectures together. The transcript highlights a “mixture of models” framing: responses generated by different model roles (described as “king,” “co-founder,” and “democracy” components) are refined, discussed, and voted on to produce a more well-rounded answer.
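The “mixture of models” voting idea can be illustrated with a short sketch. This is an assumption-laden toy version, not the setup from the slides: each role (the transcript names “king,” “co-founder,” and “democracy” components) produces an answer, and a simple majority vote selects the final response. Real systems would add refinement and discussion rounds; the stub agents below stand in for actual model calls.

```python
from collections import Counter


def mixture_vote(prompt, agents):
    """Toy mixture-of-models step: collect one answer per role, then
    return the majority answer along with the per-role answers.

    `agents` maps a role name to a callable that takes the prompt and
    returns that role's answer string.
    """
    answers = {role: ask(prompt) for role, ask in agents.items()}
    winner, _count = Counter(answers.values()).most_common(1)[0]
    return winner, answers


# Stub agents standing in for real model calls with different role prompts.
agents = {
    "king": lambda p: "Answer A",
    "co-founder": lambda p: "Answer A",
    "democracy": lambda p: "Answer B",
}
final_answer, per_role = mixture_vote("Explain the architecture.", agents)
```

In this toy run, “Answer A” wins two votes to one; the appeal of the pattern is that a single outlier response gets outvoted rather than returned directly.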
The testing then shifts to direct image-based reasoning. A drawn triangle image is used to ask for calculations, with GPT-4o reportedly performing checks like verifying the triangle inequality theorem, determining whether it’s a right triangle via the Pythagorean theorem, and computing area. The tester also compares speed against GPT-4 Turbo using a longer writing task (three paragraphs about life in Paris in the 1800s), reporting a large gap in throughput: GPT-4o at around 110 tokens per second versus GPT-4 Turbo at around 20 tokens per second—described as roughly five times faster—along with lower latency and fewer output tokens.
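A throughput comparison like the one above can be reproduced with a small timing harness. This is a generic sketch, not the tester’s script: `generate` is any callable that returns the full completion text, and the token count is approximated by whitespace splitting (real comparisons would use the model’s own tokenizer counts from the API response).

```python
import time


def measure_throughput(generate, prompt):
    """Time a completion call and return approximate tokens per second.

    Tokens are approximated by whitespace-splitting the returned text,
    which undercounts relative to a real tokenizer but is fine for a
    rough side-by-side comparison of two models on the same prompt.
    """
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = len(text.split())
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```

Running the same prompt through two models and comparing the returned rates is how a figure like “110 vs 20 tokens per second” would be derived.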
Finally, a couple of logic-style prompts are used as quick sanity checks. For a marble-in-a-microwave puzzle, GPT-4o is reported to give an answer that matches the expected outcome (the marble ends up on the microwave tray/floor area rather than inside the cup as originally oriented). Another prompt—writing sentences ending with “apples”—is used to compare accuracy, with GPT-4 Turbo reportedly hitting 10/10 while GPT-4o is said to miss one.
A major theme is access: the transcript claims OpenAI plans to bring GPT-4o to free users, which would change the competitive landscape against other assistants and multimodal models. The tester closes by promising a deeper follow-up on Wednesday, after more practical evaluation—especially once audio capabilities become testable via the API.
Cornell Notes
GPT-4o is presented as a multimodal model built for real-time interaction, with reported conversational latency around 320 ms and claims of 50% cheaper API costs. The transcript’s hands-on tests emphasize image understanding and reasoning because the API at the time accepts text and images (audio support wasn’t available yet). In image tests, GPT-4o produced structured slide explanations and performed math reasoning from a drawn triangle (including Pythagorean checks and area calculation). Speed comparisons against GPT-4 Turbo show much higher throughput (about 110 tokens/sec vs 20 tokens/sec), suggesting faster responses for comparable tasks. Quick logic checks show mixed results: GPT-4o handled the marble puzzle correctly, while GPT-4 Turbo reportedly performed better on an “apples” sentence constraint task.
- What makes GPT-4o feel different for interactive use, based on the transcript’s reported metrics?
- Why did the tester focus on images instead of audio?
- How did the transcript’s image workflow demonstrate GPT-4o’s understanding?
- What kinds of reasoning tasks were tested using a single image?
- What did the transcript claim about speed versus GPT-4 Turbo?
- How did GPT-4o perform on the logic-style prompts compared with GPT-4 Turbo?
Review Questions
- What latency and cost claims are associated with GPT-4o, and why do they matter for real-time multimodal interaction?
- Which transcript tests were possible via the API at the time, and what limitation prevented audio evaluation?
- Based on the speed and logic tests, where does GPT-4o look strongest, and where did GPT-4 Turbo outperform it?
Key Points
1. GPT-4o is positioned as a real-time multimodal model with reported conversational latency around 320 milliseconds.
2. OpenAI claims GPT-4o API pricing is 50% cheaper than GPT-4 Turbo and that it improves vision/audio understanding.
3. API testing in the transcript focused on text+image inputs because audio input/output wasn’t available yet in the documented interface.
4. Image reasoning tests included structured slide explanations and geometry/math calculations from a drawn triangle (Pythagorean and area checks).
5. Throughput comparisons reported about 110 tokens/sec for GPT-4o versus about 20 tokens/sec for GPT-4 Turbo, roughly a fivefold speed difference.
6. Quick logic checks were mixed: GPT-4o reportedly solved the marble puzzle correctly but missed one sentence in an “apples” constraint task where GPT-4 Turbo hit 10/10.
7. A major access claim is that GPT-4o will be brought to free users, potentially reshaping competition with other assistants.