OpenAI Humbles EVERYONE. This Chatbot FEELS Alive!
Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
OpenAI’s latest ChatGPT overhaul centers on GPT-4-class performance delivered in near real time, plus a new “omni” interaction style that can take in text, audio, and images while responding with human-like timing and emotion. The flagship model, GPT-4o, is positioned as faster than the prior GPT-4 and GPT-4 Turbo options, with demos showing markedly quicker generation and stronger factual recall (including an example where GPT-4o produced about 60 facts versus roughly 20 from the older GPT-4 in a slower run). It also lands in the API immediately and is described as cheaper, reinforcing the push to make the most capable model broadly usable.
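Since the model is described as landing in the API immediately, a minimal sketch of what a call might look like is shown below. It assumes the official `openai` Python SDK (v1+), the `gpt-4o` model identifier, and an API key in the `OPENAI_API_KEY` environment variable; treat it as illustrative, not a transcript of the demo.

```python
# A minimal sketch of calling the new flagship model through the chat API.
# Assumes the official `openai` Python SDK (v1+) and the "gpt-4o" model id.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # the flagship model discussed in the video
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three facts about Roman aqueducts."},
    ],
)

print(response.choices[0].message.content)
```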
The most eye-catching change is the new voice-and-vision interface for ChatGPT, rolling out gradually over the next few weeks and initially limited to ChatGPT Plus users. In the demo, the assistant speaks with more expressive, multi-voice options (including both male and female voices), and the conversation is interruptible: it stops when the user cuts in, listens, then continues based on what was said. Timing is framed as human-adjacent, with audio responses reported as low as 232 milliseconds and around 320 milliseconds on average. The “o” in GPT-4o stands for “omni,” described as a step toward more natural human-computer interaction in which the system accepts multiple input forms and responds in a fluid back-and-forth rather than forcing rigid turn-taking.
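The native voice mode shown in the demo lives in the ChatGPT app rather than anything exposed here, but a rough text-in, audio-out loop can be approximated with OpenAI's existing text-to-speech endpoint. The sketch below assumes the pre-existing `tts-1` model and `alloy` voice, which predate GPT-4o's native audio, so it mimics only the pipeline, not the demo's latency or expressiveness.

```python
# A rough speak-back loop approximating the demo's voice replies.
# `tts-1`/`alloy` are the pre-existing TTS model and voice, used here as a
# stand-in for GPT-4o's native audio path, which is not in the public API.
from openai import OpenAI

client = OpenAI()

# 1) Get a text reply from the flagship model.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Greet me in one short sentence."}],
).choices[0].message.content

# 2) Synthesize that reply to speech with the separate TTS endpoint.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)

with open("reply.mp3", "wb") as f:
    f.write(speech.read())  # raw MP3 bytes
```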
Beyond conversation, the overhaul emphasizes real-world perception and tutoring. A multi-agent-style demo shows one AI describing what a camera sees while another AI, which cannot see, asks questions and directs the camera based on that description, creating a collaborative workflow. Another segment demonstrates a tutoring flow on Khan Academy math: the user shares a screen, the assistant asks guiding questions without directly giving away answers, and the student works through identifying the triangle's sides and applying trigonometric relationships to solve for sin(α). The same “flip on the camera” concept is used for quick, practical tasks like translating object names into Spanish.
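The tutoring segment hinges on the standard right-triangle definition of sine; the demo's specific side lengths aren't reproduced here, so the relation the student is guided toward is stated generically:

```latex
% The right-triangle identity the guided tutoring works toward
% (demo-specific side lengths omitted, since they aren't given here).
\[
  \sin(\alpha) = \frac{\text{opposite}}{\text{hypotenuse}}
\]
```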
OpenAI also expands the assistant’s reach to desktop and accessibility. A desktop app concept listens to desktop audio and watches the screen to help in real time—summarizing meetings and assisting with tasks as they happen. Accessibility demos include “Be My Eyes”-style assistance for a blind user, where the assistant describes surroundings and even helps interpret real-time cues like traffic signals and taxi movement.
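Both the screen-watching concept and the Be My Eyes-style demo reduce to the same primitive: sending a captured frame to the model and asking for a description. A minimal sketch under that assumption follows; it uses the documented base64 image input for chat completions, with `frame.png` as a placeholder for a real screenshot or camera capture (continuous real-time streaming is not shown).

```python
# A minimal "describe what you see" sketch: send one captured frame to the
# model as a base64-encoded image. `frame.png` is a placeholder for a real
# screenshot or camera capture.
import base64
from openai import OpenAI

client = OpenAI()

with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this frame for someone who cannot see it."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```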
On performance, OpenAI’s evaluations are presented as a competitive advantage across speech and vision. GPT-4o is described as improving speech recognition over prior Whisper v3 performance, and as leading in audio transcription and translation relative to models such as Gemini Ultra, Claude 3 Opus, and Llama 3 400B (even though some of these are characterized as unreleased). Vision performance is framed as especially strong, surpassing prior GPT-4 Turbo vision results and outperforming competing systems in the comparisons shown.
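For reference, the Whisper baseline those speech evals compare against can be exercised through the public audio endpoint. The sketch below assumes the `whisper-1` API model, which may not correspond exactly to the Whisper v3 checkpoint used in OpenAI's comparisons, and `meeting.mp3` is a placeholder file.

```python
# The Whisper-style transcription baseline, via the public API.
# `whisper-1` is the hosted model id and may differ from the exact
# Whisper v3 weights referenced in the evals.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```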
Availability and pricing are part of the strategy: GPT-4o is said to be available in the free tier for users with an account, while Plus users get message limits described as five times higher. The live voice interaction mode is slated for Plus users first. In the API, GPT-4o is available at half the price of GPT-4 Turbo. Community reactions split between awe, with some calling it “AGI” or “magic,” and skepticism about definitions, but most agree the combination of speed, multimodality, and accessibility could rapidly reshape education and everyday assistance. The broader implication is clear: OpenAI is pushing toward more human-like interaction now, while also making the underlying capabilities cheaper and more widely accessible to accelerate adoption.
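For a concrete sense of “half the price,” a back-of-the-envelope cost comparison is sketched below. The per-token rates are assumptions based on launch-era pricing (GPT-4o at $5 per million input tokens and $15 per million output tokens, versus $10/$30 for GPT-4 Turbo); check current pricing before relying on them.

```python
# Back-of-the-envelope API cost comparison. The USD-per-million-token rates
# are launch-era assumptions, not authoritative; verify current pricing.
PRICES = {  # model: (input, output) USD per 1M tokens
    "gpt-4o":      (5.00, 15.00),   # assumed launch pricing
    "gpt-4-turbo": (10.00, 30.00),  # assumed launch pricing
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request for the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 2,000-token prompt with a 500-token reply; GPT-4o works
# out to exactly half the GPT-4 Turbo figure under these rates.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 2_000, 500):.4f}")
```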
Cornell Notes
OpenAI’s ChatGPT overhaul introduces GPT-4o as a faster, near real-time flagship model with strong multimodal abilities (text, audio, and vision). Demos highlight quicker generation than earlier GPT-4 variants and improved factual output, alongside a new “omni” interaction style that supports interruptible, expressive voice conversations. The new interface rolls out gradually, initially for ChatGPT Plus users, while GPT-4o itself is available in the free tier with an account. Practical capabilities include camera-based perception, collaborative multi-agent workflows, and tutoring-style guidance on Khan Academy that avoids directly giving answers. Accessibility and desktop assistance are also emphasized, including real-time help for blind users and meeting/screen summarization.
What makes GPT-4o different from earlier GPT-4 options in the demos and comparisons?
How does the new “omni” voice experience change interaction style?
What does the camera demo suggest about how the system handles perception and collaboration?
How is tutoring framed differently from typical Q&A?
Why do accessibility and desktop demos matter in the rollout plan?
Review Questions
- What specific latency numbers are cited for the voice interaction, and how do they support the claim of “real-time” conversation?
- In the tutoring demo, what steps does the assistant require the student to perform before arriving at the final trigonometry result?
- How does the multi-agent camera demo divide responsibilities between a vision-capable system and a non-vision system?
Key Points
1. GPT-4o is presented as a faster, near real-time flagship model with strong multimodal capability (text, audio, vision) and improved factual output in demos.
2. A new “omni” interaction style enables interruptible, expressive voice conversations, with response times as low as 232 ms and around 320 ms on average.
3. The new ChatGPT interface rolls out gradually, initially for ChatGPT Plus users, while GPT-4o itself is described as available in the free tier for account holders.
4. Camera-based features are demonstrated through snapshot-style perception and collaborative workflows in which a non-vision AI directs what the vision-capable AI should look for.
5. Practical use cases include tutoring-style guidance on Khan Academy math and quick translation/object identification via the camera.
6. Desktop and accessibility demos emphasize real-time assistance: listening to desktop audio, watching screens, summarizing meetings, and supporting blind users with scene descriptions.
7. Pricing and availability are positioned to accelerate adoption: GPT-4o is described as cheaper in the API (half the price of GPT-4 Turbo), and Plus users get higher message limits.