OpenAI Humbles EVERYONE. This Chatbot FEELS Alive!
Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
OpenAI’s latest ChatGPT overhaul centers on GPT-4-class performance delivered in near real time, plus a new “omni” interaction style that can take in text, audio, and images while responding with human-like timing and emotion. The flagship model, GPT-4o, is positioned as faster than the prior GPT-4 and GPT-4 Turbo options, with demos showing markedly quicker generation and stronger factual recall (including an example where GPT-4o produced about 60 facts versus roughly 20 from the older GPT-4 in a slower run). It also lands in the API immediately and is described as cheaper, reinforcing the push to make the most capable model broadly usable.
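Since the model is described as landing in the API immediately, a minimal sketch of what a call might look like is shown below. It assumes the official `openai` Python SDK (v1+), the `gpt-4o` model identifier, and an API key in the `OPENAI_API_KEY` environment variable; treat it as illustrative, not a transcript of the demo.

```python
# A minimal sketch of calling the new flagship model through the chat API.
# Assumes the official `openai` Python SDK (v1+) and the "gpt-4o" model id.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # the flagship model discussed in the video
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List three facts about Roman aqueducts."},
    ],
)

print(response.choices[0].message.content)
```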
The most eye-catching change is the new voice-and-vision interface for ChatGPT, rolling out gradually over the next few weeks and initially limited to ChatGPT Plus users. In the demo, the assistant speaks with more expressive, multi-voice options (including both male and female voices), and the conversation is interruptible: it stops when the user cuts in, listens, then continues based on what was said. Timing is framed as human-adjacent, with audio responses reported as low as 232 milliseconds and around 320 milliseconds on average. The “o” in GPT-4o stands for “omni,” described as a step toward more natural human-computer interaction in which the system accepts multiple input forms and responds in a fluid back-and-forth rather than forcing rigid turn-taking.
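The native voice mode shown in the demo lives in the ChatGPT app rather than anything exposed here, but a rough text-in, audio-out loop can be approximated with OpenAI's existing text-to-speech endpoint. The sketch below assumes the pre-existing `tts-1` model and `alloy` voice, which predate GPT-4o's native audio, so it mimics only the pipeline, not the demo's latency or expressiveness.

```python
# A rough speak-back loop approximating the demo's voice replies.
# `tts-1`/`alloy` are the pre-existing TTS model and voice, used here as a
# stand-in for GPT-4o's native audio path, which is not in the public API.
from openai import OpenAI

client = OpenAI()

# 1) Get a text reply from the flagship model.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Greet me in one short sentence."}],
).choices[0].message.content

# 2) Synthesize that reply to speech with the separate TTS endpoint.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)

with open("reply.mp3", "wb") as f:
    f.write(speech.read())  # raw MP3 bytes
```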
Beyond conversation, the overhaul emphasizes real-world perception and tutoring. A multi-agent-style demo shows one AI describing what a camera sees while another AI, which cannot see, asks questions and directs the camera based on that description, creating a collaborative workflow. Another segment demonstrates a tutoring flow on Khan Academy math: the user shares a screen, the assistant asks guiding questions without directly giving away answers, and the student works through identifying the triangle's sides and applying trigonometric relationships to solve for sin(α). The same “flip on the camera” concept is used for quick, practical tasks like translating object names into Spanish.
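The tutoring segment hinges on the standard right-triangle definition of sine; the demo's specific side lengths aren't reproduced here, so the relation the student is guided toward is stated generically:

```latex
% The right-triangle identity the guided tutoring works toward
% (demo-specific side lengths omitted, since they aren't given here).
\[
  \sin(\alpha) = \frac{\text{opposite}}{\text{hypotenuse}}
\]
```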
OpenAI also expands the assistant’s reach to desktop and accessibility. A desktop app concept listens to desktop audio and watches the screen to help in real time—summarizing meetings and assisting with tasks as they happen. Accessibility demos include “Be My Eyes”-style assistance for a blind user, where the assistant describes surroundings and even helps interpret real-time cues like traffic signals and taxi movement.
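Both the screen-watching concept and the Be My Eyes-style demo reduce to the same primitive: sending a captured frame to the model and asking for a description. A minimal sketch under that assumption follows; it uses the documented base64 image input for chat completions, with `frame.png` as a placeholder for a real screenshot or camera capture (continuous real-time streaming is not shown).

```python
# A minimal "describe what you see" sketch: send one captured frame to the
# model as a base64-encoded image. `frame.png` is a placeholder for a real
# screenshot or camera capture.
import base64
from openai import OpenAI

client = OpenAI()

with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this frame for someone who cannot see it."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```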
On performance, OpenAI’s evaluations are presented as a competitive advantage across speech and vision. GPT-4o is described as improving speech recognition over prior Whisper v3 performance, and as leading in audio transcription and translation relative to models such as Gemini Ultra, Claude 3 Opus, and Llama 3 400B (even though some of these are characterized as unreleased). Vision performance is framed as especially strong, surpassing prior GPT-4 Turbo vision results and outperforming competing systems in the comparisons shown.
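For reference, the Whisper baseline those speech evals compare against can be exercised through the public audio endpoint. The sketch below assumes the `whisper-1` API model, which may not correspond exactly to the Whisper v3 checkpoint used in OpenAI's comparisons, and `meeting.mp3` is a placeholder file.

```python
# The Whisper-style transcription baseline, via the public API.
# `whisper-1` is the hosted model id and may differ from the exact
# Whisper v3 weights referenced in the evals.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```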
Availability and pricing are part of the strategy: GPT-4o is said to be available in the free tier for users with an account, while Plus users get message limits described as five times higher. The live voice interaction mode is slated for Plus users first. In the API, GPT-4o is available at half the price of GPT-4 Turbo. Community reactions split between awe, with some calling it “AGI” or “magic,” and skepticism about definitions, but most agree the combination of speed, multimodality, and accessibility could rapidly reshape education and everyday assistance. The broader implication is clear: OpenAI is pushing toward more human-like interaction now, while also making the underlying capabilities cheaper and more widely accessible to accelerate adoption.
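For a concrete sense of “half the price,” a back-of-the-envelope cost comparison is sketched below. The per-token rates are assumptions based on launch-era pricing (GPT-4o at $5 per million input tokens and $15 per million output tokens, versus $10/$30 for GPT-4 Turbo); check current pricing before relying on them.

```python
# Back-of-the-envelope API cost comparison. The USD-per-million-token rates
# are launch-era assumptions, not authoritative; verify current pricing.
PRICES = {  # model: (input, output) USD per 1M tokens
    "gpt-4o":      (5.00, 15.00),   # assumed launch pricing
    "gpt-4-turbo": (10.00, 30.00),  # assumed launch pricing
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request for the given token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 2,000-token prompt with a 500-token reply; GPT-4o works
# out to exactly half the GPT-4 Turbo figure under these rates.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 2_000, 500):.4f}")
```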
Cornell Notes
OpenAI’s ChatGPT overhaul introduces GPT-4o as a faster, near real-time flagship model with strong multimodal abilities (text, audio, and vision). Demos highlight quicker generation than earlier GPT-4 variants and improved factual output, alongside a new “omni” interaction style that supports interruptible, expressive voice conversations. The new interface rolls out gradually, initially for ChatGPT Plus users, while GPT-4o itself is available in the free tier with an account. Practical capabilities include camera-based perception, collaborative multi-agent workflows, and tutoring-style guidance on Khan Academy that avoids directly giving answers. Accessibility and desktop assistance are also emphasized, including real-time help for blind users and meeting/screen summarization.
What makes GPT-4o different from earlier GPT-4 options in the demos and comparisons?
How does the new “omni” voice experience change interaction style?
What does the camera demo suggest about how the system handles perception and collaboration?
How is tutoring framed differently from typical Q&A?
Why do accessibility and desktop demos matter in the rollout plan?
Review Questions
- What specific latency numbers are cited for the voice interaction, and how do they support the claim of “real-time” conversation?
- In the tutoring demo, what steps does the assistant require the student to perform before arriving at the final trigonometry result?
- How does the multi-agent camera demo divide responsibilities between a vision-capable system and a non-vision system?
Key Points
1. GPT-4o is presented as a faster, near real-time flagship model with strong multimodal capability (text, audio, vision) and improved factual output in demos.
2. A new “omni” interaction style enables interruptible, expressive voice conversations, with response times as low as 232 ms and around 320 ms on average.
3. The new ChatGPT interface rolls out gradually, initially for ChatGPT Plus users, while GPT-4o itself is described as available in the free tier for account holders.
4. Camera-based features are demonstrated through snapshot-style perception and collaborative workflows in which a non-vision AI directs what the vision-capable AI should look for.
5. Practical use cases include tutoring-style guidance on Khan Academy math and quick translation/object identification via the camera.
6. Desktop and accessibility demos emphasize real-time assistance: listening to desktop audio, watching screens, summarizing meetings, and supporting blind users with scene descriptions.
7. Pricing and availability are positioned to accelerate adoption: GPT-4o is described as cheaper in the API (half the price of GPT-4 Turbo), and Plus users get higher message limits.