How Well Can GPT-4 See? And the 5 Upgrades That Are Next
Based on AI Explained’s video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
GPT-4’s vision progress is framed as moving toward robust interpretation, including medical imagery cues, humor understanding, and reading text embedded in complex visuals.
Briefing
GPT-4’s vision and multimodal upgrades are converging into a single capability stack: models that can read complex visuals (including embedded text and diagrams), translate them into structured outputs (such as code and 3D models), and then interact with the real world through speech and, eventually, embodied robots. The practical significance is that “seeing” is no longer a standalone feature; it is becoming a bridge between language and physical or simulated artifacts, from medical imagery to 3D scenes.
On images, GPT-4’s performance is framed as moving beyond earlier “tricks” and toward robust interpretation. Medical imagery is highlighted as a key test: GPT-4 can identify elements consistent with a brain tumor in complex scans, though it does not deliver full diagnoses. The transcript also points to an OpenAI paper released days earlier that evaluated GPT-4 on medical questions and found strong results even without vision, then noted that adding images and graphs reduced average performance. That tension matters: multimodality can add value, but it also introduces variability that still has to be engineered around.
Vision is also presented as a reasoning capability rather than just a perception tool: GPT-4 can infer why images are funny, suggesting it can connect visual cues to human context. At the same time, face recognition is explicitly constrained for privacy reasons, though the transcript leaves open whether jailbreaks could bypass that restriction.
A major milestone is text extraction from images. The transcript emphasizes GPT-4’s score on the TextVQA benchmark, where it reportedly reaches 78 versus 72 for the prior state of the art. The comparison to human performance (85 in the cited table) is used to show the gap is narrowing: GPT-4 is only a few points behind average human accuracy on reading text embedded in complex images. This matters because it turns screenshots, diagrams, and labeled visuals into machine-readable information that can be searched, summarized, or converted into other formats.
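To make text extraction concrete, here is a minimal sketch of sending an image to a vision-capable chat model through the OpenAI Python SDK; the model name, image URL, and prompt are illustrative assumptions, not details from the video.

```python
# pip install openai — assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Ask a vision-capable model to transcribe text embedded in an image.
# "gpt-4o" and the URL are placeholders, not the video's setup.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe every piece of text visible in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The returned transcription is plain text, which is exactly what makes screenshots, diagrams, and labeled visuals searchable and convertible downstream.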
From there, the narrative shifts to improvements that “bleed” across modalities. The same underlying progress that enables vision also supports converting handwriting into working websites and natural language into Blender code for detailed 3D models with physics. The transcript argues that the boundaries between text, images, 3D, and embodiment are dissolving, especially when language embeddings can be baked into a 3D scene reconstructed from 2D captures, letting users query higher-level concepts like “yellow,” “utensils,” or “electricity” directly in the scene (with noted failures such as “ramen”).
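To illustrate the “natural language into Blender code” step, here is a minimal sketch of the kind of bpy script such a model might emit; the falling sphere and ground plane are placeholder choices, not the video’s actual example.

```python
# Run from Blender's Scripting workspace; bpy ships with Blender, not pip.
import bpy

# Passive ground plane for the physics simulation to collide against.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add(type='PASSIVE')

# Active sphere dropped from above; the rigid-body solver animates the fall.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1, location=(0, 0, 6))
bpy.ops.rigidbody.object_add(type='ACTIVE')

# Limit playback to the first 100 frames of simulation.
bpy.context.scene.frame_end = 100
```

Pressing play after running this drops the sphere onto the plane; the point is that a language model only has to emit a script like this for text to become a physics-enabled 3D scene.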
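The language-embedded-scene idea is harder to reproduce in a few lines, but its core mechanism, matching free-text queries against visual embeddings, can be sketched in 2D with CLIP. The checkpoint and image file below are assumptions, and real language-embedded 3D fields attach such embeddings to points in a reconstructed scene rather than to a whole photo.

```python
# pip install torch transformers pillow
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen.jpg")  # placeholder capture of a scene
queries = ["yellow", "utensils", "electricity", "ramen"]

# Score each text query against the image in CLIP's shared embedding space.
inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, len(queries))

for query, score in zip(queries, logits.softmax(dim=-1)[0]):
    print(f"{query}: {score.item():.2f}")
```

Each query gets a relative score against the visual content; doing the same per point in space is what lets a user “search” a captured scene for concepts like “utensils,” and where terms like “ramen” can still fail.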
Audio and speech are treated as the next interface layer. Conformer is presented as outperforming Whisper on speech recognition, with lower error rates in the cited chart and an API link offered so readers can test it themselves; a quick way to reproduce the Whisper side of that comparison is sketched below.

Finally, the transcript connects these capabilities to physical robotics. It references Sam Altman’s roadmap (legal reading, medical advice, assembly-line work, and companionship) and notes that OpenAI’s earlier robotics team was disbanded. Still, a $23 million investment in 1X, a humanoid-robot startup, is cited, alongside demonstrations of non-humanoid robots that can climb, balance, and operate buttons. The through-line: text, vision, audio, 3D, and embodiment are increasingly complementary, and their synergy could be the real inflection point.
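Here is that Whisper baseline: a minimal transcription sketch with the open-source whisper package. The checkpoint size and audio file name are assumptions; the Conformer side, per the transcript, is tested through an API link rather than a local install.

```python
# pip install openai-whisper — also requires ffmpeg on the system path.
import whisper

# "base" is one of Whisper's smaller checkpoints; larger ones cut the error rate.
model = whisper.load_model("base")

# Transcribe a local audio file; "speech.mp3" is a placeholder.
result = model.transcribe("speech.mp3")
print(result["text"])
```

Running the same clip through both systems and comparing word error rates is the head-to-head the cited chart summarizes.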
Cornell Notes
GPT-4’s multimodal upgrades are converging into a system that can “see” and then act: it can interpret complex images, read text and diagrams, and translate visual or written inputs into structured outputs like code and 3D models. TextVQA results are used to quantify progress, with GPT-4 scoring 78 versus 72 for the prior state of the art, narrowing the gap to human performance (85). The transcript also notes that adding images to medical question answering can sometimes reduce average performance, even when vision helps in specific cases like tumor-related cues. Speech interfaces are advancing too, with Conformer reported to reduce recognition errors compared with Whisper. Together, these improvements suggest language is becoming a control layer across vision, audio, 3D, and—eventually—embodied robotics.
What evidence is given that GPT-4’s vision is improving beyond basic image understanding?
Why does the transcript emphasize TextVQA, and what does the score comparison imply?
How does the transcript reconcile “vision helps” with the claim that vision can hurt medical question performance?
What does “text to 3D” and “language-embedded search” mean in the examples given?
How are speech recognition upgrades positioned as part of the same multimodal trend?
What role does embodiment play, and what real-world signals are cited?
Review Questions
- Which benchmark is used to quantify GPT-4’s ability to read text from complex images, and what are the reported scores for GPT-4, the prior state of the art, and humans?
- What medical evaluation result is described as dropping when images and graphs are included, and how does that coexist with the claim that GPT-4 can still spot tumor-related elements in a medical image?
- How does the transcript connect language-based control to 3D generation and concept search, and what example of failure (e.g., a specific word) is mentioned?
Key Points
1. GPT-4’s vision progress is framed as moving toward robust interpretation, including medical imagery cues, humor understanding, and reading text embedded in complex visuals.
2. TextVQA is used as a quantitative milestone for text-in-image understanding, with GPT-4 reported at 78 versus 72 for the prior state of the art and 85 for human performance.
3. Multimodality can be uneven: medical question performance may drop on average when images and graphs are added, even if vision helps in specific visual tasks.
4. Language is increasingly positioned as a control layer across modalities, from converting handwriting to websites to translating natural language into Blender code for physics-enabled 3D models.
5. Speech interfaces are advancing in parallel, with Conformer reported to reduce recognition errors compared with Whisper.
6. Embodiment is treated as the next frontier, supported by investment in humanoid robotics (1X) and demonstrations of non-humanoid robots that can climb, balance, and operate buttons.
7. The central thesis is synergy: improvements across text, vision, audio, 3D, and robotics are starting to merge, which could be more transformative than any single upgrade alone.