How Well Can GPT-4 See? And the 5 Upgrades That Are Next

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to their channel.

TL;DR

GPT-4’s vision progress is framed as moving toward robust interpretation, including medical imagery cues, humor understanding, and reading text embedded in complex visuals.

Briefing

GPT-4’s vision and multimodal upgrades are converging into a single capability stack: models that can read complex visuals (including text and diagrams), translate them into structured outputs (like code and 3D), and then interact with the real world through speech and—eventually—embodied robots. The practical significance is that “seeing” is no longer a standalone feature; it’s becoming a bridge between language and physical or simulated artifacts, from medical imagery to 3D scenes.

On images, GPT-4’s performance is framed as moving beyond earlier “tricks” and toward robust interpretation. Medical imagery is highlighted as a key test: GPT-4 can identify elements consistent with a brain tumor in complex scans, though it doesn’t deliver full diagnoses. The transcript also points to an OpenAI paper released days earlier that evaluated GPT-4 on medical questions and found strong results even without vision, but noted that adding images and graphs reduced average performance. That tension matters: multimodality can add value, but it also introduces variability that still has to be engineered around.

Vision is also presented as a reasoning capability rather than just a perception tool. GPT-4 can infer why images are funny, suggesting it can connect visual cues to human context. At the same time, face recognition is explicitly constrained for privacy reasons, though it is left uncertain whether jailbreaks could bypass that restriction.

A major milestone is text extraction from images. The transcript emphasizes GPT-4’s score on the TextVQA benchmark, where it reportedly reaches 78 versus 72 for the prior state of the art. The comparison to human performance (85 in the cited table) is used to show the gap is narrowing: GPT-4 is only a few points behind average human accuracy on reading text embedded in complex images. This matters because it turns screenshots, diagrams, and labeled visuals into machine-readable information that can be searched, summarized, or converted into other formats.
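
For a concrete sense of what a TextVQA-style query looks like in practice, below is a minimal sketch that asks a vision-capable chat model to read text out of an image via the OpenAI Python SDK. The model name, image URL, and question are illustrative placeholders, not details from the video.

```python
# Illustrative sketch: a TextVQA-style question posed to a vision-capable chat model.
# The model name, image URL, and prompt are placeholders, not details from the video.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What text is printed on the sign in this photo?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/street-sign.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```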

From there, the narrative shifts to improvements “bleeding” across modalities. The same underlying progress that enables vision also supports converting handwriting into websites and natural language into Blender code for detailed 3D models with physics. The transcript argues that the boundaries between text, images, 3D, and embodiment are dissolving, especially when language can be embedded inside a model so users can query higher-level concepts like “yellow,” “utensils,” or “electricity” from 2D-captured 3D fields (with noted failures such as recognizing “ramen”).
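
To make the “natural language into Blender code” idea concrete, here is a hand-written sketch of the kind of short bpy script a model might emit for a prompt such as “drop a sphere onto a plane with physics”; it is not output shown in the video.

```python
# Sketch of the kind of Blender (bpy) script a language model might generate
# for a prompt like "drop a sphere onto a plane with rigid-body physics".
# Run inside Blender's scripting tab; not an actual output from the video.
import bpy

# Ground plane with passive rigid-body physics (collides but does not move)
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add(type='PASSIVE')

# Sphere above the plane with active rigid-body physics, so it falls under gravity
bpy.ops.mesh.primitive_uv_sphere_add(radius=1, location=(0, 0, 6))
bpy.ops.rigidbody.object_add(type='ACTIVE')
```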

Audio and speech are treated as the next interface layer. Conformer is presented as outperforming Whisper on speech recognition error rates, with fewer mistakes in the cited chart and an invitation to test via an API link. Finally, the transcript connects these capabilities to physical robotics. It references Sam Altman’s roadmap—legal reading, medical advice, assembly-line work, and companionship—and notes OpenAI’s earlier robotics team was disbanded. Still, a $23 million investment in 1X (a humanoid-robot startup) is cited, alongside demonstrations of non-humanoid robots that can climb, balance, and operate buttons. The through-line: text, vision, audio, 3D, and embodiment are increasingly complementary, and their synergy could be the real inflection point.

Cornell Notes

GPT-4’s multimodal upgrades are converging into a system that can “see” and then act: it can interpret complex images, read text and diagrams, and translate visual or written inputs into structured outputs like code and 3D models. TextVQA results are used to quantify progress, with GPT-4 scoring 78 versus 72 for the prior state of the art, narrowing the gap to human performance (85). The transcript also notes that adding images to medical question answering can sometimes reduce average performance, even when vision helps in specific cases like tumor-related cues. Speech interfaces are advancing too, with Conformer reported to reduce recognition errors compared with Whisper. Together, these improvements suggest language is becoming a control layer across vision, audio, 3D, and—eventually—embodied robotics.

What evidence is given that GPT-4’s vision is improving beyond basic image understanding?

The transcript points to multiple concrete tasks: interpreting complex medical imagery to spot elements consistent with a brain tumor; inferring why images are funny; and reading text from images via the TextVQA benchmark. It also notes constraints like no face recognition for privacy reasons. The strongest quantified claim is TextVQA, where GPT-4 reportedly scores 78 versus 72 for the previous state of the art.

Why does the transcript emphasize TextVQA, and what does the score comparison imply?

TextVQA is framed as a direct test of reading text embedded in complex images (not clean, isolated text). The cited numbers—GPT-4 at 78 and the previous state of the art at 72—suggest meaningful progress in OCR-like understanding. The transcript also compares to human performance at 85, implying GPT-4 is close enough to be practically useful while still leaving room for improvement.

How does the transcript reconcile “vision helps” with the claim that vision can hurt medical question performance?

It cites an OpenAI medical evaluation paper where GPT-4 performed strongly on medical questions without vision, but average results dropped when images and graphs were included. The transcript then separately highlights that GPT-4 can still identify tumor-related elements in a medical image. The takeaway is that multimodality can be uneven: it may help in targeted visual interpretation while still complicating overall reasoning or evaluation when images are added broadly.

What does “text to 3D” and “language-embedded search” mean in the examples given?

The transcript describes converting natural language into Blender code to generate detailed 3D models with physics, and converting handwriting into a website. It then describes a 2D-captured dense 3D field where the model supports searching for higher-level concepts like “yellow,” “utensils,” or “electricity” using language. It also notes imperfections, such as difficulty recognizing “ramen,” indicating the concept grounding is not fully reliable yet.

How are speech recognition upgrades positioned as part of the same multimodal trend?

Speech is treated as the next interaction layer. Conformer is presented as better than Whisper, with a chart indicating fewer errors in speech recognition. The transcript encourages testing via an API link and mentions personal testing on a 12-minute transcript with only a handful of mistakes, reinforcing the practical usability angle.
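
For context on how such error-rate comparisons are typically made, here is a minimal sketch that transcribes a clip with the open-source Whisper model and scores it against a reference transcript using word error rate; the audio file and reference text are placeholders, and the Conformer API referenced in the transcript is not shown.

```python
# Minimal sketch: transcribe audio with open-source Whisper and score it with
# word error rate (WER) against a known reference transcript.
# The audio file and reference text are placeholders.
import whisper          # pip install openai-whisper
from jiwer import wer   # pip install jiwer

model = whisper.load_model("base")
result = model.transcribe("interview_clip.mp3")
hypothesis = result["text"]

reference = "the reference transcript of the same clip goes here"

# WER = (substitutions + deletions + insertions) / words in the reference.
# In practice both strings are usually normalized (lowercased, punctuation stripped) first.
print(f"Transcript: {hypothesis}")
print(f"WER: {wer(reference, hypothesis):.2%}")
```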

What role does embodiment play, and what real-world signals are cited?

Embodiment is portrayed as the physical extension of language and perception. The transcript references Sam Altman’s roadmap (legal reading, medical advice, assembly-line work, companionship) and notes OpenAI’s robotics team was disbanded. As a signal of continued momentum, it cites a $23 million investment in 1X, which builds human-like robots, and includes demonstrations of non-humanoid robots that can climb, balance, and operate buttons. The argument is that synergy across text, vision, audio, 3D, and robotics could matter more than any single capability alone.

Review Questions

  1. Which benchmark is used to quantify GPT-4’s ability to read text from complex images, and what are the reported scores for GPT-4, the prior state of the art, and humans?
  2. What medical evaluation result is described as dropping when images and graphs are included, and how does that coexist with the claim that GPT-4 can still spot tumor-related elements in a medical image?
  3. How does the transcript connect language-based control to 3D generation and concept search, and what example of failure (e.g., a specific word) is mentioned?

Key Points

  1. GPT-4’s vision progress is framed as moving toward robust interpretation, including medical imagery cues, humor understanding, and reading text embedded in complex visuals.
  2. TextVQA is used as a quantitative milestone for text-in-image understanding, with GPT-4 reported at 78 versus 72 for the prior state of the art and 85 for human performance.
  3. Multimodality can be uneven: medical question performance may drop on average when images and graphs are added, even if vision helps in specific visual tasks.
  4. Language is increasingly positioned as a control layer that can generate and query 3D content, including converting handwriting to websites and natural language into Blender code for physics-enabled 3D models.
  5. Speech interfaces are advancing in parallel, with Conformer reported to reduce recognition errors compared with Whisper.
  6. Embodiment is treated as the next frontier, supported by investment in humanoid robotics (1X) and demonstrations of non-humanoid robots that can climb, balance, and operate buttons.
  7. The central thesis is synergy: improvements across text, vision, audio, 3D, and robotics are starting to merge, which could be more transformative than any single upgrade alone.

Highlights

GPT-4’s TextVQA score is cited as 78, up from 72 for the prior state of the art, narrowing the gap to human performance (85).
A medical evaluation result is described where adding images and graphs reduced GPT-4’s average performance, even though vision can still help identify tumor-related elements in complex scans.
Language is shown as more than a captioning tool—examples include querying abstract concepts inside a 3D representation built from 2D phone captures.
Conformer is presented as outperforming Whisper on speech recognition error rates, positioning voice as a faster interface into multimodal systems.
A $23 million investment in 1X and recent robot demonstrations are used to argue that embodiment is arriving alongside text, vision, and audio advances.
