GPT-4's Hidden Feature! We've Been Missing Out on This!
Based on MattVidPro's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to their content.
Briefing
GPT-4’s once-promised “see and explain” capability is still out there, but it has been split across different products, with major differences in how much each model is allowed to see. The core finding from these tests: Bing Chat’s built-in vision mode can interpret images in a more human-like way than the original GPT-4 demo in some cases, yet it is constrained by privacy and safety rules (notably face blurring). Meanwhile, a vision model from Nvidia can be run locally and feels less restricted, but it struggles with context and fine details.
The transcript starts by revisiting OpenAI’s early GPT-4 vision demo, in which the system could parse complex images panel by panel and even read graphs to answer math questions. That capability never landed in ChatGPT in the same form, even though ChatGPT supports file uploads. Instead, the same underlying idea appears to have been quietly integrated into Bing Chat: users can upload images directly, and the model will describe what it sees and explain jokes, objects, and scene details.
In the first comparison, Bing Chat handles a “VGA cable plugged into a smartphone charging port” meme with a generally solid explanation of the humor: the absurd mismatch between an old, bulky VGA connector and a modern phone port. But Bing also reports details that aren’t actually in the image, including misread or invented labels and connector specifics. The math/graph test shows a different failure mode: Bing appears to pull a related value from web search rather than reliably extracting the exact numbers from the chart, even though its final arithmetic result still matches.
The “ironing man on a taxi” meme produces the sharpest contrast. Bing Chat gives a longer, more structured breakdown and even references image metadata (the timestamp and location), while also making mistakes, such as misreading the straps and making wrong assumptions about how the iron is powered. Still, the transcript argues Bing’s overall interpretation is closer to what a human would offer than the earlier GPT-4 response, which the tester says contained its own inaccuracies and a hallucinated “pink filter” explanation.
For the chicken nugget “world map” meme, Bing Chat delivers an explanation that aligns with the intended joke: the caption promises awe at Earth-from-space beauty, while the image is a crude, edible substitute. The transcript rates Bing’s vision-and-explanation performance around a “B minus,” attributing the gap to safety restrictions that likely reduce what the model can reliably perceive.
Finally, the transcript introduces an Nvidia vision model that can run locally on a high-end home PC (Windows 11 or later, 64 GB of RAM, 60 GB of disk space, and an RTX 4090-class GPU). This model can describe faces without the same level of face blurring, but it is weaker at language and OCR and can miss contextual relationships, like the phone still being in its packaging. The takeaway is practical: Bing is convenient but privacy-limited; the Nvidia option is more controllable but less context-accurate; and a future where AI assistants can “see” and act around the home depends on closing these reliability and safety gaps.
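For readers who want to experiment with this kind of local image description, the sketch below shows the general shape of the workflow. The transcript doesn’t name the Nvidia model precisely, so this uses Salesforce’s openly available BLIP captioning model via Hugging Face transformers as a stand-in; the model choice, the file name, and the CUDA assumption are illustrative, not details from the video.

```python
# Minimal sketch of local image description with an open vision-language model.
# BLIP stands in for the unnamed Nvidia model; "meme.jpg" and the CUDA device
# are illustrative assumptions, not details taken from the video.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to("cuda")

image = Image.open("meme.jpg").convert("RGB")   # any local image file
inputs = processor(images=image, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Running a caption model like this illustrates the tradeoff the transcript describes: nothing leaves the machine, but output quality depends entirely on the chosen checkpoint’s OCR and contextual reasoning ability.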
Cornell Notes
Bing Chat’s vision mode appears to bring back much of the “GPT-4 can see and explain” promise, but with safety constraints like face blurring. In side-by-side tests on memes and a chart problem, Bing often explains the intended humor well (notably the chicken nugget “Earth” meme) while still making image-specific mistakes and sometimes relying on outside search rather than extracting exact chart values. The earlier GPT-4 demo is described as more capable in theory, but the transcript claims Bing’s practical interpretations can be closer to human reasoning in some cases. A separate Nvidia vision model can run locally and is less restricted about faces, yet it struggles with context and text recognition. The differences matter because they determine whether “seeing AI” can be trusted for real tasks.
Why does Bing Chat’s vision mode still make image-specific mistakes even when its overall explanation of a joke is good?
What happened in the graph/math test, and what does it imply about how the model uses visual data?
How did Bing Chat handle the “ironing man on a taxi” meme compared with the earlier GPT-4 response?
What tradeoff does the transcript highlight between Bing’s convenience and its privacy limits?
What are the practical strengths and weaknesses of the Nvidia local vision model?
Why does the transcript argue these differences matter for future AI assistants?
Review Questions
- In the chart test, how did Bing Chat arrive at the correct final number even though, according to the transcript, it didn’t extract the exact value from the chart?
- Which meme did Bing Chat explain most convincingly, and what specific contrast made the humor work?
- What privacy behavior limited Bing Chat’s ability to answer questions about a person’s appearance, and how did the Nvidia model differ?
Key Points
1. Bing Chat’s vision mode can interpret images and explain humor, but it may hallucinate or misread details that aren’t visible.
2. Bing’s chart/math performance can mix visual interpretation with external web lookup, creating correctness that may not be grounded in the provided chart.
3. The “ironing man on a taxi” test highlights both strengths (structured scene reasoning, metadata use) and weaknesses (incorrect assumptions about straps/power).
4. Face privacy controls in Bing Chat can prevent it from describing people in uploaded images due to automatic face blurring.
5. A locally runnable Nvidia vision model offers less restricted face handling, but it’s more limited in context understanding and text recognition.
6. The practical goal of “seeing AI” for home assistance depends on improving visual grounding, OCR, and safe access without over-blocking useful perception.