
GPT 4's Hidden Feature! We've been Missing Out on This!

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Bing Chat’s vision mode can interpret images and explain humor, but it may hallucinate or misread details that aren’t visible.

Briefing

GPT-4’s once-promised “see and explain” capability is still out there, but it has been split across different products, with major differences in how much each model is allowed to see. The core finding from these tests: Bing Chat’s built-in vision mode can interpret images in a more human-like way than the original GPT-4 demo in some cases, yet it’s constrained by privacy and safety rules (notably face blurring). Meanwhile, an Nvidia vision model can be run locally and feels less restricted, but it struggles with context and fine details.

The transcript starts by revisiting OpenAI’s early GPT-4 vision demo, where the system could parse complex images panel-by-panel and even read graphs to answer math questions. That capability never landed in ChatGPT in the same form, despite file uploads. Instead, the same underlying idea appears to have been quietly integrated into Bing Chat: users can upload images directly, and the model will describe what it sees and explain jokes, objects, and scene details.

In the first comparison, Bing Chat handles a “VGA cable plugged into a smartphone charging port” meme with a generally solid explanation of the humor—absurdity from the mismatch between an old, bulky VGA connector and a modern phone port. But Bing also produces errors that don’t appear in the image, including misread or invented details about labels and connector specifics. The math/graph test shows a different failure mode: Bing appears to pull a related value from web search rather than reliably extracting the exact numbers from the chart, even though the final arithmetic result still matches.

The “ironing man on a taxi” meme produces the sharpest contrast. Bing Chat gives a longer, more structured breakdown and even references image metadata (the timestamp and location), while also making mistakes, such as incorrect assumptions about the straps and the iron’s power source. Still, the transcript argues Bing’s overall interpretation is closer to what a human would do than the earlier GPT-4 response, which the tester says contained its own inaccuracies and a hallucinated “pink filter” explanation.

For the chicken nugget “world map” meme, Bing Chat delivers an explanation that aligns with the intended joke: the caption promises awe at Earth-from-space beauty, while the image is a crude, edible substitute. The transcript rates Bing’s vision-and-explanation performance around a “B minus,” attributing the gap to safety restrictions that likely reduce what the model can reliably perceive.

Finally, the transcript introduces an Nvidia vision model that can run on a home PC with high-end requirements (Windows 11+, 64GB RAM, 60GB disk, and an RTX 4090-class GPU). This model can describe faces without the same level of face blurring, but it’s more limited in language/OCR and can miss contextual relationships—like the phone still being in its packaging. The takeaway is practical: Bing is convenient but privacy-limited; the Nvidia option is more controllable but less context-accurate; and a future where AI assistants can “see” and act around the home depends on closing these reliability and safety gaps.

Cornell Notes

Bing Chat’s vision mode appears to bring back much of the “GPT-4 can see and explain” promise, but with safety constraints like face blurring. In side-by-side tests on memes and a chart problem, Bing often explains the intended humor well (notably the chicken nugget “Earth” meme) while still making image-specific mistakes and sometimes relying on outside search rather than extracting exact chart values. The earlier GPT-4 demo is described as more capable in theory, but the transcript claims Bing’s practical interpretations can be closer to human reasoning in some cases. A separate Nvidia vision model can run locally and is less restricted about faces, yet it struggles with context and text recognition. The differences matter because they determine whether “seeing AI” can be trusted for real tasks.

Why does Bing Chat’s vision mode sometimes fail even when it gives a good explanation?

Bing can correctly identify the intended joke in an image, but it may still hallucinate details that aren’t present. In the VGA/phone meme, Bing explains the humor (old VGA connector vs modern phone port) yet invents or misreads specifics like labels and connector pin layout. That pattern suggests the model is constrained by safety/guardrails and may not reliably ground every detail to pixels, even when the overall interpretation is right.

What happened in the graph/math test, and what does it imply about how the model uses visual data?

For the chart question, Bing produced the correct final sum but appears to have sourced one of the needed values (Western Asia meat consumption) from web search rather than extracting it directly from the provided chart. The transcript notes the mismatch: it “looked up” a related number and still landed on the same arithmetic result. That implies the system may blend visual reading with external retrieval, which can be risky when exact chart values matter.

How did Bing Chat handle the “ironing man on a taxi” meme compared with the earlier GPT-4 response?

Bing offered a more detailed, human-like breakdown, addressing whether the man is strapped to the taxi and how the ironing board and iron might be powered, and it even referenced metadata indicating the image was taken at 11:30 a.m. in New York City. The transcript also credits Bing with avoiding a specific hallucination from the earlier GPT-4 answer about a pink color filter, though Bing still made its own mistakes (e.g., incorrect assumptions about the straps and power source).

What tradeoff does the transcript highlight between Bing’s convenience and its privacy limits?

Bing can analyze uploaded images by drag-and-drop, but it applies a privacy blur that hides faces. When the tester uploaded a personal photo and asked what the person looked like, Bing refused because the face was blurred. The transcript frames this as a major limitation for anyone who wants unrestricted face understanding.

What are the practical strengths and weaknesses of the Nvidia local vision model?

The Nvidia model can run on a home machine with demanding hardware (Windows 11+, 64GB RAM, 60GB disk, RTX 4090-class GPU). It can describe faces without the same face-blurring behavior, and it can handle some image understanding, like the chicken nugget world map meme. But it struggles with contextual relationships (e.g., noticing that a phone being “charged” with a blue USB cable is still in its package), and it’s more limited at reading text (the transcript says it failed to find the “SDXL” text in an image).

Why does the transcript argue these differences matter for future AI assistants?

The envisioned future is AI bots that can recognize objects and actions around the home—fetching items, doing chores, and generating meal ideas from a pantry photo. Those use cases depend on reliable visual grounding and safe-but-functional perception. If models hallucinate details, misread charts, or can’t interpret faces/text when needed, real-world assistance becomes less trustworthy.

Review Questions

  1. In the chart test, how did Bing Chat still arrive at the correct final number despite the transcript claiming it didn’t extract the exact chart value?
  2. Which meme did Bing Chat explain most convincingly, and what specific contrast made the humor work?
  3. What privacy behavior limited Bing Chat’s ability to answer questions about a person’s appearance, and how did the Nvidia model differ?

Key Points

  1. Bing Chat’s vision mode can interpret images and explain humor, but it may hallucinate or misread details that aren’t visible.

  2. Bing’s chart/math performance can mix visual interpretation with external web lookup, creating correctness that may not be grounded in the provided chart.

  3. The “ironing man on a taxi” test highlights both strengths (structured scene reasoning, metadata use) and weaknesses (incorrect assumptions about straps and power).

  4. Face privacy controls in Bing Chat can prevent it from describing people in uploaded images due to automatic face blurring.

  5. A locally runnable Nvidia vision model offers less restricted face handling, but it’s more limited in context understanding and text recognition.

  6. The practical goal of “seeing AI” for home assistance depends on improving visual grounding, OCR, and safe access without over-blocking useful perception.

Highlights

Bing Chat often nails the intended joke structure—especially the chicken nugget “Earth from space” meme—by matching caption expectation to the image’s absurd reality.
Even when Bing produces a correct math answer, the transcript suggests it may not reliably extract numbers from the chart itself.
Bing’s vision includes metadata awareness (timestamp/location), but it still makes scene-level mistakes about how objects relate.
A local Nvidia vision model can describe faces without the same blurring, yet it struggles with contextual correlations and OCR.

Topics

  • GPT-4 Vision
  • Bing Chat Vision
  • Local Vision Model
  • Image Hallucinations
  • Privacy Blurring
