
Google gives their AI Chatbot VISION! Any Good?

MattVidPro · 4 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Bard now accepts uploaded images alongside text, enabling multimodal conversations.

Briefing

Google’s Bard has added image understanding to its chat experience, turning it into a multimodal assistant that can interpret uploaded pictures alongside text. The update also rolls out a cluster of “catch-up” features—more languages, voice output via text-to-speech, conversation sharing, and ways to export code—while the headline change is Bard’s new ability to “see” images and respond to what’s in them.

In the transcript, Bard’s new image capability is tested with several examples. When given a 3D-rendered lemon wearing virtual reality glasses, Bard produces a detailed description of the scene and then offers plausible interpretations about virtual reality and alternate worlds. The tester pushes back on factual accuracy, noting Bard can hallucinate details—such as inventing context about who created the image—though it still lands close enough to the intended subject matter. The workflow also includes response controls: Bard can generate shorter or longer answers and adjust tone toward “more professional” or “more casual,” with the modified response replacing the prior draft while keeping access to earlier iterations.

Bard’s image understanding appears stronger when the task is straightforward. With a photo of a dog, Bard correctly identifies the breed as a shih tzu and repeats the correct guess across multiple attempts, suggesting it can reliably extract visual cues like coat color and facial features. Bard also handles creative prompts tied to the image, such as writing an A-B-B poem about the dog, and the text-to-speech voice output is described as serviceable—better than annoying, though not on par with top-tier voice tools.

Where the system shows limits is in interpreting complex context and relationships in memes. In several meme tests, Bard reads the visible text accurately but repeatedly misses the intended meaning or misidentifies who is who—confusing “Google Bard” with other characters in the joke and producing explanations that don’t match the punchline. Even when it gets the gist of what’s happening (like discomfort from a straw placement meme), it struggles with the deeper relational logic that makes the humor work.

Overall, the update positions Bard as a credible image-capable assistant for tasks involving description, basic visual classification, and reading text in images. But the transcript draws a clear line between competent “what’s in the picture” recognition and weaker “why it matters” reasoning—especially when images require nuanced interpretation. The update is also framed as part of a broader competitive race with OpenAI’s GPT-4 vision and other models such as Claude 2, with Bard’s new multimodal features presented as a meaningful step forward, even if it still lags in higher-level understanding.

Cornell Notes

Bard’s latest update adds multimodal capability: it can take images uploaded alongside text and respond based on what it sees. The transcript tests this with a mix of straightforward and tricky examples. Bard reliably describes scenes and can correctly identify a dog breed (shih tzu) from a photo, and it can read text embedded in images. It also offers practical controls like shortening/lengthening responses and tone adjustments, plus voice output through text-to-speech. The main weakness appears in nuanced interpretation—Bard may read meme text accurately but often misses the intended relationships and punchlines, showing that image recognition is stronger than contextual reasoning.

What new capability is treated as the biggest change in Bard, and why does it matter?

Bard now supports image input alongside text, effectively giving it “eyes” for multimodal conversations. That matters because it shifts Bard from text-only assistance to tasks like describing images, extracting embedded text, and making sense of visual context—capabilities that users had been waiting for after similar vision features appeared in OpenAI’s GPT-4.

How does Bard perform on image description versus factual accuracy?

In the lemon-with-VR-glasses example, Bard produces a detailed description (a 3D-rendered lemon wearing black VR glasses, lens reflections, and a small label that it misreads). The transcript also highlights hallucination risk: Bard invents incorrect claims about the image’s creator and other details, even when the overall description is close to the intended subject.

What evidence suggests Bard can handle some visual classification tasks well?

With a dog photo, Bard correctly identifies the breed as a shih tzu and repeats the correct guess across multiple attempts. It also provides a coherent description of visible traits (brown and white coloring, facial features) and then generates a creative output (an A-B-B poem) based on the image.

What kinds of image tasks cause Bard to struggle most?

Meme interpretation is the weak spot. Bard often reads the meme text correctly but misses the intended meaning, confusing characters/roles in the joke and producing explanations that don’t match the punchline. The transcript shows repeated failures in understanding relationships and context beyond the literal text.

What user-facing controls and features accompany the image update?

Bard adds response modification options—shorter/longer and tone shifts like more professional or more casual—where the modified response replaces the prior draft while earlier drafts remain accessible. Other additions mentioned include voice output via text-to-speech, recent/pinned threads, sharing conversations, and exporting Python code to Replit.

How is the voice output quality described?

Text-to-speech is described as “serviceable”: not annoying and reasonably clear, but not comparable to premium voice tools like ElevenLabs.

Review Questions

  1. In what situations does Bard’s image understanding look reliable, and what specific examples in the transcript support that?
  2. Why might Bard read meme text accurately yet still fail to capture the joke’s intent?
  3. How do the response modification controls (shorter/longer/tone) change the user’s workflow when working with image-based prompts?

Key Points

  1. Bard now accepts uploaded images alongside text, enabling multimodal conversations.

  2. The update also adds voice output (text-to-speech), more languages, conversation sharing, and pinned/recent threads.

  3. Bard’s image descriptions can be detailed, but it can still hallucinate incorrect facts about context or authorship.

  4. Visual classification appears stronger than nuanced interpretation: the dog breed identification (shih tzu) is consistently correct in the tests.

  5. Bard reads text in images effectively, but meme reasoning often fails when the joke depends on relationships and context beyond the literal words.

  6. Response controls let users adjust length and tone, with modified answers replacing the prior draft while preserving earlier iterations.

  7. The transcript frames Bard’s progress as meaningful but still behind top competitors for higher-level reasoning and more advanced multimodal understanding.

Highlights

Bard’s most consequential update is image input: upload a picture, then ask questions or prompt creative outputs based on what’s shown.
The system can correctly identify a shih tzu from a photo and generate an A-B-B poem from the image content.
Meme tests show a recurring pattern: accurate text reading paired with frequent failures to understand the intended punchline or character relationships.
