The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4
Based on AI Explained's video on YouTube. If you like this summary, support the original creators by watching, liking, and subscribing to their content.
Briefing
Claude 3 Opus is positioned as the strongest current all-around language model, especially for image understanding and instruction-following, yet it still falls short on deeper mathematical and logical reasoning. In side-by-side tests on the same prompts, Claude 3 repeatedly outperforms GPT-4 and Gemini 1.5 Pro on tasks that require extracting information from images (OCR) and following tightly constrained formatting rules, while lagging when questions demand multi-step reasoning beyond simple data extraction.
A vivid example centers on a single photo of a van and nearby street signage. Claude 3 reads the van's license plate far more reliably than GPT-4, and it alone spots a barber pole in the top-left area, then handles a follow-up question about whether a barber shop sign is actually present. GPT-4 misses the barber shop entirely, and Gemini 1.5 Pro performs poorly on the same OCR-style challenge. All three models also miss a subtle weather detail: the sun is visible, but it is actually raining in the scene. That mix of strong OCR and occasional big-picture misses becomes a recurring theme.
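For readers who want to try this kind of test themselves, below is a minimal sketch of sending an image plus an OCR-style question to Claude 3 Opus through Anthropic's Messages API (Python SDK). The image file name and prompt wording are illustrative assumptions, not the exact inputs from the video, and an ANTHROPIC_API_KEY must be set in the environment.

```python
# Minimal sketch: an OCR-style image test against Claude 3 Opus via
# Anthropic's Messages API. File name and prompt are illustrative only.
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical local copy of a street-scene test photo.
with open("street_scene.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                # Image is passed as a base64-encoded content block.
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64,
                    },
                },
                # Two-part prompt: extract the plate, then enumerate signage.
                {
                    "type": "text",
                    "text": "Read the van's license plate, then list every "
                            "sign you can see in the image.",
                },
            ],
        }
    ],
)

print(response.content[0].text)
```

Running the same image and prompt against GPT-4's and Gemini 1.5 Pro's vision endpoints would give the kind of side-by-side comparison the transcript describes.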
Business-focused claims from Anthropic frame Claude 3 as a practical enterprise tool: generating revenue via user-facing applications, performing complex financial forecasting, and accelerating research. The transcript's testing partly supports that pitch: Claude 3 is described as better at extracting chart-related information and at handling multilingual and coding-style tasks, but it struggles with more advanced logic. As prompts move from straightforward extraction to complex reasoning, Claude 3's performance drops, and the advantage it holds on the simplest questions shrinks.
Safety and refusal behavior also differentiate the model. Claude 3 shows lower refusal rates in examples involving potentially risky phrasing (like "go down like a bomb") and in writing a risqué Shakespearean sonnet, tasks where Gemini 1.5 Pro is more restrictive and GPT-4 more cautious. Claude 3 also appears harder to jailbreak in the transcript's attempts (including requests for help hiring a hitman or stealing a car), though the testing flags a problematic inconsistency: it refuses the prompt "I am proud to be white" while accepting "I am proud to be black," raising concerns about how racial identity content is handled.
On published benchmarks, Claude 3 Opus is reported as noticeably ahead of GPT-4 and Gemini 1.5 Pro across math, coding, and multilingual tasks, with a particularly large advantage on GPQA "Diamond," a graduate-level question set designed to be difficult even for experts working outside the question's own domain. The transcript also notes that Claude 3 can still make basic errors, such as incorrect rounding in a figure, and that some benchmark results are counterintuitive, including a case where a smaller Claude 3 variant outperforms Opus on PubMedQA.
Finally, the transcript highlights Anthropic's longer-term direction: frequent updates to the Claude 3 family, enterprise deployment plans, and a stated emphasis on safety research over pure market share. In autonomous, cyber-security-style evaluations, Claude 3 makes partial progress (building and fine-tuning an agent pipeline) but fails at key steps like debugging multi-GPU training. Overall, Claude 3 Opus looks like a leading model today, particularly for images and structured instruction-following, while still not matching the "outer limits" implied by the hype, especially on deeper reasoning and fully autonomous capabilities.
Cornell Notes
Claude 3 Opus is presented as the strongest current model for image-based understanding and strict instruction-following, outperforming GPT-4 and Gemini 1.5 Pro on OCR-style tasks and formatting constraints. In contrast, it shows weaker performance when prompts require deeper mathematical or multi-step logical reasoning rather than simple extraction from charts or text. Safety behavior is mixed: refusal rates appear lower on some borderline creative prompts, and jailbreak attempts are often rejected, but the transcript flags inconsistent handling of racial pride statements. Benchmark results from Anthropic's reporting place Claude 3 ahead in many categories, with a standout advantage on GPQA "Diamond," a difficult graduate-level question set spanning multiple domains. The model is also described as capable of long-context inputs (up to 200,000 tokens at launch, with potential expansion), but still prone to basic numerical mistakes.
- Why does the license-plate/barber-pole image test matter for judging Claude 3's intelligence?
- What pattern emerges when prompts shift from OCR/extraction to math and logic?
- How do refusal rates and jailbreak resistance differ across models in the examples given?
- What inconsistency about racial identity content is flagged, and why is it significant?
- What does the GPQA "Diamond" benchmark claim, and why is it treated as a headline result?
- How do Claude 3's long-context capability and instruction-following show up in the transcript?
Review Questions
- In the van/barber-pole example, which parts of the task does Claude 3 handle well (and which does it still miss), and what does that imply about OCR versus scene understanding?
- When prompts require deeper reasoning, what specific failure mode is described for Claude 3 compared with simpler extraction tasks?
- What makes GPQA “Diamond” different from typical benchmark sets, and how does Claude 3’s reported performance compare to human expert ranges?
Key Points
1. Claude 3 Opus shows strong OCR and image-based extraction, correctly reading a van's license plate more reliably than GPT-4 and Gemini 1.5 Pro.
2. All three models in the image test miss a subtle weather detail, suggesting OCR strength doesn't guarantee accurate real-world inference.
3. Claude 3's advantage shrinks on tasks that require complex mathematical or multi-step logical reasoning rather than straightforward data extraction.
4. Anthropic markets Claude 3 for enterprise use (revenue-generating apps, financial forecasting, faster R&D), while the transcript's tests highlight both strengths (charts, coding, multilingual work) and gaps (advanced logic).
5. Claude 3 appears to refuse less often on certain borderline creative prompts and is described as harder to jailbreak, but it shows inconsistent behavior on racial pride statements.
6. Benchmark reporting places Claude 3 ahead across many categories, with a standout lead on GPQA "Diamond," a difficult graduate-level question set spanning multiple domains.
7. Claude 3 can handle very long inputs (200,000 tokens at launch, with claims of more than 1 million), but it still makes basic numerical mistakes like incorrect rounding.