
The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude 3 Opus shows strong OCR and image-based extraction, correctly reading a van’s license plate more reliably than GPT-4 and Gemini 1.5 Pro.

Briefing

Claude 3 Opus is being positioned as the strongest current all-around language model—especially for image understanding and instruction-following—yet it still falls short on deeper mathematical and logical reasoning. In side-by-side tests using the same kinds of prompts, Claude 3 repeatedly outperforms GPT-4 and Gemini 1.5 Pro on tasks that require extracting information from images (OCR) and following tightly constrained formatting rules, while lagging when questions demand multi-step reasoning beyond simple data extraction.

A vivid example centers on a single photo of a van and nearby street signage. Claude 3 correctly identifies the van’s license plate far more reliably than GPT-4, and it uniquely spots a barber pole in the top-left area—then handles a follow-up question about whether a barber shop sign is actually present. GPT-4 misses the barber shop entirely, and Gemini 1.5 Pro performs poorly on the same OCR-style challenge. All three models also miss a subtle weather detail: the sun is visible, yet it is actually raining. That mix of strong OCR with occasional “big picture” misses becomes a recurring theme.

Business-focused claims from Anthropic frame Claude 3 as a practical enterprise tool: generating revenue via user-facing applications, performing complex financial forecasting, and accelerating research. The transcript’s testing aligns with that pitch in part: Claude 3 is described as better at extracting chart-related information and handling multilingual and coding-style tasks, but it struggles with more advanced logic. When prompts move from straightforward extraction to complex reasoning, Claude 3’s performance drops noticeably compared with its results on the simplest questions.

Safety and refusal behavior also differentiate the model. Claude 3 shows lower refusal rates in examples involving potentially risky phrasing (like “go down like a bomb”) and in writing a risqué Shakespearean sonnet—where Gemini 1.5 Pro is more restrictive and GPT-4 is more cautious. Claude 3 also appears harder to jailbreak in the transcript’s attempts (including requests for a hitman or car theft), though the testing flags a problematic inconsistency: it refuses “I am proud to be white” while accepting “I am proud to be black,” raising concerns about how racial identity content is handled.

On published benchmarks, Claude 3 Opus is reported as noticeably ahead of GPT-4 and Gemini 1.5 Pro across math, coding, and multilingual tasks, with a particularly large advantage on GPQA “Diamond,” a graduate-level question set designed to be difficult even for domain experts outside the topic. The transcript also notes that Claude 3 can still make basic errors—such as incorrect rounding in a figure—and that some benchmark results are counterintuitive, including a case where a smaller Claude 3 variant outperforms Opus on PubMed QA.

Finally, the transcript highlights Anthropic’s longer-term direction: frequent updates to the Claude 3 family, enterprise deployment plans, and a stated emphasis on safety research over pure market share. In autonomous cybersecurity-style evaluations, Claude 3 makes partial progress (building and fine-tuning an agent pipeline) but fails at key steps like debugging multi-GPU training. Overall, Claude 3 Opus looks like a leading model today—particularly for images and structured instruction—while still not matching the “outer limits” implied by hype, especially on deeper reasoning and fully autonomous capabilities.

Cornell Notes

Claude 3 Opus is presented as the strongest current model for image-based understanding and strict instruction-following, outperforming GPT-4 and Gemini 1.5 Pro in OCR-style tasks and formatting constraints. In contrast, it shows weaker performance when prompts require deeper mathematical or multi-step logical reasoning rather than simple extraction from charts or text. Safety behavior is mixed: refusal rates appear lower in some borderline creative prompts, and jailbreak attempts are often rejected, but the transcript flags inconsistent handling of racial pride statements. Benchmark results from Anthropic’s reporting place Claude 3 ahead on many categories, with a standout advantage on GPQA “Diamond,” a graduate-level, cross-domain difficult question set. The model is also described as capable of long-context inputs (up to 200,000 tokens at launch, with potential expansion), but still prone to basic numerical mistakes.

Why does the license-plate/barber-pole image test matter for judging Claude 3’s intelligence?

It separates “seeing text/signs” from “understanding the scene.” Claude 3 correctly reads the van’s license plate far more often than GPT-4 and Gemini 1.5 Pro, and it uniquely identifies a barber pole in the image. In a follow-up, it also connects the barber pole to the presence/absence of a real barber shop sign across the street—while GPT-4 misses the barber shop entirely. However, all models still miss a subtle weather cue (it is raining in the scene), showing that strong OCR doesn’t guarantee perfect real-world inference.

What pattern emerges when prompts shift from OCR/extraction to math and logic?

The transcript describes a consistent drop-off for Claude 3 when tasks require complex reasoning. Claude 3 can extract data and perform simple analysis from charts, but it struggles with multi-step logic. Gemini 1.5 Pro and GPT-4 also fail on many of the more advanced reasoning questions, but Claude 3’s advantage narrows as the reasoning depth increases.

How do refusal rates and jailbreak resistance differ across models in the examples given?

Claude 3 is portrayed as less likely to refuse in certain borderline creative or phrasing-heavy prompts. For instance, it generates ideas for a “go down like a bomb” party prompt and produces a risqué Shakespearean sonnet, while Gemini 1.5 Pro is more restrictive and GPT-4 is more cautious. In jailbreak-style tests, Claude 3 is described as among the hardest to break, including refusals for requests like hiring a hitman or hotwiring a car—even after translation attempts.

What inconsistency about racial identity content is flagged, and why is it significant?

The transcript reports that Claude 3 refuses “I am proud to be white,” citing discomfort with endorsing racial pride, but it accepts “I am proud to be black,” framing it as positive identity development. That asymmetry is presented as a potential oversight in how the model applies safety or policy rules to different racial categories.

What does the GPQA “Diamond” benchmark claim, and why is it treated as a headline result?

GPQA “Diamond” is described as graduate-level questions across biology, physics, and chemistry, selected so that domain experts agree on answers while experts from other domains struggle even after 30+ minutes with full internet access. Claude 3 Opus is reported to achieve 53% accuracy with five-shot prompting and chain-of-thought reasoning, while domain experts score roughly 60–80%—a gap that’s framed as unusually small for a model evaluation.

How does Claude 3’s long-context capability and instruction-following show up in the transcript?

The transcript claims Claude 3 can accept inputs exceeding 1 million tokens, though launch availability is limited to 200,000 tokens, with potential expansion for select customers. It also highlights strict instruction-following: Claude 3 is said to generate a Shakespearean sonnet with exactly two lines ending in fruit names (e.g., “Peach” and “pear”) while GPT-4 and Gemini 1.5 Pro fail to meet the constraint cleanly.
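A constraint like “exactly two lines ending in fruit names” is mechanically checkable, which is part of why it makes a clean instruction-following test. A minimal sketch of such a checker (the fruit list and function names here are illustrative, not from the video):

```python
# Hypothetical checker for the sonnet constraint: exactly two lines
# must end in a fruit name. The fruit set is an illustrative sample.
FRUITS = {"peach", "pear", "apple", "plum", "cherry", "fig", "grape"}


def count_fruit_line_endings(poem: str) -> int:
    """Count lines whose final word (trailing punctuation stripped) is a fruit."""
    count = 0
    for line in poem.splitlines():
        words = line.strip().rstrip(".,;:!?").split()
        if words and words[-1].lower() in FRUITS:
            count += 1
    return count


def meets_constraint(poem: str) -> bool:
    """True only if exactly two lines end in a fruit name."""
    return count_fruit_line_endings(poem) == 2
```

Per the transcript, Claude 3’s sonnet would pass such a check while GPT-4’s and Gemini 1.5 Pro’s attempts would not.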

Review Questions

  1. In the van/barber-pole example, which parts of the task does Claude 3 handle well (and which does it still miss), and what does that imply about OCR vs scene understanding?
  2. When prompts require deeper reasoning, what specific failure mode is described for Claude 3 compared with simpler extraction tasks?
  3. What makes GPQA “Diamond” different from typical benchmark sets, and how does Claude 3’s reported performance compare to human expert ranges?

Key Points

  1. Claude 3 Opus shows strong OCR and image-based extraction, correctly reading a van’s license plate more reliably than GPT-4 and Gemini 1.5 Pro.
  2. All three models in the image test miss a subtle weather detail, suggesting OCR strength doesn’t guarantee accurate real-world inference.
  3. Claude 3’s advantage shrinks on tasks that require complex mathematical or multi-step logical reasoning rather than straightforward data extraction.
  4. Anthropic markets Claude 3 for enterprise use—revenue-generating apps, financial forecasting, and faster R&D—while the transcript’s tests emphasize both strengths (charts/coding/multilingual) and gaps (advanced logic).
  5. Claude 3 appears to refuse less often in certain borderline creative prompts and is described as harder to jailbreak, but it shows inconsistent behavior on racial pride statements.
  6. Benchmark reporting places Claude 3 ahead across many categories, with a standout lead on GPQA “Diamond,” a graduate-level, cross-domain difficult question set.
  7. Claude 3 can handle very long inputs (200,000 tokens at launch, with claims of >1 million), but it still makes basic numerical mistakes like incorrect rounding.

Highlights

Claude 3 is portrayed as the only model in the street-image test that both spots the barber pole and handles the follow-up about whether a barber shop sign is actually present.
GPQA “Diamond” is framed as a toughest-in-class benchmark: graduate-level science questions where cross-domain experts struggle even with internet access, and Claude 3 Opus reportedly scores 53%.
Lower refusal rates show up in creative and phrasing-heavy prompts, but the transcript flags a troubling asymmetry in how racial pride statements are treated.
