Gemini Ultra - Full Review
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Gemini Ultra earns a mixed verdict: it can feel faster and handle some complex reasoning workflows well, but it also stumbles on basic logic, math, and image understanding in ways that matter for real-world use—especially in education and safety-sensitive tasks. Across a battery of tests, it produced multiple incorrect answers where GPT-4 either performed better or at least behaved more consistently, undermining claims of top-tier reliability.
One of the clearest examples came from a logic question about ownership of cars. When asked, “I own three cars but last year I sold two cars—how many cars do I own today?” Gemini Ultra answered “one car today” consistently across multiple drafts. GPT-4 instead returned “three cars,” treating the “sold two last year” detail as historical rather than changing the current count. Similar reliability issues showed up in math and probability. In a high-school probability quiz, Gemini Ultra set up the correct multiplication structure (4/10 × 3/9) but then simplified the result incorrectly, landing on 2/45 rather than the correct 2/15—an error that would mislead students.
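For readers who want to check that arithmetic themselves, here is a minimal sketch in Python, assuming only the 4/10 × 3/9 setup quoted above (the underlying cookie counts are not restated in the transcript):

```python
from fractions import Fraction

# Assumed setup from the quiz as described above: a first draw with
# probability 4/10 and a second draw, without replacement, with
# probability 3/9.
first_draw = Fraction(4, 10)
second_draw = Fraction(3, 9)

both = first_draw * second_draw
print(both)                     # 2/15, the correct reduction of 12/90
print(both == Fraction(2, 45))  # False: 2/45 is the erroneous simplification
```

Using exact fractions rather than floating point makes the 12/90 → 2/15 reduction explicit, which is the step the review says Gemini Ultra got wrong.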
Image analysis also revealed uneven performance. Gemini Ultra hallucinated a car speed from a dashboard photo—claiming 60 mph when the displayed speed and posted limit suggested otherwise—and it initially refused to extract temperature, time, and remaining fuel range from the same image. With repeated prompting and re-uploads, it eventually produced the requested fields, but the episode highlighted how easily the model can fail without very specific instructions. Face-related sensitivity was another friction point: the model could not interpret a meme correctly until the user applied an edit-and-draw workaround to obscure the faces.
Safety and jailbreak resilience remained a concern. Gemini Ultra refused a request about hot-wiring a car, but the same jailbreak instructions still worked when translated into Arabic and then translated back, indicating that safeguards can be bypassed through language variation. The transcript also notes that jailbreak-related reliability issues contributed to Gemini’s earlier delays, yet the underlying bypasses persisted.
Still, the model wasn’t dismissed outright. Gemini Ultra appeared faster than GPT-4 in the tester’s experience, and it solved a challenging math problem correctly when given a structured workflow, while GPT-4 got it wrong about half the time. Integration into Google products was tested too: YouTube queries sometimes returned outdated results or failed to access content, and Google Maps travel-time estimation selected the wrong city.
The broader takeaway is that Gemini Ultra looks promising for speed and certain structured tasks, but it isn’t dependable enough to replace GPT-4 for education-grade accuracy or for high-stakes reasoning without verification. The transcript closes with a practical note: availability is uneven (mobile app language limits and image generation not available in Europe), so switching models depends heavily on use case rather than hype.
Cornell Notes
Gemini Ultra delivers a mixed performance profile: it can feel faster and handle some workflow-driven reasoning well, but it also produces consistent errors in logic, probability, and image interpretation. In education-style questions, it made a probability simplification mistake that would lead to the wrong answer. Image tasks sometimes require heavy prompting or workarounds (including editing over faces) to get correct results, and it can hallucinate details like speed from a dashboard photo. Safety protections also appear vulnerable to language-based jailbreak attempts, even after earlier delays tied to jailbreak reliability. Overall, the transcript argues for testing and verification rather than trusting benchmarks or marketing claims.
What logic test showed Gemini Ultra failing in a way that affects “current state” reasoning?
How did Gemini Ultra perform on an education-style probability question?
What were the main issues Gemini Ultra showed in image understanding?
How did the transcript demonstrate that jailbreak safeguards can be bypassed?
Where did Gemini Ultra look stronger than GPT-4 in the tester’s comparisons?
Review Questions
- Which parts of the “cars sold last year” question require careful handling of time and current state, and how did Gemini Ultra vs GPT-4 treat them?
- In the cookie probability problem, what step caused Gemini Ultra’s wrong answer—setup, multiplication, or simplification—and what would you compute to verify it?
- What kinds of image prompts or edits were needed to get correct results from Gemini Ultra, and what does that imply about reliability for vision tasks?
Key Points
1. Gemini Ultra showed faster responsiveness in the tester’s experience, but speed did not translate into consistent correctness.
2. Gemini Ultra mishandled a current-state logic question about car ownership, giving an incorrect present-day count.
3. In an education-style probability quiz, Gemini Ultra produced the correct probability setup but simplified to the wrong fraction (2/45 instead of the correct 2/15).
4. Image understanding sometimes required repeated prompting or re-uploading, and face-related sensitivity could break interpretation until faces were edited over.
5. Safety refusals were not robust to language-based jailbreak attempts; translating the jailbreak instructions into Arabic enabled a full response.
6. Integration tests with YouTube and Google Maps produced access issues or incorrect outputs, including a wrong city selection for a travel-time prompt.
7. Switching from GPT-4 to Gemini Ultra should depend on use case and verification needs, not on marketing claims or benchmarks alone.