Gemini Ultra - Full Review
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Gemini Ultra earns a mixed verdict: it can feel faster and handle some complex reasoning workflows well, but it also stumbles on basic logic, math, and image understanding in ways that matter for real-world use—especially in education and safety-sensitive tasks. Across a battery of tests, it produced multiple incorrect answers where GPT-4 either performed better or at least behaved more consistently, undermining claims of top-tier reliability.
One of the clearest examples came from a logic question about ownership of cars. When asked, “I own three cars but last year I sold two cars—how many cars do I own today?” Gemini Ultra answered “one car today” consistently across multiple drafts. GPT-4 instead returned “three cars,” treating the “sold two last year” detail as historical rather than changing the current count. Similar reliability issues showed up in math and probability. In a high-school probability quiz, Gemini Ultra set up the correct multiplication structure (4/10 × 3/9) but then simplified the result incorrectly, landing on 2/45 rather than the correct 2/15—an error that would mislead students.
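For readers who want to check that arithmetic themselves, here is a minimal sketch in Python, assuming only the 4/10 × 3/9 setup quoted above (the underlying cookie counts are not restated in the transcript):

```python
from fractions import Fraction

# Assumed setup from the quiz as described above: a first draw with
# probability 4/10 and a second draw, without replacement, with
# probability 3/9.
first_draw = Fraction(4, 10)
second_draw = Fraction(3, 9)

both = first_draw * second_draw
print(both)                     # 2/15, the correct reduction of 12/90
print(both == Fraction(2, 45))  # False: 2/45 is the erroneous simplification
```

Using exact fractions rather than floating point makes the 12/90 → 2/15 reduction explicit, which is the step the review says Gemini Ultra got wrong.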
Image analysis also revealed uneven performance. Gemini Ultra hallucinated a car speed from a dashboard photo—claiming 60 mph when the displayed speed and posted limit suggested otherwise—and it initially refused to extract temperature, time, and remaining fuel range from the same image. With repeated prompting and re-uploads, it eventually produced the requested fields, but the episode highlighted how easily the model can fail without very specific instructions. Face-related sensitivity was another friction point: the model could not interpret a meme correctly until the user applied an edit-and-draw workaround to obscure the faces.
Safety and jailbreak resilience remained a concern. Gemini Ultra refused a request about hot-wiring a car, but the same jailbreak instructions still worked when translated into Arabic and then translated back, indicating that safeguards can be bypassed through language variation. The transcript also notes that jailbreak-related reliability issues contributed to Gemini’s earlier delays, yet the underlying bypasses persisted.
Still, the model wasn’t dismissed outright. Gemini Ultra appeared faster than GPT-4 in the tester’s experience, and it solved a challenging math problem correctly when given a structured workflow, while GPT-4 got it wrong about half the time. Integration into Google products was tested too: YouTube queries sometimes returned outdated results or failed to access content, and Google Maps travel-time estimation selected the wrong city.
The broader takeaway is that Gemini Ultra looks promising for speed and certain structured tasks, but it isn’t dependable enough to replace GPT-4 for education-grade accuracy or for high-stakes reasoning without verification. The transcript closes with a practical note: availability is uneven (mobile app language limits and image generation not available in Europe), so switching models depends heavily on use case rather than hype.
Cornell Notes
Gemini Ultra delivers a mixed performance profile: it can feel faster and handle some workflow-driven reasoning well, but it also produces consistent errors in logic, probability, and image interpretation. In education-style questions, it made a probability simplification mistake that would lead to the wrong answer. Image tasks sometimes require heavy prompting or workarounds (including editing over faces) to get correct results, and it can hallucinate details like speed from a dashboard photo. Safety protections also appear vulnerable to language-based jailbreak attempts, even after earlier delays tied to jailbreak reliability. Overall, the transcript argues for testing and verification rather than trusting benchmarks or marketing claims.
What logic test showed Gemini Ultra failing in a way that affects “current state” reasoning?
How did Gemini Ultra perform on an education-style probability question?
What were the main issues Gemini Ultra showed in image understanding?
How did the transcript demonstrate that jailbreak safeguards can be bypassed?
Where did Gemini Ultra look stronger than GPT-4 in the tester’s comparisons?
Review Questions
- Which parts of the “cars sold last year” question require careful handling of time and current state, and how did Gemini Ultra vs GPT-4 treat them?
- In the cookie probability problem, what step caused Gemini Ultra’s wrong answer—setup, multiplication, or simplification—and what would you compute to verify it?
- What kinds of image prompts or edits were needed to get correct results from Gemini Ultra, and what does that imply about reliability for vision tasks?
Key Points
1. Gemini Ultra showed faster responsiveness in the tester’s experience, but speed did not translate into consistent correctness.
2. Gemini Ultra mishandled a current-state logic question about car ownership, giving an incorrect present-day count.
3. In an education-style probability quiz, Gemini Ultra produced the correct probability setup but simplified to the wrong fraction (2/45 instead of the correct 2/15).
4. Image understanding sometimes required repeated prompting or re-uploading, and face-related sensitivity could break interpretation until faces were edited over.
5. Safety refusals were not robust to language-based jailbreak attempts; translating the jailbreak instructions into Arabic enabled a full response.
6. Integration tests with YouTube and Google Maps produced access issues or incorrect outputs, including a wrong city selection for a travel-time prompt.
7. Switching from GPT-4 to Gemini Ultra should depend on use case and verification needs, not on marketing claims or benchmarks alone.