Gemini Ultra 1.0 - First Impression (vs ChatGPT 4)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Google’s Gemini Ultra 1.0 arrives with a familiar, ChatGPT-like interface and strong early performance on some tasks, but it also shows clear gaps on reasoning accuracy and coding reliability. In side-by-side tests against GPT-4, Gemini often produces plausible answers quickly—yet it can be vague where precision matters and occasionally misses multi-step logic.
The first benchmark focused on a simple physics-style word problem: hanging 5 shirts to dry in the sun takes 10 hours, so how long for 10 shirts under identical conditions? Because the shirts dry in parallel, the count doesn't change the elapsed time, so the expected answer is still 10 hours. Gemini Ultra 1.0 initially responded with an imprecise "similar time frame," and even when prompted again, it still didn't land on that exact answer. GPT-4, by contrast, delivered the precise result, 10 hours, when given the same prompt. The difference wasn't about speed; it was about whether the model committed to a concrete, condition-matched conclusion.
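The expected reasoning can be sketched in a few lines: drying is a parallel process, so the elapsed time is independent of the shirt count (assuming, as the problem implies, there is room to hang all the shirts at once).

```python
def drying_time(num_shirts: int, time_per_batch: float = 10.0) -> float:
    """Shirts dry in parallel under identical conditions, so the
    elapsed time does not scale with the number of shirts
    (assuming space to hang them all simultaneously)."""
    return time_per_batch

# 5 shirts and 10 shirts both take the same 10 hours.
print(drying_time(5))   # 10.0
print(drying_time(10))  # 10.0
```

The vague "similar time frame" answer hedges around exactly this point; committing to the parallel-process assumption is what yields the concrete 10-hour result.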
A second test probed “world modeling” and step-by-step reasoning. A ball is placed into a bag with a hole in the bottom, carried from New York to an office, dropped into a box, sealed, and mailed to a friend in London. Gemini’s initial answer claimed the ball ends up in London, with follow-up drafts repeating the same flawed logic. GPT-4 offered a more consistent chain: because the hole is larger than the ball, the ball would fall out before shipping, so it would remain with the person in the office (or along the path), not with the London recipient. Gemini’s failure here was not subtle—it contradicted the prompt’s physical constraints.
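The chain GPT-4 followed can be made explicit as a tiny state trace. This is an illustrative sketch of the scenario's constraint, not either model's actual output; the key branch is whether the hole is larger than the ball.

```python
def ball_location(hole_larger_than_ball: bool) -> str:
    """Trace the ball through the scenario's steps.

    If the hole in the bag's bottom is larger than the ball, the ball
    falls out while the bag is carried to the office, so it never
    enters the box that is sealed and mailed to London.
    """
    if hole_larger_than_ball:
        return "office (fell out of the bag before boxing)"
    return "London (stayed in the bag, then the sealed box)"

print(ball_location(True))
```

Gemini's "London" answer corresponds to silently taking the `False` branch, i.e., ignoring the hole stated in the prompt.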
Coding tests were more mixed. Gemini could generate a full Windows snake game quickly from a request for step-by-step guidance, and it eventually ran successfully after a few attempts. GPT-4 also produced code, but in this round Gemini's first-run reliability lagged: it needed multiple tries where GPT-4 often succeeded immediately. Another coding-related prompt, asking Gemini to explain a Python function, failed in a way that felt like a capability mismatch, with responses that either refused outright or fell back on generic "I'm just a language model" limitation messages. That prompted a thumbs-down, especially since GPT-4 handled the explanation cleanly.
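For context on what the snake-game prompt actually exercises, here is a minimal, dependency-free sketch of the core logic (grid movement, growth on food, self-collision). It illustrates the kind of state-update code such a prompt tests; it is not the code either model produced, and a real Windows version would add a rendering loop (e.g., with pygame) on top.

```python
def step(snake, direction, food):
    """Advance the snake one cell; return (new_snake, ate, collided).

    snake: list of (row, col) cells, head first.
    direction: (row_delta, col_delta) for one move.
    """
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    ate = head == food
    body = snake if ate else snake[:-1]   # tail advances unless we grow
    collided = head in body               # self-collision check
    return [head] + body, ate, collided

snake = [(2, 2), (2, 1)]                  # head at (2, 2)
snake, ate, dead = step(snake, (0, 1), food=(2, 3))
print(snake, ate, dead)  # [(2, 3), (2, 2), (2, 1)] True False
```

Getting details like the grow-vs-move tail handling right on the first attempt is exactly where first-run reliability differences between the two models showed up.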
Image generation and multimodal features were partially available. The interface supported image upload and analysis, and an uploaded image of giveaway items (including an Nvidia-themed design) was correctly described with objects and text. However, the creator couldn’t generate images directly at the time, suggesting feature gating or incomplete rollout. Overall, Gemini Ultra 1.0 looks polished and fast, but the early evidence points to uneven reasoning and occasional coding/explanation failures—areas where GPT-4 still holds an edge in these specific tests. The creator plans to revisit later, including API access, to verify whether these issues persist once the system is fully updated and more widely tested.
Cornell Notes
Gemini Ultra 1.0 shows a polished, ChatGPT-like interface with real-time responses, extensions, and multimodal options such as image upload and analysis. In a shirts-to-dry math problem, Gemini initially stayed vague (“similar time frame”) instead of giving the exact 10-hour answer that GPT-4 produced. In a multi-step “ball in a bag with a hole” world-modeling scenario, Gemini repeatedly placed the ball in London despite the hole implying it would fall out earlier; GPT-4 reasoned the ball would remain with the person in the office. Coding results were faster at times, but reliability varied—Gemini needed multiple attempts to get a snake game running and struggled to explain a Python function, while GPT-4 handled it better. Image upload worked, but direct image creation wasn’t available yet.
Why did Gemini lose the shirts-to-dry test even though it responded quickly?
What specific detail in the ball-and-bag scenario should force the ball to not reach London?
How did Gemini’s coding performance compare to GPT-4 in the snake game test?
What happened when Gemini was asked to explain a Python function?
What multimodal capability worked reliably, and what appeared unavailable?
Review Questions
- In the shirts-to-dry problem, what change in Gemini’s reasoning would have produced the correct 10-hour answer?
- In the ball-in-bag scenario, how would you rewrite the prompt to make the hole’s effect even harder to ignore?
- What coding failure mode did Gemini show in the snake game test, and how did the creator’s subsequent attempts address it?
Key Points
1. Gemini Ultra 1.0’s interface feels familiar, with features like real-time responses, extensions, image upload, microphone prompts, and archived histories.
2. In a shirts-to-dry math question, Gemini stayed vague (“similar time frame”) instead of giving the exact 10-hour result that GPT-4 delivered.
3. In a multi-step world-modeling scenario involving a bag with a hole larger than the ball, Gemini repeatedly placed the ball in London despite the physical implication that it should fall out earlier.
4. Gemini’s code generation can be fast, but its Windows snake game required multiple attempts to run successfully, while GPT-4 was described as more reliable on the first try.
5. Gemini struggled with explaining a Python function, returning generic limitation-style responses that the creator found unacceptable.
6. Image upload and object/text recognition worked in the test, but direct image generation appeared unavailable at the time, suggesting rollout or access limits.