Gemini Ultra 1.0 - First Impression (vs ChatGPT 4)

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini Ultra 1.0’s interface feels familiar, with features like real-time responses, extensions, image upload, microphone prompts, and archived histories.

Briefing

Google’s Gemini Ultra 1.0 arrives with a familiar, ChatGPT-like interface and strong early performance on some tasks, but it also shows clear gaps on reasoning accuracy and coding reliability. In side-by-side tests against GPT-4, Gemini often produces plausible answers quickly—yet it can be vague where precision matters and occasionally misses multi-step logic.

The first benchmark focused on a simple physics-style word problem: hanging 5 shirts to dry in the sun takes 10 hours, so how long for 10 shirts under identical conditions? Gemini Ultra 1.0 initially responded with an imprecise “similar time frame,” and even when prompted again, it still didn’t land on the exact 10-hour expectation. GPT-4, by contrast, delivered the precise result—10 hours—when given the same prompt. The difference wasn’t about speed; it was about whether the model committed to a concrete, condition-matched conclusion.
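
As a minimal sketch of the intended reasoning (illustrative, not taken from the video): shirts hung at the same time dry in parallel, so the drying time stays at 10 hours regardless of the shirt count, assuming sun and hanging space aren't constrained. The trap the question sets is linear scaling.

```python
# Minimal sketch of the intended reasoning (illustrative, not from the video):
# shirts hung at the same time dry in parallel, so drying time does not scale
# with the number of shirts as long as sun and space are unconstrained.

def drying_time_hours(num_shirts: int, batch_time_hours: float = 10.0) -> float:
    """All shirts hang at once, so the answer stays at the batch time."""
    return batch_time_hours

# The trap is linear scaling: 10 shirts is not 2 * 10 = 20 hours.
assert drying_time_hours(5) == drying_time_hours(10) == 10.0
```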

A second test probed "world modeling" and step-by-step reasoning. A ball is placed into a bag with a hole in the bottom that is bigger than the ball; the bag is carried from New York to an office, dropped into a box, sealed, and mailed to a friend in London. Gemini's initial answer claimed the ball ends up in London, with follow-up drafts repeating the same flawed logic. GPT-4 offered a more consistent chain: because the hole is larger than the ball, the ball would fall out somewhere along the way, so it would remain with the person in the office (or along the path), not with the London recipient. Gemini's failure here was not subtle: it contradicted the prompt's physical constraints.

Coding tests were more mixed. Asked for step-by-step guidance toward a snake game that runs on Windows, Gemini generated the full program quickly, but it only ran successfully after a few attempts. GPT-4 also produced code, and in this round its first-run reliability was better: where Gemini needed multiple tries, GPT-4 often succeeded immediately. A separate prompt asking Gemini to explain a Python function failed outright, with responses that either refused or fell back on generic "I'm just a language model" limitations. That earned a thumbs-down, especially since GPT-4 handled the same explanation cleanly.
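
For context, here is a sketch of the kind of program that test asks for: a snake game that runs on Windows using only Python's standard library (tkinter). The grid size, speed, and controls below are illustrative choices, not details taken from the video.

```python
# Minimal snake game sketch using only the standard library (tkinter),
# so it runs on Windows without extra installs. Illustrative, not the
# code shown in the video.
import random
import tkinter as tk

CELL, COLS, ROWS, DELAY_MS = 20, 30, 20, 120

class SnakeGame:
    def __init__(self, root: tk.Tk) -> None:
        self.canvas = tk.Canvas(root, width=COLS * CELL, height=ROWS * CELL, bg="black")
        self.canvas.pack()
        self.snake = [(5, 5), (4, 5), (3, 5)]   # head first, grid coordinates
        self.direction = (1, 0)                  # moving right
        self.food = self.place_food()
        self.alive = True
        root.bind("<Up>", lambda e: self.turn(0, -1))
        root.bind("<Down>", lambda e: self.turn(0, 1))
        root.bind("<Left>", lambda e: self.turn(-1, 0))
        root.bind("<Right>", lambda e: self.turn(1, 0))
        self.tick()

    def place_food(self) -> tuple[int, int]:
        # Pick any grid cell not occupied by the snake.
        free = [(x, y) for x in range(COLS) for y in range(ROWS) if (x, y) not in self.snake]
        return random.choice(free)

    def turn(self, dx: int, dy: int) -> None:
        # Ignore reversals straight back into the body.
        if (dx, dy) != (-self.direction[0], -self.direction[1]):
            self.direction = (dx, dy)

    def tick(self) -> None:
        head = (self.snake[0][0] + self.direction[0], self.snake[0][1] + self.direction[1])
        hit_wall = not (0 <= head[0] < COLS and 0 <= head[1] < ROWS)
        if hit_wall or head in self.snake:
            self.alive = False
        else:
            self.snake.insert(0, head)
            if head == self.food:
                self.food = self.place_food()   # grow: keep the tail
            else:
                self.snake.pop()                # move: drop the tail
        self.draw()
        if self.alive:
            self.canvas.after(DELAY_MS, self.tick)

    def draw(self) -> None:
        self.canvas.delete("all")
        for x, y in self.snake:
            self.canvas.create_rectangle(x * CELL, y * CELL, (x + 1) * CELL, (y + 1) * CELL, fill="green")
        fx, fy = self.food
        self.canvas.create_oval(fx * CELL, fy * CELL, (fx + 1) * CELL, (fy + 1) * CELL, fill="red")
        if not self.alive:
            self.canvas.create_text(COLS * CELL // 2, ROWS * CELL // 2, fill="white",
                                    text=f"Game over: score {len(self.snake) - 3}")

if __name__ == "__main__":
    root = tk.Tk()
    root.title("Snake")
    SnakeGame(root)
    root.mainloop()
```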

Image generation and multimodal features were partially available. The interface supported image upload and analysis, and an uploaded image of giveaway items (including an Nvidia-themed design) was correctly described with objects and text. However, the creator couldn’t generate images directly at the time, suggesting feature gating or incomplete rollout. Overall, Gemini Ultra 1.0 looks polished and fast, but the early evidence points to uneven reasoning and occasional coding/explanation failures—areas where GPT-4 still holds an edge in these specific tests. The creator plans to revisit later, including API access, to verify whether these issues persist once the system is fully updated and more widely tested.

Cornell Notes

Gemini Ultra 1.0 shows a polished, ChatGPT-like interface with real-time responses, extensions, and multimodal options such as image upload and analysis. In a shirts-to-dry math problem, Gemini initially stayed vague (“similar time frame”) instead of giving the exact 10-hour answer that GPT-4 produced. In a multi-step “ball in a bag with a hole” world-modeling scenario, Gemini repeatedly placed the ball in London despite the hole implying it would fall out earlier; GPT-4 reasoned the ball would remain with the person in the office. Coding results were faster at times, but reliability varied—Gemini needed multiple attempts to get a snake game running and struggled to explain a Python function, while GPT-4 handled it better. Image upload worked, but direct image creation wasn’t available yet.

Why did Gemini lose the shirts-to-dry test even though it responded quickly?

The prompt required a precise, condition-matched conclusion: 5 shirts dry in 10 hours under identical conditions, so 10 shirts should also dry in 10 hours. Gemini’s answers leaned on a vague “similar time frame” rather than committing to the exact number, while GPT-4 returned “10 hours,” making the comparison about precision, not speed.

What specific detail in the ball-and-bag scenario should force the ball to not reach London?

The bag has a hole in the bottom bigger than the ball. That means the ball falls through the hole while the bag is being carried, so it never makes it into the box that gets sealed and mailed. Gemini still concluded the ball ends up in London across multiple drafts, while GPT-4 reasoned the ball would remain with the person (e.g., on the office floor or along the path) rather than with the London recipient.

How did Gemini’s coding performance compare to GPT-4 in the snake game test?

Gemini generated a full Windows snake game quickly, but the first runs failed with errors and required multiple attempts before it worked. GPT-4’s code in the same general task was described as more likely to run correctly on the first try. The takeaway is that Gemini’s code generation speed was strong, but execution reliability lagged in this round.

What happened when Gemini was asked to explain a Python function?

When prompted to explain a Python function, Gemini returned responses that effectively refused or defaulted to generic limitations (e.g., “I’m just a language model” / inability to help). The creator suspected a context-window or early rollout issue, but the result was still a failure compared with GPT-4, which produced a working explanation.
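
As an illustration of the test setup (the exact function from the video isn't reproduced here), this kind of prompt simply pastes a short function and asks the model to explain it. The function below is hypothetical.

```python
# Hypothetical example of the kind of function pasted into such a prompt;
# it is not the function shown in the video.
def rolling_average(values, window=3):
    """Return the mean of each consecutive `window`-sized slice of `values`."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# A useful model explanation would cover the sliding window, the division by
# `window`, and the edge case where `values` has fewer items than the window
# (this version then returns an empty list).
print(rolling_average([1, 2, 3, 4, 5]))  # [2.0, 3.0, 4.0]
```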

What multimodal capability worked reliably, and what appeared unavailable?

Image upload and analysis worked: an uploaded image was correctly interpreted, including identifying items like a purple t-shirt, black socks, a white mug, and text elements such as an Nvidia-themed message. Direct image creation wasn’t available at the time, with the interface indicating the creator couldn’t create images yet, suggesting feature gating or incomplete rollout.

Review Questions

  1. In the shirts-to-dry problem, what change in Gemini’s reasoning would have produced the correct 10-hour answer?
  2. In the ball-in-bag scenario, how would you rewrite the prompt to make the hole’s effect even harder to ignore?
  3. What coding failure mode did Gemini show in the snake game test, and how did the creator’s subsequent attempts address it?

Key Points

  1. Gemini Ultra 1.0’s interface feels familiar, with features like real-time responses, extensions, image upload, microphone prompts, and archived histories.

  2. In a shirts-to-dry math question, Gemini stayed vague (“similar time frame”) instead of giving the exact 10-hour result that GPT-4 delivered.

  3. In a multi-step world-modeling scenario involving a bag with a hole larger than the ball, Gemini repeatedly placed the ball in London despite the physical implication that it should fall out earlier.

  4. Gemini’s coding generation can be fast, but Windows snake game code required multiple attempts to run successfully, while GPT-4 was described as more reliable on the first try.

  5. Gemini struggled with explaining a Python function, returning generic limitation-style responses that the creator found unacceptable.

  6. Image upload and object/text recognition worked in the test, but direct image generation appeared unavailable at the time, suggesting rollout or access limits.

Highlights

Gemini’s “similar time frame” answer failed a precision test where GPT-4 gave the exact 10-hour drying time.
Gemini repeatedly got the ball-and-bag logic wrong, claiming the ball reached London even though the hole should drop it before shipping.
Gemini generated snake game code quickly, but it didn’t run cleanly until later attempts—execution reliability lagged behind GPT-4 in this round.
Uploaded images were interpreted correctly (objects and Nvidia-themed text), while image creation itself wasn’t available during the test.

Topics

  • Gemini Ultra 1.0
  • GPT-4 Comparison
  • World Modeling
  • Windows Coding
  • Image Upload