Google's Gemini just made GPT-4 look like a baby’s toy?
Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini Ultra is claimed to outperform GPT-4 across nearly all benchmark categories, while Gemini Pro is described as underperforming GPT-4 in most situations.
Briefing
Google’s Gemini Ultra is positioned as a near-universal benchmark winner, with claims that it outperforms GPT-4 across almost every major category—while Gemini Pro lags behind GPT-4 in most tests. The practical stakes are immediate: Gemini is already being used inside Google’s Bard chatbot (via Gemini Pro), and the next wave—Nano and Pro on Google Cloud starting December 13, with Ultra/“Pro Max” later—could reshape what users expect from mainstream AI systems.
The model’s headline differentiator is multimodality: Gemini is trained to understand and generate across text, audio, images, and video. In Google’s demos, it can interpret a live video feed and respond in real time, recognizing a drawn duck, tracking an object through a “find the ball under the cup” game even after the cups are shuffled, and performing tasks like connect-the-dots. It also supports multimodal outputs, including generating images on the fly (with the video referencing Stable Diffusion) and producing audio from prompts and even from images, such as the “hair metal” music generated from an input prompt.
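To make the multimodal input concrete, here is a minimal sketch of sending an image plus a text question to a Gemini model through the google-generativeai Python SDK; the model name, API key placeholder, file name, and prompt are illustrative assumptions rather than details from the video.

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder; a real key is required

# "gemini-pro-vision" was the vision-capable model name at launch (assumption).
model = genai.GenerativeModel("gemini-pro-vision")

image = PIL.Image.open("duck_drawing.png")  # hypothetical local image

# generate_content accepts a mixed list of text and image parts.
response = model.generate_content(["What is being drawn in this picture?", image])
print(response.text)
```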
Beyond perception, Gemini is pitched as strong at logic and spatial reasoning. Examples include using two pictures to infer which car should go faster based on aerodynamics, and taking a photo of land to generate bridge blueprints, suggesting that engineering workflows could shift toward “input an image, get a structured design.” That theme extends to software as well: Google simultaneously unveiled AlphaCode 2, described as performing better than 90% of competitive programmers and as breaking down complex problems with techniques like dynamic programming.
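Since the video names dynamic programming as the decomposition technique behind AlphaCode 2’s problem solving, here is a classic, self-contained dynamic-programming example (fewest coins for a target amount); it illustrates the technique only and says nothing about AlphaCode 2’s actual internals.

```python
def min_coins(coins: list[int], amount: int) -> int:
    """Fewest coins summing to `amount`, or -1 if impossible (classic DP)."""
    # best[a] holds the fewest coins needed to reach amount a.
    best = [0] + [float("inf")] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a and best[a - c] + 1 < best[a]:
                best[a] = best[a - c] + 1
    return -1 if best[amount] == float("inf") else best[amount]


print(min_coins([1, 5, 10, 25], 63))  # 6  (25 + 25 + 10 + 1 + 1 + 1)
```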
Still, the transcript draws a sharp line between impressive demos and measurable performance. Gemini arrives in three sizes (Nano, Pro, and Ultra) with different intended use cases: Nano for on-device deployment (like Android), Pro for general use, and Ultra as the flagship. In the U.S., Bard currently uses Gemini Pro, which is described as fast and improved, but not quite matching GPT-4 after a short trial. Benchmark results are framed as the deciding factor: Gemini Pro underperforms GPT-4 in most situations, while Gemini Ultra outperforms it in nearly every category.
The most consequential benchmark contrast is massive multitask language understanding (MMLU), where Gemini Ultra is claimed to be the first model to outperform human experts on that SAT-like multiple-choice test spanning many subjects. Yet Gemini Ultra is also said to underperform GPT-4 on HellaSwag, a common-sense test that measures whether an AI can complete vague, ambiguous sentences in a way that feels human. The transcript treats that gap as a warning sign: strong reasoning and multimodal skills don’t automatically translate into everyday common sense.
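For context on how multiple-choice benchmarks like MMLU and HellaSwag are typically scored, here is a toy evaluation loop; the items and the stand-in choose_option function are invented for illustration and are not taken from either benchmark.

```python
# Each item pairs a context with candidate endings and the index of the "right" one.
items = [
    {
        "context": "She put the kettle on the stove and",
        "options": ["waited for it to boil.", "painted it blue.", "flew to the moon.", "sold the stove."],
        "answer": 0,
    },
    {
        "context": "He studied all night for the exam, so the next morning he",
        "options": ["forgot his own name.", "felt tired but prepared.", "became a kettle.", "was 300 years old."],
        "answer": 1,
    },
]


def choose_option(context: str, options: list[str]) -> int:
    """Stand-in for a model call; a real evaluation would query the model here."""
    return 0  # trivially always picks the first option


correct = sum(choose_option(it["context"], it["options"]) == it["answer"] for it in items)
print(f"accuracy = {correct / len(items):.2f}")  # 0.50 for this stand-in
```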
Finally, the technical training details emphasize scale and infrastructure. Gemini training uses Google’s newly unveiled v5 tensor processing units (TPUs), arranged in super pods of 4,096 chips and connected via optical switching for fast data transfer, with dynamic reconfiguration into 3D topologies to reduce latency. The training mix is described as broad internet data (web pages, YouTube videos, books, and scientific papers) filtered for quality, then refined with reinforcement learning from human feedback to reduce hallucinations. Availability is staged: Nano and Pro arrive on Google Cloud December 13, while Gemini Ultra/“Pro Max” is held back for additional safety testing and is not expected until next year; the transcript also mentions a target of passing the HellaSwag benchmark at 100% before release.
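As a rough illustration of the “filtered for quality” step in that training pipeline, here is a toy heuristic document filter; the thresholds and checks are invented assumptions, not Google’s actual filtering criteria.

```python
def passes_quality_filter(doc: str) -> bool:
    """Keep a document only if it clears simple heuristic checks (illustrative)."""
    words = doc.split()
    if len(words) < 50:                        # too short to be useful training text
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive content
        return False
    alpha_ratio = sum(ch.isalpha() for ch in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6                   # mostly prose rather than markup or noise


corpus = ["example document text ..."]  # stand-in for scraped pages, transcripts, books, papers
kept = [doc for doc in corpus if passes_quality_filter(doc)]
print(f"kept {len(kept)} of {len(corpus)} documents")
```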
Cornell Notes
Gemini Ultra is presented as Google’s flagship multimodal model that, according to benchmark claims, outperforms GPT-4 across nearly all categories, while Gemini Pro generally trails GPT-4. The biggest capability shift is multimodality: Gemini can understand video in real time, track objects across changing scenes, and generate outputs across modalities, including image and audio generation. Google also pairs Gemini with AlphaCode 2, which is described as outperforming most competitive programmers on complex coding tasks. Despite the strong results, the transcript flags a key weakness: Gemini Ultra is said to underperform GPT-4 on HellaSwag, a common-sense benchmark tied to how “human” completions feel. Availability is staged, with Nano and Pro arriving on Google Cloud first and Ultra later after additional safety testing.
What makes Gemini different from text-only models like GPT-4 in the transcript’s framing?
How do the three Gemini sizes (Nano, Pro, Ultra) differ in intended use and performance expectations?
Which benchmarks are highlighted as Gemini’s strengths versus its weaknesses?
What training infrastructure details are given for Gemini, and why do they matter?
How does the transcript connect Gemini’s capabilities to software engineering and coding?
What does the transcript say about availability and safety gating for Gemini Ultra?
Review Questions
- Which benchmark does the transcript describe as showing Gemini Ultra beating human experts, and what format does that benchmark use?
- What specific kind of failure does HellaSwag measure, and why does the transcript treat that as a “human-likeness” problem?
- How do the transcript’s training details (TPU super pods, optical switching, and dynamic 3D topology) support the claim that Gemini’s performance depends on scale?
Key Points
1. Gemini Ultra is claimed to outperform GPT-4 across nearly all benchmark categories, while Gemini Pro is described as underperforming GPT-4 in most situations.
2. Gemini’s core capability shift is multimodality: it can understand video in real time and generate outputs across text, images, and audio.
3. Google’s demos emphasize ongoing video understanding, including tracking an object through scene changes and performing spatial tasks like connect-the-dots.
4. The transcript flags a major benchmark gap: Gemini Ultra is said to underperform GPT-4 on HellaSwag, a common-sense completion test tied to how human-like answers feel.
5. Gemini training is described as compute- and networking-intensive, using v5 tensor processing units in 4,096-chip super pods with optical switching and dynamic 3D reconfiguration.
6. AlphaCode 2 is introduced as a parallel push in coding, described as outperforming most competitive programmers by decomposing problems with techniques like dynamic programming.
7. Gemini availability is staged: Nano and Pro on Google Cloud December 13, with Ultra/“Pro Max” delayed until next year pending additional safety testing and benchmark targets.