Testing Gemini 1.5 and a 1 Million Token Window
Based on Sam Witteveen's video on YouTube. If you find this content useful, support the original creator by watching, liking, and subscribing.
Briefing
Gemini 1.5 Pro marks a major step up for long-context AI: it pairs an updated model with a dramatically expanded context window (up to 1,048,576 tokens), making it practical to analyze large collections of text, long documents, and even hours of media in a single run. The payoff is faster time-to-insight on tasks that previously failed because relevant details were buried deep inside lengthy inputs.
Google positions Gemini 1.5 as a mixture of experts (MoE) architecture, publicly confirming what had long been rumored about other frontier models. It also rewrites the context-window story: a baseline Gemini 1.5 Pro configuration is described as supporting 128,000 tokens (roughly in line with other top-tier offerings), while a larger variant pushes to a million tokens. That scale translates into concrete use cases: about an hour of video, roughly 11 hours of audio, on the order of 30,000 lines of code, and around 700,000 words, enough to keep entire knowledge bases or large corpora “in view” during inference.
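As a rough sanity check on those equivalences, a back-of-envelope conversion lines up with the figures above. The ratios used here are common approximations, not values stated in the video:

```python
# Back-of-envelope check of the long-context equivalences above.
# Both ratios are rough heuristics (assumptions), not figures from the video.
CONTEXT_TOKENS = 1_048_576

WORDS_PER_TOKEN = 0.75       # common approximation for English prose
TOKENS_PER_CODE_LINE = 35    # assumed average tokens per line of source code

print(f"~{CONTEXT_TOKENS * WORDS_PER_TOKEN:,.0f} words")              # ~786,432
print(f"~{CONTEXT_TOKENS / TOKENS_PER_CODE_LINE:,.0f} lines of code") # ~29,959
```

Both estimates land near the “around 700,000 words” and “~30,000 lines of code” framing in the announcement.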
To test whether long-context retrieval is more than marketing, the transcript points to “needle in a haystack” evaluations—work associated with Greg Kamradt—where models struggle to locate a specific item buried in massive text. Google’s claim, as relayed here, is that Gemini can find such targeted information at token distances approaching the million-token range, outperforming many models that fail under similar conditions.
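To make the evaluation concrete, here is a minimal sketch of the needle-in-a-haystack idea: bury one known sentence at a chosen depth in filler text, then ask the model to retrieve it. The filler text, needle, helper function, and model name are illustrative assumptions, not Greg Kamradt's actual harness:

```python
# Needle-in-a-haystack sketch: hide a known "needle" sentence at a chosen
# depth inside a long filler context, then ask the model to retrieve it.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

NEEDLE = "The secret passphrase is 'blue-orchid-42'."
filler = "The sky was clear and the market was quiet that day. " * 20_000

def build_haystack(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + NEEDLE + filler[cut:]

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name
prompt = build_haystack(depth=0.5) + "\n\nWhat is the secret passphrase?"
print(model.generate_content(prompt).text)
```

Sweeping `depth` from 0.0 to 1.0 at increasing context lengths is what produces the retrieval-accuracy grids these evaluations are known for.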
Hands-on testing begins in Google AI Studio using Gemini 1.5 Pro with a large uploaded dataset: merged LangChain documentation saved as text files. The setup shows the system can accept 1,048,576 tokens total, with the run consuming about 526,000 tokens. After an inference delay (about 50–60 seconds at this scale), Gemini identifies concepts like LCEL (LangChain Expression Language) and LangGraph. It also attempts code generation from the documentation, producing a LangChain-style snippet that references Gemini Pro and includes guidance such as where a Google API key should go.
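The exact snippet Gemini produced is not reproduced in the video, but a minimal LangChain chain in the same spirit (Gemini as the LLM, with a placeholder showing where the Google API key goes) might look like this; the prompt content here is illustrative:

```python
# A simple LangChain chain using Gemini, in the spirit of the snippet
# the video describes Gemini generating from the uploaded documentation.
# Requires: pip install langchain-google-genai langchain-core
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatGoogleGenerativeAI(
    model="gemini-pro",
    google_api_key="YOUR_GOOGLE_API_KEY",  # where the API key goes
)

# Compose the chain with LCEL (LangChain Expression Language):
# prompt -> model -> output parser
prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph.")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"topic": "LangGraph"}))
```

The pipe composition is the LCEL syntax Gemini correctly identified from the documentation.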
Media understanding is then tested with uploaded videos. In a demo-style clip, Gemini produces scene-by-scene outlines, identifying a duck drawing, a country-guessing segment, a coin magic trick, rock-paper-scissors gestures, and a connect-the-dots puzzle, all generally matching the visible events. A second experiment uses the “monkey business illusion” (a selective-attention test): Gemini estimates the number of passes made by the players in white (reporting 15, against a suggested correct answer of 16), spots the gorilla that walks through the middle of the scene, and reports that the curtain changes color once. Finally, a longer Andrew Ng presentation video is processed visually rather than via audio, and Gemini extracts slide “chapters” and summarizes the slides with reasonable consistency, working through a 36-minute deck in about two minutes.
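The video runs these tests through the AI Studio UI, but the equivalent programmatic flow can be sketched with the google-generativeai File API. The file path, prompt, and model name below are assumptions for illustration:

```python
# Sketch: upload a video and ask Gemini 1.5 Pro for a scene-by-scene outline.
# The demos in the video use the AI Studio UI; this is the analogous API flow.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

video = genai.upload_file(path="presentation.mp4")  # hypothetical local file
while video.state.name == "PROCESSING":            # wait for server-side processing
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name
response = model.generate_content(
    [video, "Give a scene-by-scene outline of this video with timestamps."]
)
print(response.text)
```

Swapping the prompt for “List each slide and summarize it” would mirror the Andrew Ng slide-extraction test, which relies on the frames alone since the audio track is not used.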
The overall takeaway is a first practical look at what a million-token context enables: deeper retrieval, larger document ingestion, and more capable video/slide comprehension—though the workflow is still described as somewhat buggy, with UI rough edges likely to improve as access expands in the coming weeks.
Cornell Notes
Gemini 1.5 Pro combines an updated model with a much larger context window, up to 1,048,576 tokens. The transcript highlights two big shifts: public confirmation that Gemini 1.5 uses a mixture of experts (MoE) architecture, and a new long-context capability that supports tasks like “needle in a haystack” retrieval at very large token distances. In Google AI Studio, Gemini successfully answers questions about merged LangChain documentation using roughly half a million tokens, then generates a simple chain example. It also outlines scenes in a demo video, performs reasonably on the monkey business illusion (counting passes, spotting a gorilla, and detecting curtain color changes), and extracts slide structure from an Andrew Ng presentation video using visual information rather than audio. The result is a clearer sense of what million-token inputs unlock for real workflows.
- What are the two headline upgrades in Gemini 1.5 Pro mentioned here, and why do they matter?
- How large is the “million token” context in practical terms, according to the transcript?
- What “needle in a haystack” test is referenced, and what does Google claim Gemini 1.5 can do?
- What happened when Gemini 1.5 Pro was asked about LangChain concepts using a half-million-token input?
- How did Gemini perform on the monkey business illusion video tasks?
- What limitation is emphasized in the Andrew Ng slide extraction test?
Review Questions
- How does a mixture of experts (MoE) architecture relate to scaling claims, and what does the transcript say about Gemini 1.5’s MoE status?
- Why does a million-token context window change what kinds of retrieval tasks are feasible, based on the “needle in a haystack” discussion?
- In the Andrew Ng example, what does the transcript say Gemini can and cannot use (audio vs. visuals), and how does that affect the results?
Key Points
1. Gemini 1.5 Pro is presented as a mixture of experts (MoE) model, publicly confirming long-running rumors about MoE-style architectures.
2. Gemini 1.5 Pro supports context windows up to 1,048,576 tokens, far beyond the 128,000-token baseline described for the standard configuration.
3. The million-token window is framed in practical terms: about an hour of video, around 11 hours of audio, ~30,000 lines of code, and ~700,000 words.
4. Long-context retrieval is tested via “needle in a haystack” style evaluations, with Google claiming targeted information can be found at token distances approaching one million.
5. In Google AI Studio, Gemini 1.5 Pro handled roughly 526,000 tokens of merged LangChain documentation, answering concept questions and generating a simple chain code example after a 50–60 second inference delay.
6. Video understanding tasks included scene outlining, selective-attention counting (monkey business illusion), and slide extraction from a presentation video using visual information rather than audio.
7. The workflow is described as still somewhat buggy, with UI and usability expected to improve as broader access arrives.