Testing Gemini 1.5 and a 1 Million Token Window
Based on Sam Witteveen's video on YouTube. If you find this content useful, support the original creator by watching, liking, and subscribing.
Briefing
Gemini 1.5 Pro marks a major step up for long-context AI: it pairs an updated model with a dramatically expanded context window (up to 1,048,576 tokens), making it practical to analyze large collections of text, long documents, and even hours of media in a single run. The payoff is faster time-to-insight on tasks that previously failed because relevant details were buried deep inside lengthy inputs.
Google positions Gemini 1.5 as a mixture of experts (MoE) architecture, publicly confirming what had long been rumored about other frontier models. It also rewrites the context-window story: a baseline Gemini 1.5 Pro configuration is described as supporting 128,000 tokens (roughly in line with other top-tier offerings), while a larger variant pushes to a million tokens. That scale translates into concrete use cases: about an hour of video, roughly 11 hours of audio, on the order of 30,000 lines of code, and around 700,000 words, enough to keep entire knowledge bases or large corpora “in view” during inference.
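As a rough sanity check on those equivalences, a back-of-envelope conversion lines up with the figures above. The ratios used here are common approximations, not values stated in the video:

```python
# Back-of-envelope check of the long-context equivalences above.
# Both ratios are rough heuristics (assumptions), not figures from the video.
CONTEXT_TOKENS = 1_048_576

WORDS_PER_TOKEN = 0.75       # common approximation for English prose
TOKENS_PER_CODE_LINE = 35    # assumed average tokens per line of source code

print(f"~{CONTEXT_TOKENS * WORDS_PER_TOKEN:,.0f} words")              # ~786,432
print(f"~{CONTEXT_TOKENS / TOKENS_PER_CODE_LINE:,.0f} lines of code") # ~29,959
```

Both estimates land near the “around 700,000 words” and “~30,000 lines of code” framing in the announcement.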
To test whether long-context retrieval is more than marketing, the transcript points to “needle in a haystack” evaluations—work associated with Greg Kamradt—where models struggle to locate a specific item buried in massive text. Google’s claim, as relayed here, is that Gemini can find such targeted information at token distances approaching the million-token range, outperforming many models that fail under similar conditions.
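To make the evaluation concrete, here is a minimal sketch of the needle-in-a-haystack idea: bury one known sentence at a chosen depth in filler text, then ask the model to retrieve it. The filler text, needle, helper function, and model name are illustrative assumptions, not Greg Kamradt's actual harness:

```python
# Needle-in-a-haystack sketch: hide a known "needle" sentence at a chosen
# depth inside a long filler context, then ask the model to retrieve it.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

NEEDLE = "The secret passphrase is 'blue-orchid-42'."
filler = "The sky was clear and the market was quiet that day. " * 20_000

def build_haystack(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + NEEDLE + filler[cut:]

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name
prompt = build_haystack(depth=0.5) + "\n\nWhat is the secret passphrase?"
print(model.generate_content(prompt).text)
```

Sweeping `depth` from 0.0 to 1.0 at increasing context lengths is what produces the retrieval-accuracy grids these evaluations are known for.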
Hands-on testing begins in Google AI Studio using Gemini 1.5 Pro with a large uploaded dataset: merged LangChain documentation saved as text files. The setup shows the system can accept 1,048,576 tokens total, with the run consuming about 526,000 tokens. After an inference delay (about 50–60 seconds at this scale), Gemini identifies concepts like LCEL (LangChain Expression Language) and LangGraph. It also attempts code generation from the documentation, producing a LangChain-style snippet that references Gemini Pro and includes guidance such as where a Google API key should go.
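The exact snippet Gemini produced is not reproduced in the video, but a minimal LangChain chain in the same spirit (Gemini as the LLM, with a placeholder showing where the Google API key goes) might look like this; the prompt content here is illustrative:

```python
# A simple LangChain chain using Gemini, in the spirit of the snippet
# the video describes Gemini generating from the uploaded documentation.
# Requires: pip install langchain-google-genai langchain-core
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatGoogleGenerativeAI(
    model="gemini-pro",
    google_api_key="YOUR_GOOGLE_API_KEY",  # where the API key goes
)

# Compose the chain with LCEL (LangChain Expression Language):
# prompt -> model -> output parser
prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph.")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"topic": "LangGraph"}))
```

The pipe composition is the LCEL syntax Gemini correctly identified from the documentation.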
Media understanding is then tested with uploaded videos. In a demo-style clip, Gemini produces scene-by-scene outlines, identifying a duck drawing, a country-guessing segment, a coin magic trick, rock-paper-scissors gestures, and a connect-the-dots puzzle, all generally matching the visible events. A second experiment uses the “monkey business illusion” (a selective-attention test): Gemini estimates the number of passes made by the players in white (reporting 15, against a suggested correct answer of 16), spots the gorilla that walks through the middle of the scene, and reports that the curtain changes color once. Finally, a longer Andrew Ng presentation video is processed visually rather than via audio, and Gemini extracts slide “chapters” and summarizes the slides with reasonable consistency, working through a 36-minute deck in about two minutes.
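The video runs these tests through the AI Studio UI, but the equivalent programmatic flow can be sketched with the google-generativeai File API. The file path, prompt, and model name below are assumptions for illustration:

```python
# Sketch: upload a video and ask Gemini 1.5 Pro for a scene-by-scene outline.
# The demos in the video use the AI Studio UI; this is the analogous API flow.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key

video = genai.upload_file(path="presentation.mp4")  # hypothetical local file
while video.state.name == "PROCESSING":            # wait for server-side processing
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model name
response = model.generate_content(
    [video, "Give a scene-by-scene outline of this video with timestamps."]
)
print(response.text)
```

Swapping the prompt for “List each slide and summarize it” would mirror the Andrew Ng slide-extraction test, which relies on the frames alone since the audio track is not used.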
The overall takeaway is a first practical look at what a million-token context enables: deeper retrieval, larger document ingestion, and more capable video/slide comprehension—though the workflow is still described as somewhat buggy, with UI rough edges likely to improve as access expands in the coming weeks.
Cornell Notes
Gemini 1.5 Pro combines an updated model with a much larger context window, up to 1,048,576 tokens. The transcript highlights two big shifts: public confirmation that Gemini 1.5 uses a mixture of experts (MoE) architecture, and a new long-context capability that supports tasks like “needle in a haystack” retrieval at very large token distances. In Google AI Studio, Gemini successfully answers questions about merged LangChain documentation using roughly half a million tokens, then generates a simple chain example. It also outlines scenes in a demo video, performs reasonably on the monkey business illusion (counting passes, spotting a gorilla, and detecting curtain color changes), and extracts slide structure from an Andrew Ng presentation video using visual information rather than audio. The result is a clearer sense of what million-token inputs unlock for real workflows.
- What are the two headline upgrades in Gemini 1.5 Pro mentioned here, and why do they matter?
- How large is the “million token” context in practical terms, according to the transcript?
- What “needle in a haystack” test is referenced, and what does Google claim Gemini 1.5 can do?
- What happened when Gemini 1.5 Pro was asked about LangChain concepts using a half-million-token input?
- How did Gemini perform on the monkey business illusion video tasks?
- What limitation is emphasized in the Andrew Ng slide extraction test?
Review Questions
- How does a mixture of experts (MoE) architecture relate to scaling claims, and what does the transcript say about Gemini 1.5’s MoE status?
- Why does a million-token context window change what kinds of retrieval tasks are feasible, based on the “needle in a haystack” discussion?
- In the Andrew Ng example, what does the transcript say Gemini can and cannot use (audio vs. visuals), and how does that affect the results?
Key Points
1. Gemini 1.5 Pro is presented as a mixture of experts (MoE) model, publicly confirming long-running rumors about MoE-style architectures.
2. Gemini 1.5 Pro supports context windows up to 1,048,576 tokens, far beyond the 128,000-token baseline described for the standard configuration.
3. The million-token window is framed in practical terms: about an hour of video, around 11 hours of audio, ~30,000 lines of code, and ~700,000 words.
4. Long-context retrieval is tested via “needle in a haystack” style evaluations, with Google claiming targeted information can be found at token distances approaching one million.
5. In Google AI Studio, Gemini 1.5 Pro handled roughly 526,000 tokens of merged LangChain documentation, answering concept questions and generating a simple chain code example after a 50–60 second inference delay.
6. Video understanding tasks included scene outlining, selective-attention counting (monkey business illusion), and slide extraction from a presentation video using visual information rather than audio.
7. The workflow is described as still somewhat buggy, with UI and usability expected to improve as broader access arrives.