Enter PaLM 2 (New Bard): Full Breakdown - 92 Pages Read and Gemini Before GPT 5? Google I/O
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Google’s PaLM 2 technical report and surrounding announcements position the model as a near-term rival to GPT-4—competitive on many benchmarks despite being smaller—while raising a separate, more uncomfortable question: Google is saying far less about training data, compute, and model internals than rivals, even as it markets Gemini’s “planning” and faster trajectory.
The most consequential claim is performance. Google describes PaLM 2 as “significantly smaller” than its largest PaLM model (540B parameters), yet says it “dramatically” outperforms PaLM across tasks. The report explicitly withholds exact architecture and size details, but it does provide a compute-to-parameter scaling hint: an “optimal” parameter count between 100B and 200B given estimated training FLOPs. That range lands PaLM 2 in the neighborhood of GPT-3’s scale (175B), while Google’s Bard, now powered by PaLM 2, reportedly runs about 10x faster than GPT-4 for the same prompt, an outcome consistent with fewer parameters and cheaper serving.
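That compute-to-parameter hint can be made concrete with a back-of-the-envelope sketch. Assuming the common training-cost approximation C ≈ 6·N·D FLOPs and a Chinchilla-style heuristic of roughly 20 training tokens per parameter (both assumptions come from the scaling-law literature, not from the PaLM 2 report), a FLOP budget maps to an “optimal” parameter count like this:

```python
import math

def optimal_params(flops: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a compute-optimal parameter count N from a training FLOP
    budget C, assuming C ~= 6 * N * D (forward + backward cost) and the
    Chinchilla-style heuristic D ~= tokens_per_param * N.
    Then C ~= 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * tpp))."""
    return math.sqrt(flops / (6.0 * tokens_per_param))

# Illustrative budgets that land in a 100B-200B "optimal" range:
for c in (1.2e24, 4.8e24):
    print(f"{c:.1e} FLOPs -> ~{optimal_params(c) / 1e9:.0f}B params")
# 1.2e+24 FLOPs -> ~100B params
# 4.8e+24 FLOPs -> ~200B params
```

Under these assumptions, the report’s 100B–200B range corresponds to roughly 1e24–5e24 training FLOPs; different token-per-parameter ratios would shift the estimate.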
Multilingual training emerges as PaLM 2’s differentiator. Google says it used a more multilingual and diverse pre-training mixture spanning hundreds of languages and domains (including programming and mathematics). The result, per the report’s comparisons, is that PaLM 2’s English performance is not dramatically ahead of other languages; in many cases, it does better outside English. Google even claims PaLM 2 can exceed Google Translate for certain languages and shows it passing mastery-style exams across Chinese, Japanese, Italian, French, Spanish, German, and more. That multilingual emphasis also shows up in how PaLM 2 is contrasted with GPT-4, which historically looks stronger in English.
Benchmark comparisons are also where fairness concerns surface. In the PaLM 2 vs GPT-4 head-to-head, Google uses chain-of-thought prompting with self-consistency, techniques that can boost accuracy by sampling multiple reasoning attempts and aggregating their answers. If GPT-4 was not evaluated with the same prompting strategy, the comparison could tilt in PaLM 2’s favor. Still, the report’s results suggest PaLM 2 can outperform GPT-4 on some high-school math problems, while coding performance appears weaker: Google reports a pass@1 of 37.6% for its coding-oriented PaLM 2 variant, against much higher pass@1 figures reported for GPT-4 in other work.
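To pin down the two evaluation terms above, here is a minimal sketch: self-consistency is a majority vote over several sampled chain-of-thought answers, and pass@1 is the k=1 case of the unbiased pass@k estimator introduced with the HumanEval benchmark (the function names here are illustrative, not from the report):

```python
from collections import Counter
from math import comb

def self_consistency(answers):
    """Majority vote over several sampled chain-of-thought answers:
    the final prediction is whichever answer appears most often."""
    return Counter(answers).most_common(1)[0][0]

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval): given n
    sampled completions of which c pass the tests, the probability
    that at least one of k randomly chosen samples passes."""
    if n - c < k:
        return 1.0  # too few failures for a k-sample draw to miss all passes
    return 1.0 - comb(n - c, k) / comb(n, k)

print(self_consistency(["12", "12", "7", "12", "9"]))  # "12"
print(pass_at_k(n=10, c=4, k=1))                       # 0.4
```

The fairness worry follows directly from the first function: aggregating many samples usually beats a single greedy answer, so a self-consistency score is not comparable to a single-sample score for another model.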
Beyond raw capability, the report’s risk posture stands out. Google dedicates roughly 20 pages to bias, toxicity, and misgendering, but the transcript notes a conspicuous lack of broader “AI impacts” discussion compared with OpenAI’s more extensive treatment of societal monitoring and long-horizon risks. Google’s closest parallel is an emphasis on applications such as universal translation and dubbing, which are useful but, as the deepfake connection shows, a reminder that powerful language tools can be repurposed.
All of this sets the stage for Gemini. Google says Gemini is still in training and is built for multimodality, tool/API integration, and future “memory and planning.” Planning is framed as a capability rather than a risk, even as the transcript links it to concerns about accelerated timelines and “acceleration risk.” Gemini is expected to run on TPU v5 (PaLM 2 used TPU v4), and Google’s internal reorganization under Google DeepMind is presented as a push toward faster, safer scaling. Meanwhile, medical claims, such as Med-PaLM 2 scoring around 85% on USMLE-style exam questions, illustrate the upside of rapid progress even as policy and safety debates intensify in parallel.
Cornell Notes
PaLM 2 is positioned as a smaller-but-stronger alternative to GPT-4, with Google reporting competitive performance despite withholding exact model size and architecture details. Scaling hints suggest an “optimal” parameter range of 100B–200B based on estimated training FLOPs, aligning with faster, cheaper serving (Bard reportedly ~10x faster than GPT-4 for the same prompt). PaLM 2’s standout advantage is multilingual capability: English isn’t consistently dominant, and Google claims it can beat Google Translate for some languages and pass mastery-style exams in multiple languages. Comparisons with GPT-4 rely on chain-of-thought plus self-consistency, raising questions about evaluation fairness. The report also emphasizes bias/toxicity mitigation while devoting less space to broader AI societal impact, even as Gemini’s planning and acceleration are marketed as next steps.
What evidence suggests PaLM 2 could be competitive with GPT-4 despite being smaller?
Why does multilingual training matter in the PaLM 2 results?
How might the PaLM 2 vs GPT-4 benchmark comparison be skewed?
What does the report imply about coding performance relative to GPT-4?
What stands out about Google’s safety and risk discussion?
How does Gemini fit into the capability-and-risk picture?
Review Questions
- Which parts of the PaLM 2 report provide indirect evidence about parameter count, and what range do they suggest?
- What role do chain-of-thought prompting and self-consistency play in interpreting PaLM 2 vs GPT-4 benchmark results?
- How do the transcript’s comparisons of multilingual performance differ from the typical English-dominant pattern associated with GPT-4?
Key Points
1. PaLM 2 is described as significantly smaller than the largest PaLM model (540B parameters) while still achieving competitive results on many tasks.
2. Scaling hints in the report suggest an “optimal” parameter range of 100B–200B based on estimated training FLOPs.
3. Bard powered by PaLM 2 is reported to be about 10x faster than GPT-4 for the same prompt, implying cheaper, faster serving.
4. PaLM 2’s multilingual training is presented as a core advantage: English is not consistently the strongest language, and Google claims it can beat Google Translate for some languages.
5. Benchmark comparisons with GPT-4 may be affected by evaluation choices, including chain-of-thought prompting plus self-consistency for PaLM 2.
6. Google’s safety emphasis is heavy on bias/toxicity/misgendering, while broader AI societal impact receives less space than in some rival technical reporting.
7. Gemini is positioned as a faster, multimodal successor with planning and tool integration, raising separate concerns about acceleration and long-horizon behavior.