Enter PaLM 2 (New Bard): Full Breakdown - 92 Pages Read and Gemini Before GPT 5? Google I/O
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Google’s PaLM 2 technical report and surrounding announcements position the model as a near-term rival to GPT-4—competitive on many benchmarks despite being smaller—while raising a separate, more uncomfortable question: Google is saying far less about training data, compute, and model internals than rivals, even as it markets Gemini’s “planning” and faster trajectory.
The most consequential claim is performance. Google describes PaLM 2 as “significantly smaller” than its largest PaLM model (540B parameters), yet says it “dramatically” outperforms PaLM across tasks. The report explicitly withholds exact architecture and size details, but it does provide a compute-to-parameter scaling hint: an “optimal” parameter count between 100B and 200B given estimated training FLOPs. That range lands PaLM 2 in the neighborhood of GPT-3’s scale (175B), while Google’s Bard, now powered by PaLM 2, reportedly runs about 10x faster than GPT-4 for the same prompt, an outcome consistent with fewer parameters and cheaper serving.
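That compute-to-parameter hint can be made concrete with a back-of-the-envelope sketch. Assuming the common training-cost approximation C ≈ 6·N·D FLOPs and a Chinchilla-style heuristic of roughly 20 training tokens per parameter (both assumptions come from the scaling-law literature, not from the PaLM 2 report), a FLOP budget maps to an “optimal” parameter count like this:

```python
import math

def optimal_params(flops: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a compute-optimal parameter count N from a training FLOP
    budget C, assuming C ~= 6 * N * D (forward + backward cost) and the
    Chinchilla-style heuristic D ~= tokens_per_param * N.
    Then C ~= 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * tpp))."""
    return math.sqrt(flops / (6.0 * tokens_per_param))

# Illustrative budgets that land in a 100B-200B "optimal" range:
for c in (1.2e24, 4.8e24):
    print(f"{c:.1e} FLOPs -> ~{optimal_params(c) / 1e9:.0f}B params")
# 1.2e+24 FLOPs -> ~100B params
# 4.8e+24 FLOPs -> ~200B params
```

Under these assumptions, the report’s 100B–200B range corresponds to roughly 1e24–5e24 training FLOPs; different token-per-parameter ratios would shift the estimate.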
Multilingual training emerges as PaLM 2’s differentiator. Google says it used a more multilingual and diverse pre-training mixture spanning hundreds of languages and domains (including programming and mathematics). The result, per the report’s comparisons, is that PaLM 2’s English performance is not dramatically ahead of other languages; in many cases, it does better outside English. Google even claims PaLM 2 can exceed Google Translate for certain languages and shows it passing mastery-style exams across Chinese, Japanese, Italian, French, Spanish, German, and more. That multilingual emphasis also shows up in how PaLM 2 is contrasted with GPT-4, which historically looks stronger in English.
Benchmark comparisons are also where fairness concerns surface. In the PaLM 2 vs GPT-4 head-to-head, Google uses chain-of-thought prompting with self-consistency, techniques that can boost accuracy by sampling multiple reasoning attempts and aggregating their answers. If GPT-4 was not evaluated with the same prompting strategy, the comparison could tilt in PaLM 2’s favor. Still, the report’s results suggest PaLM 2 can outperform GPT-4 on some high-school math problems, while coding performance appears weaker: Google reports a pass@1 of 37.6% for its coding-oriented PaLM 2 variant, against much higher pass@1 figures reported for GPT-4 in other work.
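To pin down the two evaluation terms above, here is a minimal sketch: self-consistency is a majority vote over several sampled chain-of-thought answers, and pass@1 is the k=1 case of the unbiased pass@k estimator introduced with the HumanEval benchmark (the function names here are illustrative, not from the report):

```python
from collections import Counter
from math import comb

def self_consistency(answers):
    """Majority vote over several sampled chain-of-thought answers:
    the final prediction is whichever answer appears most often."""
    return Counter(answers).most_common(1)[0][0]

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval): given n
    sampled completions of which c pass the tests, the probability
    that at least one of k randomly chosen samples passes."""
    if n - c < k:
        return 1.0  # too few failures for a k-sample draw to miss all passes
    return 1.0 - comb(n - c, k) / comb(n, k)

print(self_consistency(["12", "12", "7", "12", "9"]))  # "12"
print(pass_at_k(n=10, c=4, k=1))                       # 0.4
```

The fairness worry follows directly from the first function: aggregating many samples usually beats a single greedy answer, so a self-consistency score is not comparable to a single-sample score for another model.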
Beyond raw capability, the report’s risk posture stands out. Google dedicates roughly 20 pages to bias, toxicity, and misgendering, but the transcript notes a conspicuous lack of broader “AI impacts” discussion compared with OpenAI’s more extensive treatment of societal monitoring and long-horizon risks. Google’s closest parallel is an emphasis on applications such as universal translation and dubbing, which are useful but, as the deepfake connection shows, a reminder that powerful language tools can be repurposed.
All of this sets the stage for Gemini. Google says Gemini is still in training and is built for multimodality, tool/API integration, and future “memory and planning.” Planning is framed as a capability rather than a risk, even as the transcript links it to concerns about accelerated timelines and “acceleration risk.” Gemini is expected to run on TPU v5 (PaLM 2 used TPU v4), and Google’s internal reorganization under Google DeepMind is presented as a push toward faster, safer scaling. Meanwhile, medical claims, such as Med-PaLM 2 scoring around 85% on USMLE-style exam questions, illustrate the upside of rapid progress even as policy and safety debates intensify in parallel.
Cornell Notes
PaLM 2 is positioned as a smaller-but-stronger alternative to GPT-4, with Google reporting competitive performance despite withholding exact model size and architecture details. Scaling hints suggest an “optimal” parameter range of 100B–200B based on estimated training FLOPs, aligning with faster, cheaper serving (Bard reportedly ~10x faster than GPT-4 for the same prompt). PaLM 2’s standout advantage is multilingual capability: English isn’t consistently dominant, and Google claims it can beat Google Translate for some languages and pass mastery-style exams in multiple languages. Comparisons with GPT-4 rely on chain-of-thought plus self-consistency, raising questions about evaluation fairness. The report also emphasizes bias/toxicity mitigation while devoting less space to broader AI societal impact, even as Gemini’s planning and acceleration are marketed as next steps.
What evidence suggests PaLM 2 could be competitive with GPT-4 despite being smaller?
Why does multilingual training matter in the PaLM 2 results?
How might the PaLM 2 vs GPT-4 benchmark comparison be skewed?
What does the report imply about coding performance relative to GPT-4?
What stands out about Google’s safety and risk discussion?
How does Gemini fit into the capability-and-risk picture?
Review Questions
- Which parts of the PaLM 2 report provide indirect evidence about parameter count, and what range do they suggest?
- What role do chain-of-thought prompting and self-consistency play in interpreting PaLM 2 vs GPT-4 benchmark results?
- How do the transcript’s comparisons of multilingual performance differ from the typical English-dominant pattern associated with GPT-4?
Key Points
1. PaLM 2 is described as significantly smaller than the largest PaLM model (540B parameters) while still achieving competitive results on many tasks.
2. Scaling hints in the report suggest an “optimal” parameter range of 100B–200B based on estimated training FLOPs.
3. Bard powered by PaLM 2 is reported to be about 10x faster than GPT-4 for the same prompt, implying cheaper, faster serving.
4. PaLM 2’s multilingual training is presented as a core advantage: English is not consistently the strongest language, and Google claims it can beat Google Translate for some languages.
5. Benchmark comparisons with GPT-4 may be affected by evaluation choices, including chain-of-thought prompting plus self-consistency for PaLM 2.
6. Google’s safety emphasis is heavy on bias/toxicity/misgendering, while broader AI societal impact receives less space than in some rival technical reporting.
7. Gemini is positioned as a faster, multimodal successor with planning and tool integration, raising separate concerns about acceleration and long-horizon behavior.