
Bad AI Predictions: Bard Upgrade, 2 Years to AI Auto-Money, OpenAI Investigation and more

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

PaLM 2-based Bard is presented as delivering more human-like translation speech quality, including for languages like Swahili, and as outperforming Google Translate on quality.

Briefing

AI progress is moving faster than major forecasts from just a few years ago—especially in translation quality, image understanding, and reading comprehension—while predictions about “AI auto-money” and AGI timelines have compressed dramatically. The through-line is that tasks once labeled unsolved are now close enough to be useful, and that shift is happening on a timescale forecasters didn’t anticipate.

The update begins with a 2021 book, “A Brief History of AI,” which lists many capabilities as not yet solved and even admits uncertainty about how to make computers perform certain tasks. That pessimism is contrasted with recent demonstrations using Bard’s underlying PaLM 2 model. In translation, Bard produces more human-sounding text-to-speech even for Swahili, and the claim is that PaLM 2 outperforms Google Translate on quality. In multimodal reasoning, Bard is given a meme and returns an interpretation that recognizes the image as a pizza shaped like the Death Star, connects the toppings to the reference, and even explains the humor by contrasting “death and destruction” with “food and enjoyment.” Bard also incorporates Google Lens for real-world assistance, identifying objects on walks, though it can refuse to answer when a human face appears in the frame. A more striking anecdote is that an image taken in a local park was sometimes recognized as the park’s location, even when the park wasn’t widely known.

The transcript then pivots to benchmark-style evidence. A 2021 forecast for a math dataset predicted scores rising from 21 (2023) to 52 (2025), with a projected 80 only in 2028. Current performance is described as already near that mark: GPT-4 is said to reach 78% on the dataset without code interpreter or Wolfram Alpha, and experiments using GPT-4 with code interpreter push results to around 86%. The same pattern appears in language tasks. A 112-page novel generated by GPT-4 is presented as “interesting” even if not human-level, and the claim is that fine-tuning on an author’s work could bring models close to producing convincing full-length stories. Claude 2 is used to refine vocabulary in the GPT-4 novel, replacing generic phrasing with more vivid terms like “crystalline,” “ethereal,” and “inaugurable.” Reading comprehension is reinforced with a GRE verbal benchmark: GPT-4 is described as scoring at the 99th percentile, and the narrator reports personal practice results.

Finally, the transcript tackles the biggest forecast whiplash: AGI timelines. In 2021, predictions placed AGI in the late 2030s or early 2040s, but now estimates are pulled forward to 2026. Mustafa Suleyman of Inflection AI is cited as pushing even further, suggesting an AI could be built to make one million dollars in as little as two years, handling strategy and product research itself with only limited human approval. That money-making scenario is framed as potentially transformative for the global economy.

Regulation and deployment risk enter via an FTC investigation into OpenAI, described as a detailed document focused on internal communications about hallucinations and privacy risks. The transcript suggests that if penalties follow, companies may become more cautious about publicly releasing models. It closes with the competitive acceleration of funding, with Elon Musk’s xAI reportedly offering up to $200 million in signing bonuses, and a nod to I, Robot’s question of whether machines can create art, now with a more optimistic answer.

Cornell Notes

Recent AI capability gains are outpacing forecasts made in 2021, with improvements showing up in translation, image/meme interpretation, math benchmarks, and reading comprehension. PaLM 2-based Bard is described as producing more human-like text-to-speech for languages like Swahili and as interpreting images in ways that connect visual details to references and humor. On math, a benchmark forecast predicted a score of 80 only in 2028, yet GPT-4 is claimed to already reach 78% (and about 86% with code interpreter). Story and comprehension tests also look close: a GPT-4-generated novel is treated as plausibly “interesting,” Claude 2 can make vocabulary less generic, and GPT-4’s GRE verbal performance is cited as near the top percentile. The stakes rise with compressed AGI timelines and “AI auto-money” predictions, alongside regulatory pressure from an FTC investigation into OpenAI.

How does the transcript use translation and speech quality to argue that AI progress is faster than expected?

It contrasts a 2021 claim that many tasks are still unsolved with recent Bard demonstrations tied to PaLM 2. The example isn’t just translated text: it’s the text-to-speech reading of a poem after translation into Spanish and even Swahili. The transcript emphasizes that the speech sounds “human-like” and claims PaLM 2 improves translation quality over Google Translate, echoing earlier coverage that PaLM 2 beats both its predecessor and Google Translate on quality.

What evidence is offered that multimodal models can interpret images beyond basic labeling?

Bard is given a meme featuring a pizza shaped like the Death Star. The model is said to identify the pizza despite the unusual form, infer that the toppings contribute to the Death Star resemblance, and read the text in the meme. It then explains the humor by linking the Death Star’s symbolism (“death and destruction”) to the pizza’s association with food and enjoyment. The transcript also notes practical multimodal behavior via Google Lens integration—object identification on walks—while warning that human faces can trigger refusal to answer.

How does the transcript challenge a 2021 math forecast with benchmark numbers?

A 2021 forecast predicted a math dataset score of 21 in 2023, 52 in 2025, and 80 only in 2028. The transcript claims GPT-4 already achieves 78% on the dataset today without code interpreter or Wolfram Alpha. It also reports running hundreds of experiments using GPT-4 with code interpreter, reaching roughly 86% accuracy—suggesting the curve is steeper than forecasters projected.

What role do story-generation and vocabulary refinement play in the argument about near-term language capability?

A GPT-4-generated 112-page novel is presented as “interesting,” even if not human-level. The transcript argues that fine-tuning on an author’s work could make models “very very close” to producing high-quality novels. It then uses Claude 2 to process the GPT-4 novel: Claude 2 is asked to find sentences where vocabulary can be made less generic, and the transcript highlights specific replacements like “crystalline,” “ethereal,” and “inaugurable,” portraying this as a step toward more engaging prose.

Why does the transcript treat AGI timelines and “AI auto-money” as a major shift from earlier predictions?

It cites 2021-era AGI forecasts placing arrival in the late 2030s or early 2040s. It then contrasts that with newer expectations around 2026. Mustafa Suleyman (Inflection AI) is quoted as suggesting an AI could be built to make one million dollars in as little as two years, with the system handling strategy, research, and product design, potentially with only limited human approval. The transcript frames this as a rapid economic transformation scenario if such an AI were deployed.

How does regulation enter the picture, and what deployment consequence is suggested?

The transcript references an FTC investigation into OpenAI, describing it as a detailed document focused on internal communications about hallucinations and privacy/inaccuracy risks. It notes that the investigation process could lead to large penalties. The implied consequence is that companies may become more reluctant to publicly deploy models if enforcement results in significant costs.

Review Questions

  1. Which specific benchmark forecast from 2021 is contradicted by current GPT-4 performance, and what numbers are given for both the forecast and the present results?
  2. What multimodal tasks are demonstrated with Bard/PaLM 2 (translation, image/meme interpretation, Google Lens use), and what limitations are mentioned (e.g., face handling)?
  3. How does the transcript connect story-generation (novels, vocabulary refinement) to broader claims about reading comprehension and near-term capability gains?

Key Points

  1. PaLM 2-based Bard is presented as delivering more human-like translation speech quality, including for languages like Swahili, and as outperforming Google Translate on quality.

  2. Bard’s multimodal capability is illustrated through meme interpretation that links visual details to references, reads embedded text, and explains humor.

  3. A 2021 math benchmark forecast projected reaching a score of 80 only in 2028, but GPT-4 is claimed to already be at 78%, and about 86% with code interpreter.

  4. Full-length story capability is framed as approaching usefulness: a GPT-4-generated 112-page novel is treated as “interesting,” and Claude 2 can make the prose less generic by swapping in more vivid vocabulary.

  5. AGI timelines are described as compressing sharply, from the late 2030s/early 2040s to around 2026, alongside “AI auto-money” predictions tied to Inflection AI’s Mustafa Suleyman.

  6. Regulatory pressure from an FTC investigation into OpenAI is portrayed as a potential driver of greater caution about public model deployment.

  7. Competitive momentum is reinforced by xAI’s reported $200 million signing bonuses for AI researchers.

Highlights

PaLM 2-based Bard is credited with more human-like text-to-speech for translated content, including Swahili, and with translation quality that beats Google Translate.
A meme-based prompt leads Bard to identify a pizza shaped like the Death Star, read the meme text, and explain the humor using the reference’s symbolism.
A math benchmark forecast predicted 80 only in 2028, yet GPT-4 is claimed at 78% already, with code interpreter pushing results to roughly 86%.
AGI expectations are pulled forward dramatically: 2021 forecasts pointed to late 2030s/early 2040s, while newer estimates cluster around 2026 and even a two-year “make $1 million” scenario is floated.
An FTC investigation into OpenAI is framed as targeting internal communications about hallucinations and privacy/inaccuracy risks, with possible consequences for public deployment.
