
AI - 2024AD: 212-page Report (from this morning) Fully Read w/ Highlights

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Frontier models are increasingly converging in capability because of overlapping pre-training data, shifting competition toward scaling and deployment speed.

Briefing

The seventh annual “State of AI” report, released by Air Street Capital, frames 2024 as a year when leading models stopped feeling like separate species and started converging, while costs, compute, and multimodal capability kept accelerating. The report’s headline theme is that major systems are increasingly overlapping in what they can do because they’re trained on similar large-scale data, pushing models such as Claude 3.5 Sonnet, Grok 2, and Gemini 1.5 toward GPT-4-level behavior rather than diverging into fundamentally different approaches. That convergence matters because it shifts the competitive question from “who has the best architecture” to “who can scale reliably, ship safely, and monetize faster.”

The report also revisits earlier forecasts and measures them against reality. One prediction—about spending more than a billion dollars to train a single frontier model—was judged harshly at the time, but later reporting suggests the scale is now firmly in the billions. OpenAI’s projected 2024 training costs were cited as roughly $3 billion for frontier-model training, excluding additional compute used for research and iteration. The same reporting also pointed to a longer runway for profitability, with internal sources suggesting OpenAI may not turn a profit until 2029. The implication is that the frontier race is not just about breakthroughs; it’s about sustained, expensive compute cycles.

Multimodality is treated as a practical turning point rather than a novelty. Meta’s Movie Gen is highlighted for generating audio alongside video, and the transcript contrasts that paper-based progress with tools that let users create short clips immediately—upload an image, choose an effect, and generate “melt,” “explode,” or “squish” style transformations. The report’s broader point is that models are moving from single-task text generation into systems that can manipulate multiple media streams in one workflow.

Science and medicine are also pulled into the spotlight. The Nobel Prizes in Physics and Chemistry are mentioned in connection with neural-network-driven work (including AlphaFold), reinforcing a narrative that AI is increasingly “eating science” by accelerating discovery. One standout example from the report is “brain LM,” a transformer-based model that can be fine-tuned to predict clinical variables such as age and anxiety disorders from brain activity, and potentially support in-silico medication testing by simulating biologically meaningful responses.

Hardware and scaling dynamics get their own section. A chart tracks how quickly Nvidia data center GPUs are released and how teraflops per chip rise over time, while also noting that clustering more GPUs across data centers multiplies the effective compute. The transcript ties this to a belief that the next two years of progress are largely “baked in” by sheer scale.
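
To make that scaling intuition concrete, here is a back-of-the-envelope sketch of how per-chip throughput and cluster size multiply into effective compute. The per-chip teraflop figures and the utilization factor are rough public ballpark assumptions, not numbers from the report.

```python
# Back-of-the-envelope: effective cluster compute grows as per-chip
# throughput times chip count. Figures below are rough public ballpark
# assumptions, not numbers from the report.

# (chip name, approximate dense BF16 teraflops per chip)
chips = {
    "A100": 312,
    "H100": 989,
}

def cluster_petaflops(tflops_per_chip: float, num_chips: int, utilization: float = 0.4) -> float:
    """Effective sustained compute of a cluster in petaflops.

    `utilization` models the gap between peak and achieved FLOPs during
    real training runs (a common rule-of-thumb range is 30-50%).
    """
    return tflops_per_chip * num_chips * utilization / 1_000

for name, tflops in chips.items():
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7,} x {name}: ~{cluster_petaflops(tflops, n):,.0f} PFLOP/s sustained")
```

The point of the arithmetic is the multiplication itself: a newer chip roughly triples per-chip throughput, and a 10x larger cluster multiplies it again, which is why the transcript treats near-term progress as largely “baked in.”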

Yet the report doesn’t treat risks as solved. Jailbreaking remains an active problem, with stealth attacks and compromised instruction hierarchies reported to persist for hours. A taxonomy of real-world misuse is used to ground concerns in current harms—especially impersonation scams and non-consensual intimate image generation—rather than hypothetical future catastrophes.

Finally, the report’s next-year predictions are presented as a mix of firm and vague claims. The transcript’s own critique focuses on how hard it is to measure “meaningful changes” or “breakout status” without clear metrics. It ends with a forward-looking bet: that the next major leap will come from models that break state-of-the-art performance across multiple modalities at once, alongside continued rapid scaling and valuation growth.

Cornell Notes

The report’s core message is that frontier AI models are converging in capability because they share heavy overlap in pre-training data, making the competitive edge less about novelty and more about scaling, shipping, and cost control. It places multimodality at the center of 2024 progress, highlighting systems that generate or transform multiple media types and pointing to tools that make these capabilities immediately usable. It also connects AI to science and medicine, including “brain LM,” which can predict clinical variables from brain activity and potentially support in-silico testing. Despite momentum, the report stresses that safety work is not “done”: jailbreaking techniques still work, and real-world misuse patterns remain concentrated in impersonation and non-consensual intimate imagery. The outlook for 2025 mixes measurable predictions with vague language, making outcomes harder to verify.

What does “model convergence” mean in practical terms, and why does it matter for competition?

Convergence here refers to major model families—trained on overlapping large-scale data—ending up with similar capabilities. The transcript links this to heavy overlap in pre-training datasets, suggesting that systems like Claude 3.5, Grok 2, and Gemini 1.5 are catching up to GPT-4–level behavior. That matters because it reduces the advantage of being “different” and increases the advantage of being faster and cheaper to scale, iterate, and deploy.
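
One hypothetical way to quantify that kind of convergence is to compare models’ benchmark score profiles directly. The sketch below uses invented placeholder scores, not real leaderboard numbers; it simply shows the sort of metric (mean score gap) under which “convergence” becomes measurable rather than impressionistic.

```python
# Illustrative only: "convergence" made measurable as the mean gap between
# models' benchmark profiles. All scores are invented placeholders.
scores = {
    "model_a": [88.0, 76.5, 90.2, 54.1],  # e.g. four benchmark results
    "model_b": [87.2, 75.9, 89.5, 53.0],
    "model_c": [86.8, 74.1, 91.0, 52.4],
}

def mean_gap(u: list[float], v: list[float]) -> float:
    """Mean absolute score difference in points; smaller = more converged."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: mean gap {mean_gap(scores[a], scores[b]):.2f} points")
```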

How do the report’s cost figures change the way you think about the frontier race?

The transcript cites reporting that OpenAI’s projected 2024 training costs are about $3 billion for frontier-model training, excluding research compute amortization and additional iteration on smaller models. It further suggests the full “o1” stack (base model, generator, and fine-tuned final model) could plausibly exceed $1 billion. It also mentions a much larger annual figure for 2026: nearly $10 billion in training costs, plus over $5 billion in research compute. The takeaway: progress is constrained by sustained compute budgets, not just ideas.
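
As a rough illustration of what a sustained compute budget means in hardware terms, the arithmetic below converts the cited $3 billion figure into GPU-hours. The dollars-per-GPU-hour rate is an assumed placeholder (market rates vary widely), so treat the output as order-of-magnitude only.

```python
# Order-of-magnitude conversion of a dollar training budget into GPU-hours.
# The $3B figure comes from the reporting cited above; the $/GPU-hour rate
# is an assumption for illustration, not a figure from the report.
budget_usd = 3e9
usd_per_gpu_hour = 2.50                      # assumed cloud-style rate
gpu_hours = budget_usd / usd_per_gpu_hour

print(f"${budget_usd / 1e9:.0f}B at ${usd_per_gpu_hour}/GPU-hr ~ {gpu_hours / 1e9:.1f}B GPU-hours")
# Spread evenly over one year (8,760 hours), that implies a fleet of roughly:
print(f"~{gpu_hours / 8760:,.0f} GPUs running continuously for a year")
```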

Why is multimodality treated as more than a feature in 2024?

Multimodality is framed as a shift in what models can do end-to-end: generate audio alongside video, and transform media with user-provided inputs. Movie Gen is highlighted for producing audio and video together, while tool-based workflows (upload an image, pick an effect like melt/explode/squish) show how these capabilities are becoming interactive and product-ready.
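
For a sense of what “product-ready” looks like here, the sketch below mocks up the upload-and-transform loop as a generic HTTP call. The endpoint, parameters, and effect names wired into code are invented for illustration; no specific product’s API is being described.

```python
# Hypothetical sketch of the "upload an image, pick an effect" workflow
# the transcript describes. The endpoint and parameters are invented for
# illustration; no real product API is implied.
import requests

EFFECTS = {"melt", "explode", "squish"}

def generate_clip(image_path: str, effect: str, api_url: str, api_key: str) -> bytes:
    """Upload an image, request an effect transformation, return video bytes."""
    if effect not in EFFECTS:
        raise ValueError(f"unknown effect {effect!r}; choose from {sorted(EFFECTS)}")
    with open(image_path, "rb") as f:
        resp = requests.post(
            api_url,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"image": f},          # the user-provided input
            data={"effect": effect},     # the chosen transformation
            timeout=120,
        )
    resp.raise_for_status()
    return resp.content  # short video clip, e.g. MP4 bytes
```

A real service would likely add asynchronous job polling, since video generation is rarely fast enough to answer within a single request.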

What safety message comes through most strongly?

Jailbreaking is described as unsolved. The transcript summarizes page-after-page evidence that many jailbreak techniques still work—stealth attacks, sleeper agents, and compromised instruction hierarchies can persist for hours. It also argues that the focus may be misaligned: harms are already happening now (impersonation scams and non-consensual intimate images), even if future AGI misuse could look different.

How does the report connect AI to scientific discovery and medical applications?

It links AI to Nobel-recognized breakthroughs (via neural-network-driven work like AlphaFold) and then drills into a medical example: “brain LM.” The transcript says brain LM can be fine-tuned to predict clinical variables such as age and anxiety disorders from brain activity, uses a transformer architecture with self-supervised training (masking future states), and may enable in-silico experimentation—simulating brain responses to test medications for depression without starting with patients.
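
A minimal sketch of that training objective, assuming a PyTorch-style setup: mask the later timesteps of a brain-activity series and train a small transformer to reconstruct them. Dimensions, masking scheme, and architecture here are illustrative guesses, not the actual brain LM implementation.

```python
# Minimal sketch of the self-supervised objective described for "brain LM":
# mask future brain-activity states and train a transformer to reconstruct
# them. Shapes and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

B, T, R = 8, 64, 424      # batch, timesteps, brain regions (R is an assumption)
D = 128                   # model width

class TinyBrainModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(R, D)                     # project activity -> tokens
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, R)                      # reconstruct activity

    def forward(self, x):
        return self.head(self.encoder(self.embed(x)))

model = TinyBrainModel()
x = torch.randn(B, T, R)                  # stand-in for recorded brain activity
masked = x.clone()
masked[:, T // 2:, :] = 0.0               # mask the "future" half of each series

pred = model(masked)
loss = nn.functional.mse_loss(pred[:, T // 2:], x[:, T // 2:])  # predict masked future
loss.backward()
print(f"reconstruction loss on masked future states: {loss.item():.4f}")
```

For the fine-tuning step the transcript mentions, the reconstruction head would be swapped for a small clinical head (e.g., regressing age or classifying an anxiety disorder) on top of the pretrained encoder.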

What does the transcript suggest about measuring next-year predictions?

It criticizes predictions that rely on qualitative terms like “meaningful,” “breakout,” or “challengers failing to make any meaningful dent,” because those lack clear metrics. It contrasts that with more objective claims (e.g., a research paper generated by an AI scientist being accepted at a major ML conference), which the transcript doubts due to disclosure/ethical framing and likely conference policies.
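
That measurability critique can be made concrete: a checkable prediction carries an objective resolution criterion, while a vague one does not. The structure below is invented here purely to illustrate the distinction.

```python
# Illustrative: the difference between a checkable and a vague prediction
# is an objective resolution criterion. This structure is invented here.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prediction:
    claim: str
    resolves_by: str            # deadline, ISO date
    criterion: Optional[str]    # objective test, or None if unmeasurable

vague = Prediction(
    claim="Challengers fail to make any meaningful dent in incumbents' dominance",
    resolves_by="2025-12-31",
    criterion=None,             # "meaningful" has no agreed metric
)
firm = Prediction(
    claim="An AI-scientist-generated paper is accepted at a major ML conference",
    resolves_by="2025-12-31",
    criterion="acceptance notification from a top-tier venue, AI authorship disclosed",
)

for p in (vague, firm):
    status = "checkable" if p.criterion else "unfalsifiable as stated"
    print(f"{status}: {p.claim}")
```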

Review Questions

  1. Which factors in the transcript are used to explain why leading models are converging in capability?
  2. What evidence is cited that jailbreaking remains an active, unsolved problem?
  3. How does “brain LM” differ from typical text-only models, and what medical use cases does it suggest?

Key Points

  1. Frontier models are increasingly converging in capability because of overlapping pre-training data, shifting competition toward scaling and deployment speed.

  2. OpenAI’s training costs are described as multi-billion-dollar at the frontier level, with additional research compute and iteration costs pushing total spend higher.

  3. Multimodality is moving from research novelty to product workflows, with systems generating or transforming audio and video together.

  4. AI is being tied to scientific acceleration and medical prediction, including brain LM’s ability to predict clinical variables from brain activity and support in-silico testing.

  5. Safety remains unresolved: jailbreaking techniques still work, and real-world misuse is concentrated in impersonation and non-consensual intimate imagery.

  6. Hardware release cadence and rising per-chip compute (plus larger GPU clustering) are presented as major drivers of near-term progress.

  7. Next-year predictions are harder to verify when they rely on vague terms rather than measurable benchmarks.

Highlights

Leading models are portrayed as converging toward GPT-4–level behavior due to shared training-data overlap, making differentiation harder and execution more important.
OpenAI’s projected frontier training costs are cited around $3 billion for 2024, with much larger multi-year compute budgets implied for continued scaling.
Movie Gen is highlighted as generating audio alongside video, while interactive tools let users create short multimodal clips immediately.
brain LM is singled out as a transformer-based model that can predict clinical variables from brain activity and potentially enable in-silico medication testing.
Jailbreaking is described as far from solved, with stealth attacks and compromised instruction hierarchies reported to persist for hours.
