
Caught Distilling from Claude?

Sam Witteveen·
5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Anthropic alleges 24,000 fake accounts generated 16 million exchanges to distill Claude-like capabilities, focusing on reasoning, tool use, and coding.

Briefing

A fresh wave of allegations claims Chinese AI labs are running large-scale “distillation attacks” to copy capabilities from Claude, using fleets of fake accounts to repeatedly query the model with near-identical prompts. Anthropic’s report, which focuses on detecting and preventing distillation, alleges that 24,000 fake accounts generated 16 million exchanges aimed at extracting Claude’s strengths in reasoning, tool use, and coding. The accusation matters because it suggests a practical pathway for competitors to convert proprietary model behavior into trainable signals, potentially shortening how long it takes new models to close the gap with frontier systems.

The timing is a central point of scrutiny. The claims surface just as multiple labs are releasing new models and as DeepSeek appears poised for its next major release. The transcript also links the broader moment to the earlier market shock tied to DeepSeek’s emergence, arguing that the current burst of accusations may be more than coincidence, especially since the labs named in Anthropic’s write-up (DeepSeek, Moonshot AI, and MiniMax) are also advancing their own capabilities in areas like coding and agentic workflows.

Anthropic’s breakdown attributes only 150,000 of the 16 million exchanges to DeepSeek, but frames DeepSeek’s approach as more “surgical.” The alleged goal is to extract reasoning abilities across tasks and to use Claude as a reward model for reinforcement learning—an “LLM-as-a-judge” style setup where outputs are graded against rubric-like criteria. The report also alleges attempts to learn Claude’s refusal behavior by generating “censorship-safe alternatives” to policy-sensitive prompts, effectively probing how the model handles boundary conditions.
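
The alleged “LLM-as-a-judge” setup can be sketched in a few lines. Everything below is hypothetical and illustrative: `judge` is a stand-in for a call to the graded frontier model, the rubric criteria are invented, and the scoring heuristic exists only so the sketch runs without any API.

```python
# Hypothetical sketch of rubric-based "LLM-as-a-judge" reward scoring:
# a strong model grades candidate outputs against rubric criteria, and
# the averaged scores become scalar rewards for reinforcement learning.

RUBRIC = {
    "correctness": "Is the final answer right?",
    "reasoning": "Are the intermediate steps valid?",
    "tool_use": "Were tools invoked appropriately?",
}

def judge(prompt: str, candidate: str, criterion: str) -> float:
    """Stand-in for a frontier-model call returning a 0-1 score for one
    rubric criterion. A real setup would send the prompt, the candidate,
    and the criterion text to the judging model's API."""
    # Toy heuristic so this sketch is self-contained:
    return 1.0 if criterion in candidate else 0.0

def rubric_reward(prompt: str, candidate: str) -> float:
    """Average per-criterion scores into a single scalar reward."""
    scores = [judge(prompt, candidate, c) for c in RUBRIC]
    return sum(scores) / len(scores)

# The reward would then rank sampled completions for RL fine-tuning:
completions = ["uses correctness and reasoning", "irrelevant text"]
ranked = sorted(completions, key=lambda c: rubric_reward("2+2?", c), reverse=True)
```

The point of the alleged scheme is that the judging model’s preferences, not its raw text, become the training signal.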

Moonshot AI is accused of more than 3.4 million exchanges focused on general capabilities: reasoning, tool use, coding, data analysis, computer-use agent development, and computer vision. MiniMax is accused of the largest share, over 13 million exchanges, again with emphasis on tool use, orchestration, and agentic coding. The transcript notes skepticism about targeting computer vision specifically, pointing out that Gemini models (and to a lesser extent GPT-family models) have tended to perform strongly in that area.

Beyond the distillation claims, the transcript highlights a parallel controversy: accusations that Anthropic itself scraped copyrighted books at scale. Elon Musk is mentioned as criticizing Anthropic for training on large volumes of books and for paying more than $1.5 billion to settle a copyright dispute. It also references a complex set of actions around books—buying physical copies, scanning them, and destroying the originals—to support a “one copy” argument. That backdrop fuels a broader debate about whether model outputs are copyrightable in a way that limits downstream training.

Finally, the transcript pivots to what distillation actually is, tracing it to the classic deep learning paper by Geoffrey Hinton and colleagues, “Distilling the Knowledge in a Neural Network.” Distillation typically trains a smaller model to imitate a larger one, often by learning from its outputs or from its logits (the pre-softmax scores that determine the full probability distribution over next tokens), not just the final predicted token. The transcript suggests that today’s best models may often be unavailable publicly, with smaller “flash” or “mini” models serving as distilled versions, making it harder to verify who distills whom. In that uncertainty the accusations may recur, but the underlying question remains unresolved: is scraping and training on proprietary model behavior (or outputs) fair game, or a form of misappropriation?
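
The logits-based objective from the Hinton-style setup can be sketched as follows. This is a minimal, self-contained toy in plain Python; real training would use tensors and a deep learning framework, and an API-only “attack” would see at most sampled tokens rather than the full distribution.

```python
# Minimal sketch of Hinton-style knowledge distillation: the student is
# trained to match the teacher's softened next-token distribution, not
# just the single top token.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions: the training
    signal that logits-based distillation minimizes."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 1.0, 0.1]
distill_loss(teacher, teacher)           # → 0.0 (student matches teacher exactly)
distill_loss(teacher, [0.1, 1.0, 2.0])   # positive: mismatched student is penalized
```

Output-based distillation replaces the full distribution with sampled text, which is exactly why querying a model through its public API, as alleged here, can still recover a noisy version of this training target.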

Cornell Notes

Anthropic alleges that Chinese AI labs used large-scale distillation attacks to extract Claude capabilities. The report claims 24,000 fake accounts generated 16 million exchanges, targeting Claude’s reasoning, tool use, and coding. DeepSeek is accused of using Claude as a reward model for reinforcement learning and probing refusal behavior via “censorship-safe” prompt rewrites, while Moonshot AI and MiniMax are accused of broader capability extraction focused on agentic coding and tool orchestration. The transcript also ties the accusations to ongoing disputes about training data and copyright, including claims that Anthropic faced major settlements over book-based training. Distillation itself is framed as a standard technique, training smaller models on a larger model’s outputs or logits, yet the ethics and legality of using proprietary model behavior remain contested.

What does Anthropic’s distillation-attack allegation claim, in concrete terms?

Anthropic’s report alleges 24,000 fake accounts were used to generate 16 million exchanges designed to extract Claude’s capabilities. The exchanges are described as being driven by clusters of accounts sending very similar prompts repeatedly, aiming to recover strengths in reasoning, tool use, and coding.

How is DeepSeek’s alleged strategy described differently from the other named labs?

DeepSeek is said to account for only about 150,000 of the 16 million exchanges, but the alleged tactics are framed as more targeted: extracting reasoning across tasks, using Claude as a reward model for reinforcement learning (LLM-as-a-judge/rubric grading), and creating “censorship-safe alternatives” to learn how Claude handles refusals.

What capabilities are Moonshot AI and Miniax accused of targeting?

Moonshot AI is accused of more than 3.4 million exchanges focused on general abilities such as reasoning, tool use, coding, data analysis, computer-use agent development, and computer vision. MiniMax is accused of over 13 million exchanges, with emphasis on tool use, orchestration, and agentic coding.

Why does the transcript connect the allegations to broader copyright and training-data disputes?

It highlights that Anthropic has faced accusations related to training on copyrighted books, including a settlement referenced as $1.5 billion and claims about buying physical copies, scanning them, and destroying the originals. That context feeds a debate over whether scraping and training on proprietary model outputs/behavior is permissible, and whether outputs are copyrightable in a way that constrains downstream training.

What is distillation, and how does it relate to the allegations?

Distillation is framed as training a smaller model to imitate a larger one, often by learning from the larger model’s outputs and, in more advanced setups, from its logits (the pre-softmax scores that determine the full probability distribution over next tokens). The transcript suggests that modern model ecosystems may rely heavily on distillation, especially when the largest models are not publicly served, making it difficult to verify whether competitors are copying via distillation or via other means.

Review Questions

  1. What specific mechanisms does Anthropic’s report attribute to distillation attacks (e.g., fake accounts, repeated similar prompts, reward-model grading, refusal probing)?
  2. How do logits-based distillation and output-based distillation differ, and why does that distinction matter for understanding what gets “copied”?
  3. Why does the transcript argue that timing and model releases make the allegations feel more than coincidental?

Key Points

  1. Anthropic alleges 24,000 fake accounts generated 16 million exchanges to distill Claude-like capabilities, focusing on reasoning, tool use, and coding.
  2. DeepSeek is accused of using Claude as a reward model for reinforcement learning and of probing refusal behavior through “censorship-safe” prompt rewrites.
  3. Moonshot AI is accused of more than 3.4 million exchanges targeting broad capabilities including tool use, coding, data analysis, agent development, and computer vision.
  4. MiniMax is accused of over 13 million exchanges centered on tool use, orchestration, and agentic coding.
  5. The transcript links the distillation allegations to ongoing disputes about training data and copyright, including a referenced $1.5 billion Anthropic settlement over book-based training.
  6. Distillation is described as a standard technique for training smaller models from larger ones, often using logits rather than only final outputs.
  7. Verification is portrayed as difficult because the biggest models may be served via distilled smaller versions, obscuring direct lineage.

Highlights

Anthropic’s report claims 24,000 fake accounts produced 16 million exchanges to extract Claude’s reasoning, tool use, and coding abilities.
DeepSeek’s alleged method includes turning Claude into a reward model for reinforcement learning and learning refusal patterns via rewritten prompts.
MiniMax is accused of the largest share, over 13 million exchanges, focused on tool orchestration and agentic coding.
Distillation is framed as learning from a larger model’s outputs or logits, which can make capability transfer technically straightforward even when the largest models aren’t public.
