Caught Distilling from Claude?
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Anthropic alleges 24,000 fake accounts generated 16 million exchanges to distill Claude-like capabilities, focusing on reasoning, tool use, and coding.
Briefing
A fresh wave of allegations claims Chinese AI labs are running large-scale “distillation attacks” to copy capabilities from Claude—using fleets of fake accounts to repeatedly query the model with near-identical prompts. Anthropic’s report, which focuses on detecting and preventing distillation, alleges 24,000 fake accounts generated 16 million exchanges aimed at extracting Claude’s strengths in reasoning, tool use, and coding. The accusation matters because it suggests a practical pathway for competitors to convert proprietary model behavior into trainable signals, potentially accelerating how quickly new models close the gap with frontier systems.
The timing is a central point of scrutiny. The claims surface just as multiple labs are releasing new models and as DeepSeek appears poised for its next major release. The transcript also links the broader moment to the earlier market shock tied to DeepSeek’s emergence, arguing that the current burst of accusations may be more than coincidence, especially since the labs named in Anthropic’s write-up (DeepSeek, Moonshot AI, and MiniMax) are also advancing their own capabilities in areas like coding and agentic workflows.
Anthropic’s breakdown attributes only 150,000 of the 16 million exchanges to DeepSeek, but frames DeepSeek’s approach as more “surgical.” The alleged goal is to extract reasoning abilities across tasks and to use Claude as a reward model for reinforcement learning—an “LLM-as-a-judge” style setup where outputs are graded against rubric-like criteria. The report also alleges attempts to learn Claude’s refusal behavior by generating “censorship-safe alternatives” to policy-sensitive prompts, effectively probing how the model handles boundary conditions.
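To make the alleged "LLM-as-a-judge" setup concrete, here is a minimal sketch of how a judge model's rubric-based grade can be turned into a reinforcement-learning reward. This is an illustrative assumption, not Anthropic's or DeepSeek's actual pipeline: `call_judge` is a stub standing in for a real model API, and the rubric and prompt format are hypothetical.

```python
import re

# Hypothetical rubric the judge is asked to apply.
RUBRIC = "Score 1-10 for correctness, clarity, and completeness."

def build_judge_prompt(question: str, answer: str) -> str:
    # Assemble the grading prompt the judge model would receive.
    return (f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\n"
            "Reply with 'Score: <n>' only.")

def call_judge(prompt: str) -> str:
    # Stub: a real system would send `prompt` to the judge model here.
    return "Score: 7"

def reward_from_judge(question: str, answer: str) -> float:
    # Parse the judge's reply and normalize the 1-10 score to a
    # [0, 1] reward; fall back to 0.0 if the reply is unparseable.
    reply = call_judge(build_judge_prompt(question, answer))
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) / 10.0 if match else 0.0

print(reward_from_judge("What is 2+2?", "4"))  # 0.7 with the stub judge
```

The key design point is that the scalar reward replaces a hand-built reward model, which is why access to a strong proprietary judge is valuable.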
Moonshot AI is accused of more than 3.4 million exchanges focused on general capabilities: reasoning, tool use, coding, data analysis, computer-use agent development, and computer vision. MiniMax is accused of the largest share, over 13 million exchanges, again with emphasis on tool use, orchestration, and agentic coding. The transcript notes skepticism about targeting Claude for computer vision specifically, pointing out that Gemini models (and to a lesser extent GPT-family models) have tended to perform strongly in that area.
Beyond the distillation claims, the transcript highlights a parallel controversy: accusations that Anthropic itself scraped copyrighted books at scale. Elon Musk is mentioned as criticizing Anthropic for training on large volumes of books and for paying more than $1.5 billion to settle a copyright dispute. It also references a complex set of actions around books—buying physical copies, scanning them, and destroying the originals—to support a “one copy” argument. That backdrop fuels a broader debate about whether model outputs are copyrightable in a way that limits downstream training.
Finally, the transcript pivots to what distillation actually is, tracing it to a classic deep learning paper associated with Geoffrey Hinton and colleagues. Distillation typically trains a smaller model to imitate a larger one, often by learning from the larger model's outputs or from its logits (the raw scores that, after softmax, define the full probability distribution over next tokens), not just the final predicted token. The transcript suggests that today's best models may often not be served publicly at all, with smaller "flash" or "mini" models acting as distilled versions, which makes it harder to verify who distills whom. Amid that uncertainty, the accusations may recur, but the underlying question remains unresolved: whether scraping and training on proprietary model behavior (or outputs) is fair game, or a form of misappropriation.
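A minimal sketch of the logits-based objective from the classic distillation formulation: the student is trained to match the teacher's temperature-softened probability distribution, typically via a KL-divergence loss. This is a simplified NumPy illustration of the loss term only, not any lab's training code.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to a probability distribution.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions.
    # A higher temperature exposes more of the teacher's "dark knowledge"
    # about relative probabilities of non-top tokens.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))           # ~0.0 (perfect match)
print(distillation_loss(teacher, [0.0, 2.0, -1.0]))  # > 0 (mismatch penalized)
```

This is why logits-based distillation transfers more than output-based distillation: the full distribution encodes how the teacher ranks every alternative token, whereas API access to a proprietary model usually exposes only sampled text, forcing attackers to distill from outputs instead.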
Cornell Notes
Anthropic alleges that Chinese AI labs used large-scale distillation attacks to extract Claude capabilities. The report claims 24,000 fake accounts generated 16 million exchanges, targeting Claude’s reasoning, tool use, and coding. DeepSeek is accused of using Claude as a reward model for reinforcement learning and probing refusal behavior via “censorship-safe” prompt rewrites, while Moonshot AI and MiniMax are accused of broader capability extraction focused on agentic coding and tool orchestration. The transcript also ties the accusations to ongoing disputes about training data and copyright, including claims that Anthropic faced major settlements over book-based training. Distillation itself is framed as a standard technique, training smaller models on a larger model’s outputs or logits, yet the ethics and legality of using proprietary model behavior remain contested.
What does Anthropic’s distillation-attack allegation claim, in concrete terms?
How is DeepSeek’s alleged strategy described differently from the other named labs?
What capabilities are Moonshot AI and Miniax accused of targeting?
Why does the transcript connect the allegations to broader copyright and training-data disputes?
What is distillation, and how does it relate to the allegations?
Review Questions
- What specific mechanisms does Anthropic’s report attribute to distillation attacks (e.g., fake accounts, repeated similar prompts, reward-model grading, refusal probing)?
- How do logits-based distillation and output-based distillation differ, and why does that distinction matter for understanding what gets “copied”?
- Why does the transcript argue that timing and model releases make the allegations feel more than coincidental?
Key Points
1. Anthropic alleges 24,000 fake accounts generated 16 million exchanges to distill Claude-like capabilities, focusing on reasoning, tool use, and coding.
2. DeepSeek is accused of using Claude as a reward model for reinforcement learning and of probing refusal behavior through “censorship-safe” prompt rewrites.
3. Moonshot AI is accused of more than 3.4 million exchanges targeting broad capabilities including tool use, coding, data analysis, agent development, and computer vision.
4. MiniMax is accused of over 13 million exchanges centered on tool use, orchestration, and agentic coding.
5. The transcript links the distillation allegations to ongoing disputes about training data and copyright, including a referenced $1.5 billion Anthropic settlement over book-based training.
6. Distillation is described as a standard technique for training smaller models from larger ones, often using logits rather than only final outputs.
7. Verification is portrayed as difficult because the biggest models may be served via distilled smaller versions, obscuring direct lineage.