
OpenAI Is Actually Terrible

The PrimeTime · 5 min read

Based on The PrimeTime's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing to their content.

TL;DR

OpenAI’s distillation-related complaints are portrayed as hypocritical when set against allegations that OpenAI trained on data and code with potentially incompatible licenses.

Briefing

OpenAI’s public complaints about DeepSeek R1 are framed as hypocrisy: the same company that relies on large-scale training data and model distillation is portrayed as arguing that others’ use of similar techniques violates rules. The transcript points to OpenAI’s Terms of Service, specifically claims that commercial “distillation” is not allowed, then contrasts that with alleged past behavior: training on GPL-licensed code, scraping GitHub in ways that could break licenses, using images with unclear permissions, and drawing on Twitter data under terms that may not extend to third-party training. The core claim is that OpenAI’s stance is “terrible” not because copyright law is settled, but because it targets competitors while benefiting from the very ecosystem of data reuse it criticizes.

That tension is then broadened into a policy argument about copyright and AI outputs. A U.S. Copyright Office report is cited as saying existing copyright principles can flex to cover generative AI, with protection for AI outputs only when a human author contributes “sufficient expressive elements.” The transcript treats this as internally inconsistent: if generative systems can ingest and remix copyrighted material, then later producing something that can be copyrighted seems to undermine the logic of ownership. The discussion leans on a common internet grievance—people being sued or criticized for actions they themselves rely on—suggesting a future where everyone is entangled in litigation while practical enforcement remains unclear.

The transcript also pushes back on the idea that DeepSeek R1 proves AI is “over” or that future models will be cheap with no major investment. It argues that R1’s reportedly low cost (a figure of roughly $5.5 million, which the transcript suggests may understate the true expense) still depends on expensive groundwork: R1 is described as distilling or “drafting off” a model that cost “billions” to build. The key logic is “no free lunch”: a cheap model can’t exist without costly training or access to a costly base model. In that framing, R1’s efficiency is real but not magical: it’s the result of performance work, including writing its own version of CUDA rather than using CUDA directly, and optimizing the inference path.

Finally, the transcript treats DeepSeek’s behavior as evidence of how LLMs work under the hood. When asked “who” a model is, it may answer incorrectly (e.g., returning “ChatGPT” variants), and the transcript attributes that to next-token prediction: the model generates the most likely continuation, and repeated jokes or expectations can create a self-fulfilling pattern in outputs. Overall, the message is less about whether AI can be regulated and more about how incentives, licensing, and model economics collide—producing both legal confusion and competitive posturing.

Cornell Notes

The transcript argues that OpenAI’s complaints about DeepSeek R1 are hypocritical, pointing to alleged past training practices (including potentially license-violating code and image/data scraping) while OpenAI claims distillation for commercial use should be illegal. It then cites a U.S. Copyright Office report saying generative AI outputs can be copyrighted only when a human author adds sufficient expressive elements, and questions how that fits with the reality of AI remixing copyrighted inputs. On the competition side, it rejects the idea that R1 proves AI will become cheap overnight, claiming R1’s performance likely depends on distilling from a much more expensive “multi-billion” model. It also explains odd identity answers as normal behavior for next-token prediction, where expected continuations (including jokes) can shape outputs.

Why does the transcript call OpenAI’s position on distillation “hypocritical”?

It contrasts OpenAI’s Terms of Service claim that commercial distillation is illegal with allegations that OpenAI previously trained on large amounts of third-party material—such as GPL-licensed code, GitHub content, images with unclear licensing, and data from Twitter that the transcript argues wasn’t authorized for third-party training. The point isn’t only that distillation is disputed; it’s that OpenAI is portrayed as benefiting from the same data-reuse ecosystem while criticizing competitors for using similar techniques.

What does the U.S. Copyright Office report claim about copyright for generative AI outputs?

The transcript cites the report’s conclusion that existing copyright principles are flexible enough to apply to new AI technology, but protection for generative AI outputs requires a human author to have determined “sufficient expressive elements.” In other words, copyright protection hinges on human creative contribution rather than treating AI output as automatically protected.

Why does the transcript question the logic of copyright protection for AI outputs?

It treats the policy as hard to reconcile with how generative AI works: if models can “break” or bypass copyright through training and then later produce outputs that can be copyrighted, the transcript frames that as undermining ownership rules. It uses an analogy of someone taking a ball and claiming they created it, arguing that the chain from input to output feels like stolen work followed by a new claim of rights.

What economic argument does the transcript make against the idea that R1 proves AI is “cheap now”?

It argues that even if R1’s reported cost is low (the transcript mentions a figure of roughly $5.5 million, while suggesting it may understate the true expense), R1 likely depends on distilling from a far more expensive base model built at “billions” scale. The core claim is “no free lunch”: a cheap model still requires costly training or access to a costly base model.
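Distillation, as discussed throughout the transcript, generally means training a small “student” model to match a large “teacher” model’s output distribution. A minimal NumPy sketch of the standard soft-label loss (a generic illustration, not DeepSeek’s or OpenAI’s actual pipeline; the logits below are made up):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature softens the distribution."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Minimizing this trains the student to reproduce the teacher's
    outputs -- the teacher's expensive training is what the cheap
    student 'drafts off'.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's current guess
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

teacher = np.array([4.0, 1.0, 0.5])       # hypothetical teacher logits
aligned = np.array([3.8, 1.1, 0.4])       # student close to the teacher
misaligned = np.array([0.5, 4.0, 1.0])    # student far from the teacher

# A student that mimics the teacher gets a lower loss.
assert distillation_loss(aligned, teacher) < distillation_loss(misaligned, teacher)
```

The student inherits behavior the teacher paid dearly to learn, which is why a low headline training cost can hide an expensive upstream dependency.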

What technical detail does the transcript highlight about R1’s efficiency?

It credits R1 with performance optimization, including not using CUDA directly and instead “raw dog[ging]” its own version of it. The transcript frames this as a deliberate engineering choice to improve inference performance, implying that efficiency comes from substantial systems work rather than only from cheaper training.

How does the transcript explain why models may answer “who they are” incorrectly?

It attributes identity mistakes to next-token prediction: an LLM predicts the most likely next output token given the prompt. If the training data or internet jokes repeatedly associate a model with a particular identity (e.g., “ChatGPT”), the model may produce that answer even when it’s wrong. The transcript describes this as a self-fulfilling expectation: more variants of the joke increase the likelihood of the same continuation.
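The “self-fulfilling expectation” can be illustrated with a toy counting model: greedy next-token prediction simply emits the continuation seen most often in training, so if joke text outnumbers the true identity, the joke wins. (Hypothetical word-level corpus below; real LLMs use learned neural distributions over subword tokens, not raw counts.)

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word-level bigrams: a crude stand-in for next-token prediction."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Greedy decoding: emit the most frequent continuation seen in training."""
    return counts[prev].most_common(1)[0][0]

# Hypothetical corpus: the joke identity outnumbers the true one.
corpus = [
    "I am ChatGPT",
    "I am ChatGPT",
    "I am ChatGPT",
    "I am DeepSeek",
]
model = train_bigram(corpus)
print(predict_next(model, "am"))  # -> "ChatGPT": the likeliest continuation, not the true identity
```

More copies of the joke in the data raise the count, and with it the probability of the same wrong continuation.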

Review Questions

  1. How does the transcript connect OpenAI’s Terms of Service claims to alleged past training practices, and what does that imply about competitive fairness?
  2. What conditions does the transcript cite for copyright protection of generative AI outputs, and why does it consider the logic inconsistent?
  3. According to the transcript’s “no free lunch” argument, what must exist behind a low-cost model like R1?

Key Points

  1. OpenAI’s distillation-related complaints are portrayed as hypocritical when set against allegations that OpenAI trained on data and code with potentially incompatible licenses.

  2. A U.S. Copyright Office report is cited as saying generative AI outputs can be copyrighted only when a human author contributes sufficient expressive elements.

  3. The transcript argues that copyright rules become hard to reconcile with how generative models ingest and remix copyrighted material.

  4. DeepSeek R1’s low cost is framed as dependent on distilling from a much more expensive, multi-billion-dollar base model rather than being evidence that AI training is now trivial.

  5. R1 is credited with engineering optimizations, including building its own alternative to CUDA for performance.

  6. Odd “identity” answers are explained as expected behavior from next-token prediction, especially when internet expectations and jokes shape likely continuations.

Highlights

The transcript’s central charge is hypocrisy: OpenAI criticizes commercial distillation while allegedly benefiting from large-scale reuse of third-party code, images, and social data.
Copyright protection for generative AI outputs is described as conditional on human authorship, but the transcript treats that as logically at odds with training-time copying and remixing.
R1’s efficiency is framed as real engineering work and distillation economics—not proof that future models can be built cheaply from scratch.
LLM “who am I?” mistakes are attributed to next-token prediction and the way repeated internet associations can steer outputs.

Topics

  • OpenAI and DeepSeek rivalry
  • Copyright and AI outputs
  • Distillation and licensing
  • Model economics
  • Next-token prediction

Mentioned

  • GPL
  • LLM
  • CUDA