Raven - RWKV-7B RNN's LLM Strikes Back
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RWKV combines transformer-style training (parallelizable, linear-scaling attention) with RNN-like inference (stateful generation).
Briefing
RWKV is a rare attempt to bring RNNs back into the large-language-model conversation by fixing the two biggest pain points that pushed transformers to dominance: slow, hard-to-parallelize training and limited context handling. Instead of treating recurrent networks as a dead end, RWKV (Receptance Weighted Key Value) uses a transformer-style training formulation that enables massive parallelization, while switching to an RNN-like stateful mechanism during inference. The result is a model that can be trained efficiently like a transformer—its attention-like mechanism scales linearly with the number of tokens rather than quadratically—yet generate text with the long-horizon behavior associated with RNNs.
The core technical pitch is a “best of both worlds” design. During training, RWKV shares weights across two equivalent formulations—a parallel, transformer-like form and a recurrent form—so the computation can be parallelized across the sequence, the same property that makes transformer training efficient. During inference, it behaves more like a conventional RNN: each step updates a state vector and produces the next token, which the transcript links to the ability to handle much longer contexts than typical transformer limits (often cited around 2,000–4,000 tokens in common setups). That long-context angle matters because it targets a practical weakness of many transformer deployments, where context windows cap how far back the model can reliably condition generation.
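To make the recurrent-inference side concrete, the sketch below computes a simplified version of RWKV’s time-mixing term (the “WKV” quantity from the RWKV paper) one token at a time, carrying a small per-channel state between steps. This is a minimal NumPy illustration under the paper’s formulation, not the project’s actual kernel: it omits the numerical-stability rewrite the real code uses, and the names w, u, k, v follow the paper rather than the repository.

```python
import numpy as np

def wkv_recurrent(w, u, k, v):
    """Simplified RWKV time-mixing ("WKV") computed token by token.

    w : per-channel decay rate, shape (C,)  (larger w -> faster forgetting)
    u : per-channel bonus weight for the current token, shape (C,)
    k, v : keys and values for T tokens, shape (T, C)
    Returns the WKV output, shape (T, C).
    """
    T, C = k.shape
    num = np.zeros(C)   # running sum of exp(k_i) * v_i, decayed each step
    den = np.zeros(C)   # running sum of exp(k_i), decayed each step
    out = np.zeros((T, C))
    for t in range(T):
        # the current token contributes with an extra "bonus" u before it
        # is folded into the state
        cur = np.exp(u + k[t])
        out[t] = (num + cur * v[t]) / (den + cur)
        # decay the state and absorb the current token for future steps
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

# toy usage: 5 tokens, 4 channels
rng = np.random.default_rng(0)
print(wkv_recurrent(w=np.ones(4), u=np.zeros(4),
                    k=rng.normal(size=(5, 4)), v=rng.normal(size=(5, 4))).shape)
```

Because only the two running sums need to be carried forward, the per-token cost at inference stays constant no matter how far back the context extends, which is the property the transcript ties to longer effective context; during training the same quantity can be computed for every position in parallel.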
The project’s momentum also comes from real compute and funding. The transcript notes that training used substantial GPU resources—originally described as 64 A100s for roughly three months—and that later changes reduced the cost significantly. It also highlights that the work is driven largely by one person, with supporting documentation in blog posts and GitHub notebooks. A further practical detail: the code is Apache 2.0 open source, making it usable commercially “out of the box,” which lowers adoption friction for researchers and developers who want to experiment without licensing uncertainty.
On the experimentation side, the transcript walks through notebook runs in Google Colab using the author’s notebooks from GitHub, including a 1.5B model and a larger 3B option that may require paid Colab resources. The demos show that the base model can generate long, sometimes rambling outputs when prompted with instruction-like text, suggesting it isn’t fully instruction-tuned in the way many modern chat models are. A separate “chatbot” variant is described as hit-or-miss, sometimes answering factual prompts well but often drifting into completion-style repetition.
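For readers who want to reproduce the spirit of those notebook runs without the video’s Colab links, a minimal sketch using the Hugging Face transformers integration of RWKV is shown below. This is not the notebook from the video; the checkpoint id RWKV/rwkv-4-169m-pile is an assumption, and a 1.5B or 3B checkpoint can be substituted if the runtime has the memory.

```python
# Minimal sketch (assumes transformers >= 4.29, which ships native RWKV support).
# The Hub id below is an assumption; swap in a larger RWKV or Raven checkpoint
# as resources allow.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The raven is a large black bird that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```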
The most compelling results come from “Raven,” described as a 7B-class model fine-tuned on the Stanford Alpaca dataset. In the transcript, Raven produces coherent, fast responses on tasks like explaining what ravens are, writing a song, and interpreting a metaphor—performing comparably to other models discussed in the same timeframe. The overall takeaway is cautious but clear: RWKV and Raven offer a different direction from transformer-only scaling, even if the ecosystem’s hardware and tooling momentum still favors transformers. Still, the transcript frames RWKV as a model worth testing because it demonstrates that RNN-based language modeling can be trained and deployed in ways that address transformers’ bottlenecks—especially parallel training and longer context generation.
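Since Raven is described as fine-tuned on the Stanford Alpaca dataset, prompting it with an Alpaca-style instruction/response template is a reasonable starting point. The snippet below shows the standard Alpaca wording; whether Raven expects exactly this format rather than a lightly modified variant is an assumption worth checking against the model card.

```python
# Standard Stanford Alpaca prompt template; whether Raven was tuned on exactly
# this wording is an assumption -- verify against the model card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(instruction="Explain what a raven is in two sentences.")
print(prompt)
```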
Cornell Notes
RWKV is a language-model architecture that revives RNNs by combining transformer-style training with RNN-like inference. Training is arranged to allow massive parallelization, while inference updates a recurrent state so generation can use longer effective context than many transformer setups. The project reports substantial training compute (64 A100s for about three months) and later optimizations that reduced cost. Raven is a fine-tuned 7B-class variant trained on the Stanford Alpaca dataset, and it performs well on instruction-like tasks such as explaining ravens, writing a song, and interpreting metaphors. The practical message: RWKV/Raven provide a credible alternative to transformer-only scaling, especially for long-context behavior and efficient training.
- What problem made RNNs fall out of favor compared with transformers?
- How does RWKV try to recover transformer-like training efficiency?
- Why does RWKV inference resemble an RNN, and what does that buy?
- What does the transcript say about training scale and cost?
- How do the base model and Raven variants differ in behavior?
- What practical options does the transcript mention for trying these models?
Review Questions
- What specific design change lets RWKV train with transformer-like parallelization while still using RNN-like generation at inference?
- Why does the transcript suggest Raven’s outputs are more instruction-like than the base model’s outputs?
- How does the transcript connect RWKV’s inference mechanism to longer context windows, and what context limits does it compare against?
Key Points
1. RWKV combines transformer-style training (parallelizable, linear-scaling attention) with RNN-like inference (stateful generation).
2. The architecture is positioned as a way to address RNNs’ historical training bottlenecks without giving up long-context behavior.
3. Training reportedly used 64 A100 GPUs for about three months, with later changes reducing cost substantially.
4. RWKV is released under Apache 2.0, enabling commercial use without licensing friction.
5. The base model can produce very long, sometimes repetitive completions, suggesting limited instruction tuning.
6. Raven is a 7B-class model fine-tuned on the Stanford Alpaca dataset and shows stronger instruction-following in demos.
7. Raven/RWKV are presented as worth experimenting with despite the broader industry’s transformer-optimized hardware ecosystem.