Raven - RWKV-7B RNN's LLM Strikes Back
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RWKV combines transformer-style training (parallelizable, linear-scaling attention) with RNN-like inference (stateful generation).
Briefing
RWKV is a rare attempt to bring RNNs back into the large-language-model conversation by fixing the two biggest pain points that pushed transformers to dominance: slow, hard-to-parallelize training and limited context handling. Instead of treating recurrent networks as a dead end, RWKV (Receptance Weighted Key Value) uses a transformer-style training formulation that enables massive parallelization, while switching to an RNN-like stateful mechanism during inference. The result is a model that can be trained efficiently like a transformer—its attention-like mechanism scales linearly with the number of tokens rather than quadratically—yet generate text with the long-horizon behavior associated with RNNs.
The core technical pitch is a “best of both worlds” design. During training, RWKV shares weights across two equivalent formulations—a parallel, transformer-like form and a recurrent form—so the computation can be parallelized across the sequence, the same property that makes transformer training efficient. During inference, it behaves more like a conventional RNN: each step updates a state vector and produces the next token, which the transcript links to the ability to handle much longer contexts than typical transformer limits (often cited around 2,000–4,000 tokens in common setups). That long-context angle matters because it targets a practical weakness of many transformer deployments, where context windows cap how far back the model can reliably condition generation.
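To make the recurrent-inference side concrete, the sketch below computes a simplified version of RWKV’s time-mixing term (the “WKV” quantity from the RWKV paper) one token at a time, carrying a small per-channel state between steps. This is a minimal NumPy illustration under the paper’s formulation, not the project’s actual kernel: it omits the numerical-stability rewrite the real code uses, and the names w, u, k, v follow the paper rather than the repository.

```python
import numpy as np

def wkv_recurrent(w, u, k, v):
    """Simplified RWKV time-mixing ("WKV") computed token by token.

    w : per-channel decay rate, shape (C,)  (larger w -> faster forgetting)
    u : per-channel bonus weight for the current token, shape (C,)
    k, v : keys and values for T tokens, shape (T, C)
    Returns the WKV output, shape (T, C).
    """
    T, C = k.shape
    num = np.zeros(C)   # running sum of exp(k_i) * v_i, decayed each step
    den = np.zeros(C)   # running sum of exp(k_i), decayed each step
    out = np.zeros((T, C))
    for t in range(T):
        # the current token contributes with an extra "bonus" u before it
        # is folded into the state
        cur = np.exp(u + k[t])
        out[t] = (num + cur * v[t]) / (den + cur)
        # decay the state and absorb the current token for future steps
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

# toy usage: 5 tokens, 4 channels
rng = np.random.default_rng(0)
print(wkv_recurrent(w=np.ones(4), u=np.zeros(4),
                    k=rng.normal(size=(5, 4)), v=rng.normal(size=(5, 4))).shape)
```

Because only the two running sums need to be carried forward, the per-token cost at inference stays constant no matter how far back the context extends, which is the property the transcript ties to longer effective context; during training the same quantity can be computed for every position in parallel.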
The project’s momentum also comes from real compute and funding. The transcript notes that training used substantial GPU resources—originally described as 64 A100s for roughly three months—and that later changes reduced the cost significantly. It also highlights that the work is driven largely by one person, with supporting documentation in blog posts and GitHub notebooks. A further practical detail: the code is Apache 2.0 open source, making it usable commercially “out of the box,” which lowers adoption friction for researchers and developers who want to experiment without licensing uncertainty.
On the experimentation side, the transcript walks through notebook runs in Google Colab using the author’s notebooks from GitHub, including a 1.5B model and a larger 3B option that may require paid Colab resources. The demos show that the base model can generate long, sometimes rambling outputs when prompted with instruction-like text, suggesting it isn’t fully instruction-tuned in the way many modern chat models are. A separate “chatbot” variant is described as hit-or-miss, sometimes answering factual prompts well but often drifting into completion-style repetition.
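For readers who want to reproduce the spirit of those notebook runs without the video’s Colab links, a minimal sketch using the Hugging Face transformers integration of RWKV is shown below. This is not the notebook from the video; the checkpoint id RWKV/rwkv-4-169m-pile is an assumption, and a 1.5B or 3B checkpoint can be substituted if the runtime has the memory.

```python
# Minimal sketch (assumes transformers >= 4.29, which ships native RWKV support).
# The Hub id below is an assumption; swap in a larger RWKV or Raven checkpoint
# as resources allow.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The raven is a large black bird that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```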
The most compelling results come from “Raven,” described as a 7B-class model fine-tuned on the Stanford Alpaca dataset. In the transcript, Raven produces coherent, fast responses on tasks like explaining what ravens are, writing a song, and interpreting a metaphor—performing comparably to other models discussed in the same timeframe. The overall takeaway is cautious but clear: RWKV and Raven offer a different direction from transformer-only scaling, even if the ecosystem’s hardware and tooling momentum still favors transformers. Still, the transcript frames RWKV as a model worth testing because it demonstrates that RNN-based language modeling can be trained and deployed in ways that address transformers’ bottlenecks—especially parallel training and longer context generation.
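Since Raven is described as fine-tuned on the Stanford Alpaca dataset, prompting it with an Alpaca-style instruction/response template is a reasonable starting point. The snippet below shows the standard Alpaca wording; whether Raven expects exactly this format rather than a lightly modified variant is an assumption worth checking against the model card.

```python
# Standard Stanford Alpaca prompt template; whether Raven was tuned on exactly
# this wording is an assumption -- verify against the model card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(instruction="Explain what a raven is in two sentences.")
print(prompt)
```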
Cornell Notes
RWKV is a language-model architecture that revives RNNs by combining transformer-style training with RNN-like inference. Training is arranged to allow massive parallelization, while inference updates a recurrent state so generation can use longer effective context than many transformer setups. The project reports substantial training compute (64 A100s for about three months) and later optimizations that reduced cost. Raven is a fine-tuned 7B-class variant trained on the Stanford Alpaca dataset, and it performs well on instruction-like tasks such as explaining ravens, writing a song, and interpreting metaphors. The practical message: RWKV/Raven provide a credible alternative to transformer-only scaling, especially for long-context behavior and efficient training.
- What problem made RNNs fall out of favor compared with transformers?
- How does RWKV try to recover transformer-like training efficiency?
- Why does RWKV inference resemble an RNN, and what does that buy?
- What does the transcript say about training scale and cost?
- How do the base model and Raven variants differ in behavior?
- What practical options does the transcript mention for trying these models?
Review Questions
- What specific design change lets RWKV train with transformer-like parallelization while still using RNN-like generation at inference?
- Why does the transcript suggest Raven’s outputs are more instruction-like than the base model’s outputs?
- How does the transcript connect RWKV’s inference mechanism to longer context windows, and what context limits does it compare against?
Key Points
1. RWKV combines transformer-style training (parallelizable, linear-scaling attention) with RNN-like inference (stateful generation).
2. The architecture is positioned as a way to address RNNs’ historical training bottlenecks without giving up long-context behavior.
3. Training reportedly used 64 A100 GPUs for about three months, with later changes reducing cost substantially.
4. RWKV is released under Apache 2.0, enabling commercial use without licensing friction.
5. The base model can produce very long, sometimes repetitive completions, suggesting limited instruction tuning.
6. Raven is a 7B-class model fine-tuned on the Stanford Alpaca dataset and shows stronger instruction-following in demos.
7. Raven/RWKV are presented as worth experimenting with despite the broader industry’s transformer-optimized hardware ecosystem.