LIMA: Can you Fine-Tune Large Language Models (LLMs) with Small Datasets? Less Is More for Alignment
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Meta AI’s LIMA (“Less Is More for Alignment”) argues that strong alignment behavior in large language models can be achieved with surprisingly small amounts of high-quality instruction data—making fine-tuning far more practical than the usual “collect lots of examples” approach. Instead of relying on reinforcement learning from human feedback (RLHF) at scale, LIMA fine-tunes Meta’s LLaMA 65B model using supervised training on just 1,000 carefully selected prompt–answer pairs, and reports preference gains over well-known alternatives in human and GPT-4-based evaluations.
The training recipe starts from the standard two-stage path used for many chat models: pretraining via next-token prediction on large text corpora, followed by alignment. LIMA keeps the pretraining step but swaps the alignment stage. Rather than RLHF, where human raters score model responses and the model is trained to improve against those reward signals, LIMA performs supervised fine-tuning (SFT) on a small dataset. The core claim is the “Superficial Alignment Hypothesis”: most knowledge and general capability is learned during pretraining, while fine-tuning mainly teaches which response formats and behaviors to use when interacting with users. If that is true, then a small, well-curated instruction set can be enough to steer the model.
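The SFT objective described above can be sketched concretely. This is an illustrative sketch, not the paper's code: it assumes the common Hugging Face convention of masking prompt positions in the labels with -100 so that only answer tokens contribute to the next-token cross-entropy loss.

```python
# Supervised fine-tuning reuses the pretraining next-token objective, but
# supervises only the answer span. Prompt positions are masked out of the
# labels with a sentinel value (here the common -100 convention).
IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss


def build_sft_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer token ids; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy token ids: only the three answer tokens are supervised.
inp, lab = build_sft_labels([101, 7, 8], [42, 43, 102])
```

A training loop would then compute the loss only over positions whose label is not the sentinel, which is exactly how a small curated set can steer style without retraining the model's knowledge.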
The dataset is built around instruction-style question answering drawn largely from Stack Exchange (about 400 examples), plus WikiHow, Reddit writing prompts, the Natural Instructions dataset, and examples written by the authors themselves. The development set is also author-written, while evaluation uses an r/AskReddit test split assembled by a separate group to reduce, though not eliminate, the risk of bias from overlapping prompt sources. In total the training set contains 1,000 examples, including 200 prompts written by the authors with high-quality answers and 13 prompts involving toxicity or violence, included to broaden robustness.
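A manifest for such a curated set is easy to sanity-check in a few lines. The per-source counts below are illustrative placeholders, chosen only to be consistent with the figures given above (roughly 400 Stack Exchange examples and 200 author-written prompts within a 1,000-example total); the source names are hypothetical labels, not identifiers from the paper's release.

```python
# Hypothetical manifest for a LIMA-style curated training set.
# Counts are illustrative, chosen to match the 1,000-example total.
manifest = {
    "stack_exchange": 400,
    "wikihow": 200,
    "reddit_writing_prompts": 150,
    "natural_instructions": 50,
    "author_written": 200,
}


def total_examples(m):
    """Verify the curated set hits the intended size budget."""
    return sum(m.values())


size = total_examples(manifest)
```

The point of such a check is that with only 1,000 examples, every slot in the budget matters, so curation is tracked per source rather than by bulk scraping.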
LIMA’s fine-tuning also includes a practical modeling detail: a special end-of-turn (EOT) token is inserted at each user/assistant boundary so the model learns where one speaker’s utterance ends and the other begins. Standard hyperparameter tuning applies as well, but the transcript singles out this turn-boundary token as a key implementation choice.
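The turn-boundary formatting can be sketched as follows. The token string `<eot>` and the function name are assumptions for illustration; the paper describes repurposing a special token as the end-of-turn marker but this is not its released code.

```python
# Sketch of end-of-turn formatting: append a special marker after every
# utterance so the model can learn where each speaker stops.
EOT = "<eot>"  # hypothetical token string standing in for the EOT token


def format_dialogue(turns):
    """turns: list of (speaker, text) pairs -> single training string."""
    return "".join(f"{text}{EOT}" for _speaker, text in turns)


example = format_dialogue([
    ("user", "How do I sort a list in Python?"),
    ("assistant", "Use sorted(xs) for a copy or xs.sort() in place."),
])
```

At inference time, generation can then be stopped as soon as the model emits the marker, which is how the boundary signal translates into clean turn-taking.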
Evaluation relies on pairwise preference judgments. Annotators see a single prompt and two candidate responses from different models, then choose which response is better or mark a tie. Results are reported both from human judgments and from GPT-4 acting as an evaluator. Across comparisons, LIMA is frequently preferred over instruction-tuned baselines such as Alpaca 65B and DaVinci003; the paper’s abstract reports that LIMA’s responses are equivalent or strictly preferred to GPT-4 in 43% of cases, to Bard in 58%, and to DaVinci003 (a GPT-3-family model trained with RLHF) in 65%. Against GPT-4 itself, LIMA loses more often than it wins, yet it still takes a non-trivial share of comparisons.
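Tallying such judgments reduces to a small counting function. This is a minimal sketch of the statistic style the paper reports ("equivalent or strictly preferred"); the function and label names are illustrative, not from any evaluation harness.

```python
# Each pairwise judgment is "A" (model A preferred), "B", or "tie".
# The "equivalent or strictly preferred" rate for model A counts both
# outright wins and ties, matching how the abstract's figures are framed.
def preferred_or_tied_rate(judgments, model="A"):
    """Fraction of comparisons where `model` won or tied."""
    favorable = sum(1 for j in judgments if j == model or j == "tie")
    return favorable / len(judgments)


judgments = ["A", "B", "tie", "A", "B"]  # toy annotation data
rate = preferred_or_tied_rate(judgments)  # 3 of 5 favorable -> 0.6
```

Note that counting ties as favorable inflates the figure relative to a strict win rate, which is worth keeping in mind when reading headline percentages.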
In short, LIMA’s contribution is a cost-and-data-efficiency message: a strong, pretrained model like LLaMA 65B can be aligned to user-preferred behavior using only 1,000 high-quality examples, potentially reducing the time and expense of fine-tuning while still delivering measurable improvements over several popular instruction-tuned baselines.
Cornell Notes
LIMA (“Less Is More for Alignment”) claims that chat alignment can be achieved with a small, high-quality supervised dataset rather than large-scale RLHF. Starting from LLaMA 65B pretrained on massive text, it performs supervised fine-tuning on only 1,000 prompt–answer examples, using careful prompt construction and boundary-aware formatting (including a special end-of-turn token). The paper’s central hypothesis is that pretraining captures most general capabilities, while the small fine-tuning set mainly teaches the response style and interaction format users expect. In pairwise evaluations, LIMA is frequently preferred to Alpaca, DaVinci003 (GPT-3 with RLHF), and Bard by both humans and GPT-4-as-judge, though it trails GPT-4 itself. The practical takeaway is that “less data” can still produce meaningful alignment gains when the data is curated well.
What training shift does LIMA make compared with common RLHF-based alignment pipelines?
Why does LIMA believe a small dataset can still produce alignment improvements?
How is the LIMA fine-tuning dataset constructed, and what sources are used?
What evaluation method is used to compare LIMA against other models?
What implementation detail helps LIMA learn the chat turn structure?
Review Questions
- How does LIMA’s supervised fine-tuning approach differ from RLHF in terms of training signals and data requirements?
- What does the “alignment hypothesis” imply about where capabilities come from (pretraining vs. fine-tuning)?
- Why might pairwise preference evaluation (human and GPT-4-as-judge) produce different outcomes when comparing LIMA to GPT-4?
Key Points
1. LIMA targets alignment with only 1,000 supervised prompt–answer examples by fine-tuning LLaMA 65B, aiming to reduce the need for large RLHF datasets.
2. The approach rests on the idea that pretraining captures most capabilities, while fine-tuning mainly teaches response format and interaction behavior.
3. LIMA’s dataset is curated from sources including Stack Exchange, WikiHow, Reddit writing prompts, and Natural Instructions, with additional author-written examples and a small toxicity/violence subset.
4. A special end-of-turn token is inserted at each user/assistant boundary during fine-tuning to help the model learn turn-taking structure.
5. Pairwise preference evaluation is used: annotators (and separately GPT-4) choose which of two model responses is better or whether they tie.
6. LIMA shows strong preference wins over several instruction-tuned baselines (e.g., Alpaca 65B, DaVinci003, Bard) but trails GPT-4 in head-to-head comparisons.