LIMA: Can you Fine-Tune Large Language Models (LLMs) with Small Datasets? Less Is More for Alignment
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Meta AI’s LIMA (“Less Is More for Alignment”) argues that strong alignment behavior in large language models can be achieved with surprisingly small amounts of high-quality instruction data—making fine-tuning far more practical than the usual “collect lots of examples” approach. Instead of relying on reinforcement learning from human feedback (RLHF) at scale, LIMA fine-tunes Meta’s LLaMA 65B model using supervised training on just 1,000 carefully selected prompt–answer pairs, and reports preference gains over well-known alternatives in human and GPT-4-based evaluations.
The training recipe starts from the standard two-stage path used for many chat models: pretraining via next-token prediction on large text corpora, followed by alignment. LIMA keeps the pretraining step but swaps the alignment stage. Rather than RLHF, where human raters score model responses and the model is trained to improve against those reward signals, LIMA performs supervised fine-tuning (SFT) on a small dataset. The core claim is the “Superficial Alignment Hypothesis”: most knowledge and general capability is learned during pretraining, while fine-tuning mainly teaches which response formats and behaviors to use when interacting with users. If that is true, then a small, well-curated instruction set can be enough to steer the model.
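The SFT objective described above can be sketched concretely. This is an illustrative sketch, not the paper's code: it assumes the common Hugging Face convention of masking prompt positions in the labels with -100 so that only answer tokens contribute to the next-token cross-entropy loss.

```python
# Supervised fine-tuning reuses the pretraining next-token objective, but
# supervises only the answer span. Prompt positions are masked out of the
# labels with a sentinel value (here the common -100 convention).
IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss


def build_sft_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer token ids; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy token ids: only the three answer tokens are supervised.
inp, lab = build_sft_labels([101, 7, 8], [42, 43, 102])
```

A training loop would then compute the loss only over positions whose label is not the sentinel, which is exactly how a small curated set can steer style without retraining the model's knowledge.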
The dataset is built around instruction-style question answering drawn largely from Stack Exchange (about 400 examples), plus WikiHow, Reddit writing prompts, the Natural Instructions dataset, and examples written by the authors themselves. The development set is also author-written, while evaluation uses an r/AskReddit test split assembled by a separate group to reduce, though not eliminate, the risk of bias from overlapping prompt sources. In total the training set contains 1,000 examples, including 200 prompts written by the authors with high-quality answers and 13 prompts involving toxicity or violence, included to broaden robustness.
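A manifest for such a curated set is easy to sanity-check in a few lines. The per-source counts below are illustrative placeholders, chosen only to be consistent with the figures given above (roughly 400 Stack Exchange examples and 200 author-written prompts within a 1,000-example total); the source names are hypothetical labels, not identifiers from the paper's release.

```python
# Hypothetical manifest for a LIMA-style curated training set.
# Counts are illustrative, chosen to match the 1,000-example total.
manifest = {
    "stack_exchange": 400,
    "wikihow": 200,
    "reddit_writing_prompts": 150,
    "natural_instructions": 50,
    "author_written": 200,
}


def total_examples(m):
    """Verify the curated set hits the intended size budget."""
    return sum(m.values())


size = total_examples(manifest)
```

The point of such a check is that with only 1,000 examples, every slot in the budget matters, so curation is tracked per source rather than by bulk scraping.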
LIMA’s fine-tuning also includes a practical modeling detail: a special end-of-turn (EOT) token is inserted at each user/assistant boundary so the model learns where one speaker’s utterance ends and the other begins. Standard hyperparameter tuning applies as well, but the transcript singles out this turn-boundary token as a key implementation choice.
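The turn-boundary formatting can be sketched as follows. The token string `<eot>` and the function name are assumptions for illustration; the paper describes repurposing a special token as the end-of-turn marker but this is not its released code.

```python
# Sketch of end-of-turn formatting: append a special marker after every
# utterance so the model can learn where each speaker stops.
EOT = "<eot>"  # hypothetical token string standing in for the EOT token


def format_dialogue(turns):
    """turns: list of (speaker, text) pairs -> single training string."""
    return "".join(f"{text}{EOT}" for _speaker, text in turns)


example = format_dialogue([
    ("user", "How do I sort a list in Python?"),
    ("assistant", "Use sorted(xs) for a copy or xs.sort() in place."),
])
```

At inference time, generation can then be stopped as soon as the model emits the marker, which is how the boundary signal translates into clean turn-taking.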
Evaluation relies on pairwise preference judgments. Annotators see a single prompt and two candidate responses from different models, then choose which response is better or mark a tie. Results are reported both from human judgments and from GPT-4 acting as an evaluator. Across comparisons, LIMA is frequently preferred over instruction-tuned baselines such as Alpaca 65B and DaVinci003; the paper’s abstract reports that LIMA’s responses are equivalent or strictly preferred to GPT-4 in 43% of cases, to Bard in 58%, and to DaVinci003 (a GPT-3-family model trained with RLHF) in 65%. Against GPT-4 itself, LIMA loses more often than it wins, yet it still takes a non-trivial share of comparisons.
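Tallying such judgments reduces to a small counting function. This is a minimal sketch of the statistic style the paper reports ("equivalent or strictly preferred"); the function and label names are illustrative, not from any evaluation harness.

```python
# Each pairwise judgment is "A" (model A preferred), "B", or "tie".
# The "equivalent or strictly preferred" rate for model A counts both
# outright wins and ties, matching how the abstract's figures are framed.
def preferred_or_tied_rate(judgments, model="A"):
    """Fraction of comparisons where `model` won or tied."""
    favorable = sum(1 for j in judgments if j == model or j == "tie")
    return favorable / len(judgments)


judgments = ["A", "B", "tie", "A", "B"]  # toy annotation data
rate = preferred_or_tied_rate(judgments)  # 3 of 5 favorable -> 0.6
```

Note that counting ties as favorable inflates the figure relative to a strict win rate, which is worth keeping in mind when reading headline percentages.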
In short, LIMA’s contribution is a cost-and-data-efficiency message: a strong, pretrained model like LLaMA 65B can be aligned to user-preferred behavior using only 1,000 high-quality examples, potentially reducing the time and expense of fine-tuning while still delivering measurable improvements over several popular instruction-tuned baselines.
Cornell Notes
LIMA (“Less Is More for Alignment”) claims that chat alignment can be achieved with a small, high-quality supervised dataset rather than large-scale RLHF. Starting from LLaMA 65B pretrained on massive text, it performs supervised fine-tuning on only 1,000 prompt–answer examples, using careful prompt construction and boundary-aware formatting (including a special end-of-turn token). The paper’s central hypothesis is that pretraining captures most general capabilities, while the small fine-tuning set mainly teaches the response style and interaction format users expect. In pairwise evaluations, LIMA is frequently preferred to Alpaca, DaVinci003 (GPT-3 with RLHF), and Bard by both humans and GPT-4-as-judge, though it trails GPT-4 itself. The practical takeaway is that “less data” can still produce meaningful alignment gains when the data is curated well.
What training shift does LIMA make compared with common RLHF-based alignment pipelines?
Why does LIMA believe a small dataset can still produce alignment improvements?
How is the LIMA fine-tuning dataset constructed, and what sources are used?
What evaluation method is used to compare LIMA against other models?
What implementation detail helps LIMA learn the chat turn structure?
Review Questions
- How does LIMA’s supervised fine-tuning approach differ from RLHF in terms of training signals and data requirements?
- What does the “alignment hypothesis” imply about where capabilities come from (pretraining vs. fine-tuning)?
- Why might pairwise preference evaluation (human and GPT-4-as-judge) produce different outcomes when comparing LIMA to GPT-4?
Key Points
1. LIMA targets alignment with only 1,000 supervised prompt–answer examples by fine-tuning LLaMA 65B, aiming to reduce the need for large RLHF datasets.
2. The approach rests on the idea that pretraining captures most capabilities, while fine-tuning mainly teaches response format and interaction behavior.
3. LIMA’s dataset is curated from sources including Stack Exchange, WikiHow, Reddit writing prompts, and Natural Instructions, with additional author-written examples and a small toxicity/violence subset.
4. A special end-of-turn token is inserted at each user/assistant boundary during fine-tuning to help the model learn turn-taking structure.
5. Pairwise preference evaluation is used: annotators (and separately GPT-4) choose which of two model responses is better or whether they tie.
6. LIMA shows strong preference wins over several instruction-tuned baselines (e.g., Alpaca 65B, DaVinci003, Bard) but trails GPT-4 in head-to-head comparisons.