
Build a Small Language Model (SLM) From Scratch | Make it Your Personal Assistant | Tech Edge AI

Tech Edge AI-ML · 5 min read

Based on Tech Edge AI-ML's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

SLMs are practical because they achieve useful text generation with fewer than 1 billion parameters, reducing compute and memory requirements.

Briefing

Small language models (SLMs)—typically defined as language models with fewer than 1 billion parameters—are gaining attention because they deliver useful text generation with far lower compute and memory demands than large systems like GPT-3 (about 175 billion parameters). That efficiency makes SLMs practical for research prototypes, edge devices, and narrow, specialized assistants. The core takeaway is that building an SLM end-to-end is less about magic and more about a disciplined pipeline: choose the right data, convert it into tokenized training signals, train a compact transformer, then decode outputs with controlled sampling.

The build starts with data selection. For training, the project uses Tiny Stories, a synthetic dataset of short stories generated by GPT-3.5 and GPT-4. The dataset is intentionally simple—written in language a typical 3–4-year-old can understand—so a small model can still learn coherent English. Tiny Stories is hosted on Hugging Face and provides 2 million training samples plus 20,000 validation samples.
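As a rough sketch of this first step, the snippet below loads Tiny Stories with the Hugging Face datasets library. The dataset identifier roneneldan/TinyStories and the "text" field name are assumptions based on the public Hugging Face release, not details the video prescribes.

```python
# Sketch: pull Tiny Stories from Hugging Face.
# The dataset id "roneneldan/TinyStories" and the "text" field are assumptions
# based on the public release; the video's exact source may differ.
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories")
print(dataset)                       # splits: train (~2M stories), validation (~20K)
print(dataset["train"][0]["text"])   # one short, simple story
```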

Next comes data preparation, beginning with tokenization. Text is converted into numerical token IDs using the GPT-2 subword tokenizer, which splits words into smaller pieces and assigns each piece a unique ID. Each story becomes a sequence of token IDs, stored in a dictionary-like structure. Those IDs are then merged into a large corpus for efficient downstream processing.
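One way to perform this step is with the tiktoken implementation of the GPT-2 tokenizer; the snippet below is a minimal sketch, and the video may use a different tokenizer library.

```python
# Sketch: GPT-2 subword tokenization with tiktoken (one possible implementation).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

story = "Once upon a time, there was a little cat."
ids = enc.encode_ordinary(story)     # list of token IDs, one per subword piece
print(ids)
print(enc.decode(ids))               # decodes back to the original text
```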

Because tokenized corpora can exceed RAM limits, the pipeline stores the token IDs in binary files on disk and uses numpy’s memmap to map large files directly into memory. This enables training on datasets larger than available RAM while still supporting fast batching. The result is two prepared binary datasets—train.bin and validation.bin—containing millions of token IDs.
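A minimal sketch of that storage pattern, assuming uint16 token IDs (the GPT-2 vocabulary of 50,257 fits in 16 bits) and the train.bin file name mentioned above:

```python
# Sketch: persist token IDs as a binary file, then memory-map it for training.
import numpy as np

token_ids = [7454, 2402, 257, 640, 11]   # toy corpus; in practice, millions of IDs
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(len(token_ids),))
arr[:] = np.array(token_ids, dtype=np.uint16)
arr.flush()                               # make sure the data hits disk

# During training, map the file lazily instead of loading it all into RAM.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
print(data[:5])
```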

Training data is then reshaped into next-token prediction pairs. Token sequences are chunked so that each input chunk predicts the same sequence shifted by one token. Batching groups many input-output pairs together, improving throughput and stabilizing learning by averaging gradients across samples.
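A hedged sketch of how such batches might be sampled from the memmapped token array; block_size and batch_size here are illustrative values, not the video's settings.

```python
# Sketch: build (input, target) batches where targets are inputs shifted by one token.
import numpy as np
import torch

def get_batch(data, block_size=128, batch_size=32):
    # Random starting offsets into the token stream.
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y   # y is x shifted one position to the right
```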

The model itself is a compact transformer. Tokens are mapped to 768-dimensional embeddings, and positional embeddings are added so the model can reason about order. Each transformer block combines multi-head self-attention (to capture relationships across the sequence) with feed-forward layers (to learn nonlinear transformations). Residual connections help gradients flow, while dropout reduces overfitting. Stacking multiple transformer blocks builds the full architecture, followed by layer normalization and a final linear projection that produces logits over the vocabulary.
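A minimal PyTorch sketch of one such block is shown below; the pre-norm layout, GELU activation, and use of nn.MultiheadAttention are illustrative assumptions, and a causal attention mask would be supplied in practice.

```python
# Sketch of a single transformer block: attention + feed-forward, each wrapped in a
# residual connection, plus dropout. Pre-norm layout is an assumption.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):      # attn_mask: causal mask in practice
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```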

Configuration focuses on practical capacity controls: vocab size, block size (context length), number of layers, number of attention heads, embedding dimension, and dropout. Training uses an optimizer (AdamW) with gradient accumulation, a learning-rate schedule with warm-up and decay, cross-entropy loss across all tokens, and gradient clipping to prevent exploding gradients. The best checkpoint is selected using validation loss.
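A sketch of what that training loop might look like, assuming a model that maps token IDs to logits and the get_batch helper from the earlier sketch; all hyperparameters are illustrative.

```python
# Sketch: AdamW + gradient accumulation + warm-up/cosine decay + gradient clipping.
import math
import torch
import torch.nn.functional as F

def lr_at(step, warmup=1000, max_steps=20000, max_lr=3e-4, min_lr=3e-5):
    if step < warmup:                                    # linear warm-up
        return max_lr * step / warmup
    t = (step - warmup) / (max_steps - warmup)           # then cosine decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

def train(model, train_data, max_steps=20000, accum_steps=4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    for step in range(max_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(step, max_steps=max_steps)
        for _ in range(accum_steps):                     # gradient accumulation
            x, y = get_batch(train_data)                 # pairs from the earlier sketch
            logits = model(x)                            # (B, T, vocab_size)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            (loss / accum_steps).backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # tame exploding gradients
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```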

After training, inference generates text by predicting the next token repeatedly. The model loads the best saved parameters and calls a generate function that applies temperature and top-k sampling to tune creativity and diversity. The end result is a small, efficient SLM that can produce coherent English—built from scratch through data engineering, transformer training, and controlled decoding.
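The decoding step might look like the sketch below, assuming the model returns logits of shape (batch, sequence, vocab); the video's actual generate function may differ in details.

```python
# Sketch: autoregressive decoding with temperature scaling and top-k filtering.
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, temperature=0.8, top_k=40, block_size=128):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                   # crop to the context window
        logits = model(idx_cond)[:, -1, :]                # logits at the last position
        logits = logits / temperature                     # higher temperature => more random
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")   # keep only the k most likely tokens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1) # sample one token
        idx = torch.cat([idx, next_id], dim=1)            # append and continue
    return idx
```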

Cornell Notes

The pipeline builds a small language model (SLM) by training a compact transformer on next-token prediction. It starts with Tiny Stories (2M train, 20K validation), a synthetic dataset written in simple English, then tokenizes it using the GPT-2 subword tokenizer. Token IDs are stored efficiently on disk and accessed via numpy memmap so training can scale beyond RAM limits. Training uses input-output pairs created by shifting token sequences, batched for speed and gradient stability, with 768-dimensional token embeddings plus positional embeddings. A stack of transformer blocks—multi-head self-attention, feed-forward layers, residual connections, and dropout—learns language patterns, and inference generates text using temperature and top-k sampling.

Why choose Tiny Stories for training a small language model, and what does its structure enable?

Tiny Stories is synthetic short-story data generated by GPT-3.5 and GPT-4, but it’s intentionally written in language suited to a typical 3–4-year-old. That simplicity helps a model with fewer parameters still learn coherent English. The dataset is available on Hugging Face with 2 million training samples and 20,000 validation samples, giving enough variety for learning while keeping the task approachable for an SLM.

How does tokenization turn raw text into something a neural network can train on?

Tokenization converts text into numerical token IDs. Here, the GPT-2 subword tokenizer splits text into smaller pieces (subwords) and assigns each piece a unique ID. Each story becomes a sequence of token IDs, which are stored and merged into a corpus. This matters because transformer layers operate on numbers (embeddings derived from token IDs), not raw words.

What problem does numpy memmap solve during training, and how is it used?

Tokenized datasets can be too large to fit in RAM. Instead of keeping everything in memory, the pipeline writes token IDs to binary files on disk and uses numpy’s memmap to map those files into memory as needed. This allows processing datasets larger than available RAM and supports efficient batching. The prepared outputs are train.bin and validation.bin, each containing millions of token IDs.

What training objective is used, and how are input-output pairs constructed?

The model trains with next-token prediction. Token sequences are chunked so that each input chunk predicts the same sequence shifted by one token: the output is the input tokens shifted one step forward. This setup teaches the transformer to model context and learn patterns needed to generate the next word or subword.

What are the key components inside each transformer block, and what roles do they play?

Each transformer block combines multi-head self-attention and feed-forward neural networks. Self-attention lets the model focus on different parts of the sequence to capture dependencies regardless of position. The feed-forward network applies additional nonlinear transformations to refine representations. Residual (shortcut) connections help gradients flow, and dropout reduces overfitting. Multi-head attention splits attention into multiple heads so different heads can learn different relationships (e.g., syntax vs. semantics).
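To make the "multiple heads" idea concrete, here is a small shape-level sketch: 12 heads of 64 dimensions each, consistent with the 768-dimensional embedding mentioned earlier. The causal mask is omitted for brevity, and the specific head count is an assumption.

```python
# Sketch: splitting a 768-d embedding into 12 attention heads of 64 dims each.
import torch

B, T, n_embd, n_head = 4, 128, 768, 12
head_dim = n_embd // n_head                              # 64 dims per head

x = torch.randn(B, T, n_embd)
qkv = torch.nn.Linear(n_embd, 3 * n_embd)(x)             # project to queries, keys, values
q, k, v = qkv.split(n_embd, dim=-1)

def split_heads(t):
    return t.view(B, T, n_head, head_dim).transpose(1, 2)   # (B, n_head, T, head_dim)

q, k, v = split_heads(q), split_heads(k), split_heads(v)
att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5        # per-head attention scores
out = torch.softmax(att, dim=-1) @ v                     # weighted sum of values per head
out = out.transpose(1, 2).reshape(B, T, n_embd)          # merge heads back to 768 dims
```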

How does inference control the style and diversity of generated text?

During inference, the model repeatedly predicts the next token given the growing sequence using the generate function. Temperature adjusts randomness in the probability distribution, while top-k sampling restricts choices to the k most likely tokens. Together, they control how conservative or creative the output becomes while still maintaining coherence learned during training.

Review Questions

  1. What changes in the model’s behavior when the block size (context length) or the number of transformer layers (N_layer) is increased?
  2. Explain how next-token prediction training data is formed from a token sequence and why shifting by one token is essential.
  3. Why are gradient clipping and a learning-rate schedule with warm-up and decay important for stable transformer training?

Key Points

  1. SLMs are practical because they achieve useful text generation with fewer than 1 billion parameters, reducing compute and memory requirements.
  2. Tiny Stories provides simple synthetic training text (2M training, 20K validation) that helps small models learn coherent English.
  3. Tokenization with the GPT-2 subword tokenizer converts stories into token ID sequences suitable for transformer training.
  4. numpy memmap enables training on tokenized corpora stored as binary files without requiring the entire dataset in RAM.
  5. Next-token prediction is implemented by chunking token sequences and shifting outputs by one token.
  6. A compact transformer stack—multi-head self-attention, feed-forward layers, residual connections, and dropout—learns language patterns from batched cross-entropy loss.
  7. Inference uses temperature and top-k sampling to control randomness and diversity during text generation.

Highlights

Tiny Stories is designed for simplicity—language aimed at a 3–4-year-old—making it a good fit for training models with limited capacity.
Storing tokenized data as binary files and using numpy memmap lets training scale beyond RAM limits while still supporting batching.
Training is framed as next-token prediction: each input chunk predicts the same sequence shifted by one token.
The combination of multi-head self-attention and feed-forward layers inside stacked transformer blocks is the core mechanism for learning context and generating coherent text.
Temperature and top-k sampling directly shape how creative versus conservative the generated English becomes.

Topics

Mentioned

  • SLM
  • GPT-3
  • GPT-4
  • GPT-2
  • RAM
  • AdamW