
Fine-tuning LLMs with PEFT and LoRA

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Full fine-tuning scales poorly because it increases both compute requirements and checkpoint sizes as model weights grow.

Briefing

Fine-tuning large language models is expensive because it requires updating massive weight tensors, which drives up both compute needs and checkpoint sizes. Parameter-efficient fine-tuning (PEFT) tackles both issues by freezing most of a pre-trained model’s parameters and training only a small set of added weights—most notably through LoRA (low-rank adaptation). The practical payoff is that training becomes feasible on more modest hardware, checkpoints shrink from tens of gigabytes to tiny adapter files, and the model is less prone to catastrophic forgetting because the original weights remain intact.

LoRA works by inserting trainable adapter weights at selected points in the network while leaving the original model weights fixed. That design reduces the amount of data and compute required for adaptation, and it can preserve the model’s baseline capabilities even when fine-tuning runs for longer or on small datasets. The transcript also links PEFT’s benefits to better generalization in new scenarios and notes that this approach is increasingly used beyond language models, including image-generation systems like Stable Diffusion.
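The mechanism can be sketched in a few lines of PyTorch. The snippet below is an illustrative toy rather than the video's code: a frozen linear layer is wrapped so that only a small low-rank correction B·A, scaled by alpha/r, receives gradients.

```python
# Illustrative LoRA sketch (not the video's code): wrap a frozen linear layer
# with a trainable low-rank update B @ A scaled by alpha / r.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # original weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank correction; only A and B get gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Initializing B to zeros means the adapter begins as a pass-through, so the model's behavior is unchanged until training updates the low-rank matrices.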

The walkthrough then shifts from concepts to implementation using Hugging Face tooling. Hugging Face released a PEFT library that consolidates multiple research-backed techniques and integrates with the transformers and accelerate ecosystems, enabling users to take off-the-shelf models from organizations such as Google and Meta and fine-tune them efficiently. The example focuses on LoRA fine-tuning a Bloom model—specifically “bloom 7 billion”—while using bitsandbytes to load the model in 8-bit. That quantization reduces GPU RAM usage and lightens both training and storage, making the workflow more accessible on hardware such as a T4 (with smaller model variants like Bloom 760M or 1.3B mentioned as alternatives).
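A minimal loading sketch in that spirit is shown below; the model id and the 8-bit flag follow the standard transformers/bitsandbytes pattern and are assumptions about the notebook rather than its exact cells.

```python
# Sketch: load a BLOOM checkpoint with 8-bit weights via bitsandbytes.
# (Newer transformers versions express this through BitsAndBytesConfig.)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-7b1"          # smaller variants also work on modest GPUs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,    # int8 weights cut GPU RAM roughly in half versus fp16
    device_map="auto",    # let accelerate place layers on the available device(s)
)
```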

Before training, the code freezes the original weights, with special handling for stability: small parameters such as the layer-norm weights are cast to float32. The LoRA configuration becomes the key control panel. It includes the rank r of the adapter matrices, the alpha scaling factor, LoRA dropout, and, critically, the task type (causal language modeling for decoder-only, GPT-style models versus encoder-decoder setups like T5/FLAN). Those settings determine how many parameters remain trainable; the transcript emphasizes that for a 7B model, the trainable portion is “tiny” compared with the full parameter count.
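Continuing the sketch with the peft library, the freezing step and the LoRA configuration look roughly like this; the hyperparameter values are placeholders rather than the video's exact settings.

```python
# Freeze the base model, then attach LoRA adapters (placeholder hyperparameters).
import torch
from peft import LoraConfig, get_peft_model

for param in model.parameters():
    param.requires_grad = False            # original weights are frozen
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)  # keep layer-norm-style params in fp32

config = LoraConfig(
    r=16,                   # rank of the low-rank adapter matrices
    lora_alpha=32,          # alpha scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",  # decoder-only; SEQ_2_SEQ_LM would target T5/FLAN-style models
)
model = get_peft_model(model, config)
model.print_trainable_parameters()          # reports the "tiny" trainable fraction
```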

For data, the example uses a small custom dataset derived from English quotes. Instead of training the model to complete quotes, it constructs prompts that pair a quote with tag labels (e.g., “be yourself,” “honesty,” “inspirational”), using a special delimiter sequence chosen to rarely appear in pretraining. Training then runs with standard Hugging Face training arguments, including gradient accumulation to simulate larger effective batch sizes on limited GPUs, plus a warmup schedule to ramp the learning rate gradually.
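A rough sketch of that prompt construction follows; the dataset id and the "->:" delimiter are illustrative assumptions, and the tokenizer comes from the loading sketch above.

```python
# Build "quote ->: tags" training prompts (dataset id and delimiter are assumptions).
from datasets import load_dataset

train_data = load_dataset("Abirate/english_quotes", split="train")

def to_prompt(example):
    # e.g. '"Be yourself; everyone else is already taken." ->: [be-yourself, honesty]'
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

train_data = train_data.map(to_prompt)
train_data = train_data.map(lambda batch: tokenizer(batch["prediction"]), batched=True)
```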

After training, only the LoRA adapter weights are uploaded to the Hugging Face Hub, producing a checkpoint measured in megabytes rather than gigabytes. Inference loads the base Bloom model plus the trained adapters, then generates tags conditioned on the input quote. Early results show keyword capture but also repetition and occasional failure to learn the intended mapping fully—an outcome attributed to the short toy training run—while still demonstrating that LoRA-based PEFT can turn a large model into a task-specific generator with minimal storage overhead.
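Uploading the trained adapters is a single call on the PEFT-wrapped model; the repository name below is illustrative.

```python
# Push only the LoRA adapter weights (a few tens of MB) to the Hugging Face Hub.
model.push_to_hub("your-username/bloom-7b1-lora-quote-tags")  # illustrative repo name
```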

Cornell Notes

PEFT makes fine-tuning large language models practical by freezing most original weights and training only a small set of added parameters. LoRA (low-rank adaptation) is the main technique used here: adapters are inserted into the model, while the base model stays fixed, reducing compute and checkpoint size and helping avoid catastrophic forgetting. The example fine-tunes “bloom 7 billion” for a causal language modeling task using Hugging Face’s PEFT library plus bitsandbytes 8-bit loading. The LoRA configuration (rank, alpha scaling, dropout, and task type) controls how many parameters become trainable. The resulting adapter checkpoint uploads to the Hugging Face Hub as a tiny file (tens of megabytes), and inference combines the base model with the adapters to generate tags from input quotes.

Why does conventional fine-tuning become costly as models grow?

The transcript points to two linked problems: compute and storage. Updating full model weights requires much more GPU compute as parameter counts rise, often pushing users toward multi-GPU setups. It also inflates checkpoint sizes—e.g., a T5-XL checkpoint is described as ~40GB—so saving and distributing fine-tuned models becomes burdensome, especially as newer models reach 20B+ parameters and continue scaling.
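A quick back-of-the-envelope calculation shows where the gigabytes come from: every float32 parameter costs four bytes before any optimizer state is counted.

```python
# Rough checkpoint-size arithmetic for a 7B-parameter model stored in float32.
params = 7_000_000_000
bytes_per_param = 4                                   # float32
print(f"~{params * bytes_per_param / 1e9:.0f} GB")    # ~28 GB, excluding optimizer state
```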

How does LoRA reduce training cost without retraining the whole model?

LoRA adds trainable adapter weights while freezing the original pre-trained weights. Instead of backpropagating through and updating every parameter tensor, training focuses on a small number of extra weights inserted at selected locations (configured via LoRA settings). Because the base weights remain unchanged, the approach can preserve original capabilities and reduce catastrophic forgetting compared with full fine-tuning.
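The parameter arithmetic makes the savings concrete: a rank-r adapter on a d_out × d_in weight matrix adds r · (d_in + d_out) trainable values instead of touching all d_in · d_out. The dimensions below are illustrative, not taken from the video.

```python
# Illustrative trainable-parameter count for one LoRA-adapted layer.
d_in, d_out, r = 4096, 4096, 16
full_weights = d_in * d_out          # 16,777,216 frozen weights in the original matrix
lora_weights = r * (d_in + d_out)    # 131,072 trainable adapter weights
print(f"{100 * lora_weights / full_weights:.2f}% of the layer is trainable")  # ~0.78%
```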

What role do Hugging Face’s PEFT and related libraries play in the workflow?

Hugging Face provides a PEFT library that implements multiple parameter-efficient methods and integrates with transformers and accelerate. That integration lets users load off-the-shelf models (e.g., Bloom) and apply LoRA adapters using standard training pipelines. The transcript also uses bitsandbytes to quantize the model to 8-bit during loading, reducing GPU RAM usage and making the example runnable on smaller hardware.

Which LoRA configuration choices most affect how much gets trained?

The transcript highlights several LoRA config parameters: the rank r of the adapter matrices, alpha scaling, LoRA dropout, and task type. Task type matters because it determines whether the setup targets decoder-only causal language modeling (GPT-style) or encoder-decoder tasks like T5/FLAN. Adjusting these settings changes the size of the trainable parameter set; the example notes that for a 7B model, the trainable parameters are “tiny” relative to the full model.

How does the example handle limited GPU memory during training?

It uses gradient accumulation to increase the effective batch size. Instead of one large batch that would not fit in memory, the code runs several small forward/backward passes (e.g., four examples at a time) and accumulates the gradients before applying a single update; with four accumulation steps this is equivalent to a batch size of 16, the figure the transcript gives. It also uses a warmup schedule so the learning rate ramps up gradually rather than starting at its full value.
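In Hugging Face terms, that setup maps onto a Trainer whose arguments look roughly like the sketch below; the batch size, accumulation steps, and warmup mirror the description above, while the remaining values are placeholders.

```python
# Training sketch: small per-device batches accumulated to an effective batch of 16.
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,    # what fits in GPU memory at once
        gradient_accumulation_steps=4,    # 4 x 4 = effective batch size of 16
        warmup_steps=100,                 # ramp the learning rate up gradually
        max_steps=200,                    # placeholder: short toy run
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False            # avoid cache warnings; re-enable for inference
trainer.train()
```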

What does the adapter checkpoint upload to the Hugging Face Hub include, and why is it small?

Only the LoRA adapter weights are uploaded, not the full base model. Because the base model weights stay frozen and are not part of the trained artifact, the checkpoint is tiny—on the order of tens of megabytes (the transcript mentions ~31MB). During inference, the system loads the base model and tokenizer and then downloads the adapter weights to assemble the fine-tuned behavior.
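A minimal inference sketch, assuming the illustrative model and adapter ids from earlier, loads the frozen base model and attaches the small adapter before generating.

```python
# Inference sketch: base model + downloaded LoRA adapters (illustrative ids).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "bigscience/bloom-7b1"
base = AutoModelForCausalLM.from_pretrained(base_id, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, "your-username/bloom-7b1-lora-quote-tags")

batch = tokenizer('"Training models is fun!" ->: ', return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**batch, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```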

Review Questions

  1. What two main bottlenecks make full fine-tuning of large language models difficult, and how does PEFT address each?
  2. In LoRA, what is frozen and what is trained, and how does that design relate to catastrophic forgetting?
  3. Which LoRA configuration fields in the example most directly change the number of trainable parameters, and why does task type matter?

Key Points

  1. Full fine-tuning scales poorly because it increases both compute requirements and checkpoint sizes as model weights grow.

  2. PEFT reduces cost by freezing most pre-trained parameters and training only a small set of added weights.

  3. LoRA implements PEFT by training low-rank adapter weights while keeping the original model weights fixed, which can lessen catastrophic forgetting.

  4. Hugging Face’s PEFT library integrates with transformers/accelerate, enabling LoRA fine-tuning of off-the-shelf models like Bloom.

  5. bitsandbytes 8-bit loading lowers GPU RAM usage, making LoRA fine-tuning more feasible on limited hardware.

  6. LoRA configuration choices (rank, alpha scaling, dropout, task type) determine how many parameters remain trainable.

  7. Uploading only adapter weights to the Hugging Face Hub yields checkpoints measured in megabytes rather than gigabytes.

Highlights

LoRA turns fine-tuning into “train the add-ons, freeze the base,” shrinking what must be saved and updated.
Checkpoint sizes drop dramatically because only adapter weights are uploaded—tens of megabytes instead of tens of gigabytes.
Task type (causal decoder-only vs encoder-decoder) is a decisive LoRA setting that changes how the adapters are applied.
Gradient accumulation lets users simulate larger effective batch sizes when GPU memory is limited.
Short toy training can still demonstrate tag generation, but repetition and incomplete learning can appear without longer training.

Topics

Mentioned

  • PEFT
  • LoRA
  • GPU
  • T5
  • GPT
  • FLAN
  • T4
  • 8-bit
  • GPU RAM