Parameter-Efficient Fine-Tuning
Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Parameter-efficient fine-tuning freezes the pretrained Transformer backbone and trains only small task-specific parameters, reducing storage and distribution costs across many tasks.
Briefing
Parameter-efficient fine-tuning is presented as a practical way to adapt large Transformer and language models to new tasks without retraining the full weight set. Instead of updating every parameter with gradient descent, methods insert small, task-specific components—adapters, low-rank updates, or learned “prefix” key/value states—while freezing the original model. The payoff is lower storage and faster deployment of task specializations, since each new task can be shipped as a small parameter bundle rather than a full model checkpoint.
The baseline concept of fine-tuning is laid out first: start from a pretrained model, swap the output head if the label space differs, and then train with gradient descent while freezing some layers (often the entire backbone at first, then gradually unfreezing). Full fine-tuning yields a new complete parameter set, but it is expensive in compute and, crucially for many real-world deployments, in model distribution and storage.
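As a concrete illustration of this baseline, here is a minimal PyTorch sketch using Hugging Face Transformers; the model name, label count, and the decision to unfreeze only the top encoder block are placeholders for illustration:

```python
import torch.nn as nn
from transformers import AutoModel

# Hypothetical setup: a pretrained encoder and a new 3-label target task.
backbone = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(backbone.config.hidden_size, 3)  # new output head for the new label space

# Freeze the entire backbone; only the new head receives gradients.
for p in backbone.parameters():
    p.requires_grad = False

# "Gradual unfreezing": later, re-enable gradients for the top encoder block.
# (The .encoder.layer attribute path is BERT-specific.)
for p in backbone.encoder.layer[-1].parameters():
    p.requires_grad = True
```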
The talk then pivots to the 2019 adapter approach for Transformers. Adapters are small bottleneck networks inserted into Transformer blocks (commonly around the feed-forward sublayer, sometimes also near attention). They include a skip connection so the adapter can behave like an identity mapping when needed, which helps preserve the original model's behavior. During training, only the adapter parameters are updated; the rest of the Transformer remains frozen. Experiments on GLUE benchmark tasks are used to argue that adapter tuning can reach performance comparable to full fine-tuning while training far fewer parameters. The discussion emphasizes that the comparison is often framed in terms of parameter count (storage and transfer cost) more than raw compute, since the frozen backbone still runs during every forward pass.
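A minimal PyTorch sketch of such a bottleneck adapter (the bottleneck width is illustrative; real implementations vary in placement and normalization). Zero-initializing the up-projection makes the module an exact identity at the start of training, which is exactly the role the skip connection enables:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a skip connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # near-identity at init: output starts as x + 0
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Skip connection preserves the original signal; the adapter learns a residual correction.
        return x + self.up(torch.relu(self.down(x)))
```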
From there, the ecosystem expands. AdapterHub is described as a standardized way to publish and reuse these task-specific adapter modules, built by adding “hooks” into the Hugging Face Transformers library so adapters can be inserted into the residual stream (see the sketch below). The talk also surveys several parameter-efficient alternatives.
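A rough usage sketch with AdapterHub's adapters library; the adapter name is a placeholder, and the exact API surface varies across library versions, so treat this as an assumed workflow rather than a verified recipe:

```python
from adapters import AutoAdapterModel  # AdapterHub's library on top of Hugging Face Transformers

model = AutoAdapterModel.from_pretrained("bert-base-uncased")
model.add_adapter("my_task")          # insert bottleneck adapters into each Transformer block
model.train_adapter("my_task")        # freeze the backbone; make only the adapter trainable
model.set_active_adapters("my_task")  # route the forward pass through the new adapter
# ...train as usual; afterwards, only the small adapter needs to be saved and shared:
model.save_adapter("./my_task_adapter", "my_task")
```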
LoRA (low-rank adaptation) decomposes weight updates into low-rank matrices, reducing the number of trainable parameters while still targeting key transformations (originally discussed for attention projections, but viewed as broadly applicable).

Prefix tuning is framed as a different mechanism: rather than inserting a new module into the network, it learns extra “virtual” past key/value states that are prepended to the attention cache. These learned key/value vectors steer generation for a task while leaving the Transformer weights untouched. Both mechanisms, low-rank updates and learned prefixes, are sketched after this paragraph.

Prompt tuning and P-tuning are positioned as related variants that place learnable embeddings earlier in the pipeline (prompt/prefix learning), with P-tuning using an LSTM-based prompt encoder to generate task-specific virtual tokens.
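Minimal PyTorch sketches of the two mechanisms, under illustrative assumptions (single-head attention, arbitrary rank and prefix length; real implementations are multi-headed and more careful about initialization and scaling):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

class PrefixAttention(nn.Module):
    """Single-head self-attention whose keys/values are extended with learned prefix states."""
    def __init__(self, d_model: int, n_prefix: int = 10):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Only these prefixes are trained; the projections above would be frozen in practice.
        self.prefix_k = nn.Parameter(torch.randn(n_prefix, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(n_prefix, d_model) * 0.02)

    def forward(self, x):  # x: (batch, seq, d_model)
        b = x.size(0)
        q = self.q(x)
        # Prepend the learned "virtual" key/value states, as if they were cached past tokens.
        k = torch.cat([self.prefix_k.expand(b, -1, -1), self.k(x)], dim=1)
        v = torch.cat([self.prefix_v.expand(b, -1, -1), self.v(x)], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v
```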
The session closes with additional lightweight strategies such as IA3-style modulation, which adjusts intermediate activations using small learned scaling parameters instead of adding new sub-networks (see the sketch below). Throughout, the central theme is modular specialization: large models can be adapted to many tasks by shipping small, composable parameter sets, which improves practicality for multi-task systems where loading a full fine-tuned model per task would be too costly.
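A minimal sketch of IA3-style scaling; where the vector is applied (e.g., to attention keys/values or feed-forward activations) is an assumption for illustration:

```python
import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    """IA3-style modulation: rescale an activation with a learned per-dimension vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))  # ones init: exact identity at the start

    def forward(self, h):  # h: (..., dim), e.g. attention keys/values or FFN activations
        return h * self.scale
```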
Cornell Notes
Parameter-efficient fine-tuning adapts large Transformer models to new tasks by training only a small set of task-specific parameters while freezing the original backbone. The classic baseline is full fine-tuning, where layers are unfrozen and updated with gradient descent, but that approach is costly to store and distribute for many tasks. Adapters (2019) insert small bottleneck networks with skip connections into Transformer blocks; training updates only those adapter weights and can match full fine-tuning performance on benchmarks like GLUE with far fewer trainable parameters. Prefix tuning and related prompt-tuning/P-tuning methods avoid inserting new modules by learning task-specific “virtual” past key/value states that steer attention during generation. The practical motivation is modularity: each new task can be packaged as a small parameter update (e.g., via AdapterHub) rather than a full model checkpoint.
- How does full fine-tuning differ from parameter-efficient fine-tuning in what gets updated?
- What is the core mechanism of the adapter approach (including why a skip connection matters)?
- Why is AdapterHub important in practice?
- How does prefix tuning steer a Transformer without changing its weights?
- What distinguishes prompt tuning, prefix tuning, and P-tuning?
- What is IA3-style modulation in this landscape?
Review Questions
- When adapting a pretrained model to a new label space, which parts of the architecture typically change under full fine-tuning, and which parts stay frozen under adapter-based tuning?
- Explain how prefix tuning uses learned past key/value states to condition generation while keeping Transformer weights unchanged.
- Compare adapters and LoRA: where do the trainable parameters live, and what kind of transformation do they approximate or replace?
Key Points
1. Parameter-efficient fine-tuning freezes the pretrained Transformer backbone and trains only small task-specific parameters, reducing storage and distribution costs across many tasks.
2. Adapters insert bottleneck networks with skip connections into Transformer blocks; only adapter weights are updated, and performance can approach full fine-tuning on benchmarks like GLUE.
3. AdapterHub standardizes how adapters are published and reused by integrating with Hugging Face Transformers via hooks that insert adapter modules into the residual stream.
4. LoRA reduces trainable parameter count by expressing weight updates as low-rank matrix decompositions, targeting key linear transformations (often attention projections).
5. Prefix tuning conditions a frozen model by learning task-specific “virtual” past key/value states that are prepended to the attention cache.
6. Prompt tuning, P-tuning, and related prompt-learning variants differ mainly in where the learnable task parameters are injected (embedding side vs. attention key/value side) and how they are generated (e.g., an LSTM prompt encoder).
7. Lightweight modulation methods like IA3 adjust intermediate activations with small learned scaling parameters rather than adding new networks.