
Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Fine-tune TinyLlama with LoRA to update only adapter weights, making training practical on a single GPU like a T4.

Briefing

Fine-tuning a “tiny” LLM on a custom dataset can deliver strong sentiment and topic predictions using a single GPU—provided the training setup is tuned carefully. The core workflow pairs TinyLlama with LoRA adapters, formats each example so it fits within the model’s context window, trains only the adapter weights, then merges the adapter back into the base model for straightforward inference. The payoff is practical: faster training and inference than larger models, plus the ability to keep private company data inside the fine-tuning pipeline.

The case study targets two outputs from crypto news: (1) sentiment (positive/neutral/negative) and (2) subject (Bitcoin, altcoin, blockchain, Ethereum, NFT, DeFi). Instead of relying on prompt engineering alone, the approach assumes tiny models won’t match larger systems on benchmarks without adaptation. Fine-tuning is also framed as a way to reduce prompt length—using a compact template that elicits structured outputs—while avoiding data exposure by training on internal datasets.

Dataset preparation centers on quality and distribution. The transcript recommends more than a thousand high-quality examples and uses a stratified train/validation/test split to preserve label frequencies across subjects and sentiment classes. The crypto dataset (Crypton News) is structured with title, text, subject, and sentiment fields, while additional sentiment metadata (like subjectivity) is used only for analysis, not prediction. A key warning emerges from the real-world distribution: the data is biased toward Bitcoin and blockchain, and sentiment is skewed toward neutral/positive. The example keeps the original distributions rather than applying rebalancing techniques, but notes oversampling/undersampling as possible remedies.
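
As a sketch, the stratified three-way split can be done with scikit-learn by stratifying on the subject–sentiment combination. The column names and split ratios below are illustrative, not taken from the video:

```python
# Sketch of a stratified train/validation/test split, assuming the dataset
# is a pandas DataFrame with "subject" and "sentiment" columns (illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(df: pd.DataFrame, seed: int = 42):
    # Stratify on the subject+sentiment combination so both label
    # distributions are preserved in every split.
    strata = df["subject"] + "_" + df["sentiment"]
    train, rest = train_test_split(
        df, test_size=0.3, stratify=strata, random_state=seed
    )
    rest_strata = rest["subject"] + "_" + rest["sentiment"]
    val, test = train_test_split(
        rest, test_size=0.5, stratify=rest_strata, random_state=seed
    )
    return train, val, test
```

Stratifying on the combined label keeps rare pairs (e.g. negative NFT news) represented in every split, which matters given the Bitcoin/neutral skew noted above.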

On the modeling side, the setup loads TinyLlama (non-chat variant) and adds a padding token to the tokenizer, then resizes token embeddings so the model can pad correctly. Incorrect padding can cause repetition artifacts, so padding configuration is treated as a critical detail. The transcript also checks token counts after applying a custom template (title + text + a structured “prediction” section containing subject and sentiment) to ensure inputs stay well under the model’s 2048 context window.

LoRA is configured to make training feasible on a free-tier Google Colab T4 GPU. Rather than updating the full model, LoRA injects trainable low-rank matrices into targeted layers—specifically self-attention and MLP linear layers—while the base weights remain frozen. The example uses a LoRA rank (r) of 128 and LoRA alpha scaled accordingly, with a small dropout. Training uses fp16 and a standard AdamW optimizer (no quantized optimizer), runs for one epoch, and employs a completion-only loss strategy via a data collator that masks tokens outside the “prediction” template.
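
A minimal sketch of this configuration with the peft and transformers libraries. The `target_modules` names follow the standard Llama layer naming; the alpha, dropout, and output-path values are assumptions, not values confirmed from the video:

```python
# Sketch of the LoRA + training configuration (values partly illustrative).
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=128,                   # LoRA rank from the example
    lora_alpha=256,          # alpha scaled with the rank (assumed 2*r)
    lora_dropout=0.05,       # small dropout (illustrative value)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # self-attention linear layers
        "gate_proj", "up_proj", "down_proj",     # MLP linear layers
    ],
)

training_args = TrainingArguments(
    output_dir="tinyllama-crypto-lora",  # hypothetical path
    num_train_epochs=1,                  # single epoch, as in the example
    fp16=True,                           # fp16 training
    optim="adamw_torch",                 # standard (non-quantized) AdamW
)
```

With the config applied via `get_peft_model`, `model.print_trainable_parameters()` reports the trainable fraction, which is how the ~101M figure cited below can be checked.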

Evaluation shows the practical results: on a test set of about 1,242 examples, subject prediction accuracy lands around 78.6% using a rough metric, while sentiment accuracy reaches just over 90%. The transcript includes qualitative samples where label noise appears—sometimes the model’s sentiment or subject seems more reasonable than the provided labels—suggesting that real-world datasets can be imperfect even when they look structured. After training, the adapter is merged back into the base model, saved with the tokenizer, and used through a text-generation pipeline to produce structured outputs from new titles and article text.
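
After generation, the structured block still has to be parsed back into fields for evaluation. A minimal sketch, assuming the output contains `subject:` and `sentiment:` lines (the video's exact template may differ):

```python
# Sketch of parsing the model's structured output back into fields.
# Assumes one "subject: ..." and one "sentiment: ..." line in the output.
def parse_prediction(generated: str) -> dict:
    out = {}
    for line in generated.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            key = key.strip().lower()
            if key in ("subject", "sentiment"):
                out[key] = value.strip()
    return out
```

Comparing the parsed fields against the test labels gives the "rough metric" accuracy figures reported above.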

Cornell Notes

TinyLlama can be fine-tuned on a custom dataset for structured predictions (crypto news subject + sentiment) using LoRA on a single GPU. The workflow prepares a stratified train/validation/test split, formats each example with a consistent template that fits within TinyLlama’s 2048 context window, and adds a padding token so training doesn’t produce repetition artifacts. LoRA trains only adapter parameters by injecting low-rank matrices into TinyLlama’s self-attention and MLP layers (rank r=128 in the example), keeping the base model frozen. After one epoch of fp16 training, merging the LoRA adapter back into the base model yields strong results: ~78.6% subject accuracy and just over 90% sentiment accuracy on a test set of ~1,242 examples. This matters because it enables private-data fine-tuning without the cost of larger models.

Why choose TinyLlama (a “tiny” model) instead of a larger LLM for this task?

The transcript highlights three practical reasons: (1) smaller parameter counts make training and inference faster, (2) tiny models can be fine-tuned on older or less powerful GPUs (the example uses a T4), and (3) despite being small, TinyLlama variants are trained on very large token corpora (the example cites TinyLlama trained on more than 3 trillion tokens). The tradeoff is that tiny models often need fine-tuning to reach strong benchmark performance on specific tasks.

What dataset preparation steps most affect fine-tuning quality in this workflow?

Two steps stand out: data quality and label distribution. The transcript recommends more than a thousand high-quality examples and suggests human review to catch “shady” points. It then uses a stratified split so each subject (Bitcoin, altcoin, blockchain, Ethereum, NFT, DeFi) and each sentiment class (positive/neutral/negative) appears proportionally in train, validation, and test sets. It also notes that real-world distributions can be biased (e.g., heavy Bitcoin/blockchain bias), which can influence accuracy.

How does the setup ensure the model doesn’t exceed its context window?

After building a custom template (title + text + a structured prediction block containing subject and sentiment), the transcript counts tokens per formatted example and verifies they fit within TinyLlama’s 2048 context window. In the example, inputs end up needing at most ~200 tokens, so context truncation isn’t a concern.
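
The check can be sketched as follows; the template wording is illustrative rather than the video's exact prompt, and `tokenizer` is assumed to be any callable returning `{"input_ids": [...]}` (as Hugging Face tokenizers do):

```python
# Sketch of the context-window check. Template layout is illustrative.
def format_example(title: str, text: str, subject: str, sentiment: str) -> str:
    return (
        f"### Title\n{title}\n\n"
        f"### Text\n{text}\n\n"
        f"### Prediction\nsubject: {subject}\nsentiment: {sentiment}"
    )

def fits_context(tokenizer, example: str, max_len: int = 2048) -> bool:
    # True when the formatted example fits in TinyLlama's context window.
    return len(tokenizer(example)["input_ids"]) <= max_len
```

Running `fits_context` over every formatted example confirms truncation is a non-issue before training starts.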

Why add a padding token and resize embeddings when fine-tuning?

The transcript treats padding as a correctness issue. It adds a padding token to the tokenizer, sets padding side to right, and resizes the model’s token embeddings to match the expanded vocabulary. Without correct padding, the model can start repeating the last tokens during generation/training, producing degraded outputs.
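
A sketch of this padding setup, assuming Hugging Face-style tokenizer and model objects; the `<pad>` literal is an assumption:

```python
# Sketch of the padding fix: register a dedicated pad token, pad on the
# right, and grow the embedding matrix so the new token id has a row.
def configure_padding(tokenizer, model, pad_token: str = "<pad>"):
    tokenizer.add_special_tokens({"pad_token": pad_token})
    tokenizer.padding_side = "right"
    model.resize_token_embeddings(len(tokenizer))  # match the grown vocab
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

Forgetting the `resize_token_embeddings` step leaves the new pad id without an embedding row, which is one way the repetition artifacts described above can appear.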

What exactly does LoRA change, and why does it make single-GPU fine-tuning feasible?

LoRA freezes the base model and trains only small adapter parameters inserted into specific layers. The example targets TinyLlama’s self-attention linear layers and MLP linear layers. With LoRA rank r=128 (and LoRA alpha scaled), the number of trainable parameters is kept manageable (the transcript estimates ~101M trainable parameters, about 8.4% of the original model). This reduces memory and compute enough to train on a T4.

How is the loss computed so the model learns the “prediction” portion rather than the whole prompt?

A completion-only training strategy masks tokens outside the prediction template. The data collator sets labels to -100 for everything before the template, so loss is computed only on the tokens corresponding to the subject and sentiment fields. The transcript notes that this can reduce unnecessary learning on the input text and speed up/clean up training.
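
The masking logic can be sketched in plain Python (libraries such as trl ship a ready-made collator for this; the token ids below are illustrative):

```python
# Sketch of completion-only label masking: everything before the end of the
# response-template token span is set to -100 so it is excluded from the loss.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list, response_ids: list) -> list:
    # Scan for the response template; mask all tokens up to and including it.
    for start in range(len(input_ids) - len(response_ids) + 1):
        if input_ids[start:start + len(response_ids)] == response_ids:
            boundary = start + len(response_ids)
            return [IGNORE_INDEX] * boundary + input_ids[boundary:]
    # Template not found: mask everything so the example contributes no loss.
    return [IGNORE_INDEX] * len(input_ids)
```

In the actual pipeline the same idea is applied per batch by the data collator, with `response_ids` being the tokenized prediction-template marker.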

Review Questions

  1. What are the consequences of incorrect padding configuration during fine-tuning, and how does the workflow prevent them?
  2. How does stratified splitting influence evaluation reliability for multi-class subject and sentiment labels?
  3. Why does masking labels to -100 outside the prediction template matter for training behavior and output structure?

Key Points

  1. Fine-tune TinyLlama with LoRA to update only adapter weights, making training practical on a single GPU like a T4.

  2. Use a stratified train/validation/test split to preserve label distributions across both subject classes and sentiment classes.

  3. Format each example with a consistent template and verify token counts stay well within TinyLlama’s 2048 context window.

  4. Add and configure a padding token (including padding side) and resize token embeddings to avoid repetition artifacts.

  5. Target LoRA injection points in self-attention and MLP linear layers; choose LoRA rank (r=128 in the example) and scale LoRA alpha with the rank to control the adapter’s contribution.

  6. Train with fp16 and a completion-only loss strategy by masking non-target tokens (labels set to -100) so the model learns the structured output fields.

  7. Merge the trained LoRA adapter back into the base model for simple inference via a standard text-generation pipeline.

Highlights

LoRA makes single-GPU fine-tuning feasible by training only low-rank adapters inserted into TinyLlama’s self-attention and MLP layers.
Correct padding token handling (plus embedding resizing) is treated as essential to prevent the model from repeating recent tokens.
A completion-only loss mask (labels = -100 outside the prediction block) focuses learning on the subject/sentiment fields.
On ~1,242 test examples, the example run reports ~78.6% subject accuracy and just over 90% sentiment accuracy, despite label noise in the dataset.
