Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Fine-tune TinyLlama with LoRA to update only adapter weights, making training practical on a single GPU like a T4.
Briefing
Fine-tuning a “tiny” LLM on a custom dataset can deliver strong sentiment and topic predictions using a single GPU—provided the training setup is tuned carefully. The core workflow pairs TinyLlama with LoRA adapters, formats each example so it fits within the model’s context window, trains only the adapter weights, then merges the adapter back into the base model for straightforward inference. The payoff is practical: faster training and inference than larger models, plus the ability to keep private company data inside the fine-tuning pipeline.
The case study targets two outputs from crypto news: (1) sentiment (positive/neutral/negative) and (2) subject (Bitcoin, altcoin, blockchain, Ethereum, NFT, DeFi). Instead of relying on prompt engineering alone, the approach assumes tiny models won’t match larger systems on benchmarks without adaptation. Fine-tuning is also framed as a way to reduce prompt length—using a compact template that elicits structured outputs—while avoiding data exposure by training on internal datasets.
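The compact template described above might look like the following sketch. The function name and exact field wording are assumptions for illustration, not the video's verbatim template:

```python
# Hypothetical sketch of the compact prompt template; the field names and
# layout are assumptions, not the video's verbatim template.
def format_example(title: str, text: str, subject: str = "", sentiment: str = "") -> str:
    """Render an article plus a structured 'prediction' section; leave the
    prediction fields empty at inference time so the model completes them."""
    return (
        f"title: {title}\n"
        f"text: {text}\n"
        f"prediction:\n"
        f"subject: {subject}\n"
        f"sentiment: {sentiment}"
    )

print(format_example("Bitcoin rallies past $40K", "BTC rose 5% today...",
                     subject="bitcoin", sentiment="positive"))
```

At training time the prediction fields are filled with gold labels; at inference time the same template ends at `sentiment:` and the model generates the rest.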
Dataset preparation centers on quality and distribution. The transcript recommends more than a thousand high-quality examples and uses a stratified train/validation/test split to preserve label frequencies across subjects and sentiment classes. The crypto dataset (Crypton News) is structured with title, text, subject, and sentiment fields, while additional sentiment metadata (like subjectivity) is used only for analysis, not prediction. A key warning emerges from the real-world distribution: the data is biased toward Bitcoin and blockchain, and sentiment is skewed toward neutral/positive. The example keeps the original distributions rather than applying rebalancing techniques, but notes oversampling/undersampling as possible remedies.
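The stratified split can be sketched with scikit-learn's `train_test_split`; the toy DataFrame below is illustrative and stands in for the real title/text/subject/sentiment data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the crypto news data; the skew toward
# neutral/positive sentiment mirrors the distribution described above.
df = pd.DataFrame({
    "title": [f"t{i}" for i in range(100)],
    "text": [f"x{i}" for i in range(100)],
    "subject": ["bitcoin"] * 60 + ["blockchain"] * 40,
    "sentiment": ["neutral"] * 50 + ["positive"] * 30 + ["negative"] * 20,
})

# Stratify on sentiment so each split preserves the skewed label frequencies;
# a 70/15/15 split is an assumption, not the video's exact ratios.
train_df, rest = train_test_split(df, test_size=0.3,
                                  stratify=df["sentiment"], random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.5,
                                   stratify=rest["sentiment"], random_state=42)
```

Stratifying on the rarest or most skewed label is a common choice; stratifying on a combined subject+sentiment key would preserve both distributions at once if every combination has enough examples.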
On the modeling side, the setup loads TinyLlama (non-chat variant) and adds a padding token to the tokenizer, then resizes the token embeddings so the model can pad correctly. Incorrect padding can cause repetition artifacts, so padding configuration is treated as a critical detail. The transcript also checks token counts after applying a custom template (title + text + a structured “prediction” section containing subject and sentiment) to ensure inputs stay well under the model’s 2048-token context window.
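The padding fix can be sketched as a small helper written against the Hugging Face tokenizer/model interface; the helper name and pad-token string are assumptions:

```python
def add_pad_token(tokenizer, model, pad_token: str = "<pad>"):
    """Register a dedicated pad token and resize embeddings (Hugging Face-style API).

    Reusing an existing token (e.g. EOS) for padding can teach the model to emit
    it mid-sequence, producing repetition artifacts; a distinct pad token plus an
    embedding resize keeps padding well-defined.
    """
    tokenizer.add_special_tokens({"pad_token": pad_token})  # extends the vocab
    tokenizer.padding_side = "right"   # pad after the text for causal-LM training
    model.resize_token_embeddings(len(tokenizer))  # adds an embedding row for the pad token
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

With real objects this would be called as `add_pad_token(AutoTokenizer.from_pretrained(...), AutoModelForCausalLM.from_pretrained(...))` on the TinyLlama checkpoint.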
LoRA is configured to make training feasible on a free-tier Google Colab T4 GPU. Rather than updating the full model, LoRA injects trainable low-rank matrices into targeted layers—specifically self-attention and MLP linear layers—while the base weights remain frozen. The example uses a LoRA rank (r) of 128 and LoRA alpha scaled accordingly, with a small dropout. Training uses fp16 and a standard AdamW optimizer (no quantized optimizer), runs for one epoch, and employs a completion-only loss strategy via a data collator that masks tokens outside the “prediction” template.
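The completion-only loss can be illustrated with a minimal masking function, mirroring what a collator along the lines of TRL's `DataCollatorForCompletionOnlyLM` does: every token before (and including) the template's "prediction" marker gets label -100 so cross-entropy ignores it. The function name and token IDs below are toy assumptions:

```python
# Minimal sketch of completion-only label masking. Tokens up to and including
# the first occurrence of the marker subsequence are set to -100, the index
# PyTorch's cross-entropy ignores, so only the prediction fields drive the loss.
IGNORE_INDEX = -100

def mask_prompt_tokens(input_ids, marker_ids):
    """Return labels equal to input_ids, with everything through the first
    occurrence of marker_ids (the 'prediction' marker) masked out."""
    labels = list(input_ids)
    n = len(marker_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == list(marker_ids):
            for j in range(i + n):          # mask the prompt and the marker itself
                labels[j] = IGNORE_INDEX
            break
    return labels

# Toy sequence: [prompt tokens..., marker (7, 8), target tokens (42, 43)]
print(mask_prompt_tokens([1, 2, 3, 7, 8, 42, 43], [7, 8]))
# -> [-100, -100, -100, -100, -100, 42, 43]
```

A real collator additionally pads each batch and masks the padding positions the same way.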
Evaluation shows the practical results: on a test set of about 1,242 examples, subject prediction accuracy lands around 78.6% using a rough metric, while sentiment accuracy reaches just over 90%. The transcript includes qualitative samples where label noise appears—sometimes the model’s sentiment or subject seems more reasonable than the provided labels—suggesting that real-world datasets can be imperfect even when they look structured. After training, the adapter is merged back into the base model, saved with the tokenizer, and used through a text-generation pipeline to produce structured outputs from new titles and article text.
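The merge-and-parse flow might look like the sketch below. `merge_and_unload()` is the actual peft call for folding adapters into the base weights; the helper names, output directory, and output format are assumptions. The merge function is defined but not run here, since it needs a trained PEFT model:

```python
import re

def merge_and_save(peft_model, tokenizer, out_dir="tinyllama-crypto-merged"):
    """Fold LoRA adapter weights into the base model so inference needs no PEFT
    wrapper, then save model and tokenizer together for a text-generation pipeline."""
    merged = peft_model.merge_and_unload()  # peft API: base model with adapters merged in
    merged.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return merged

def parse_prediction(generated: str):
    """Pull the structured subject/sentiment fields out of a model completion."""
    subject = re.search(r"subject:\s*(\w+)", generated)
    sentiment = re.search(r"sentiment:\s*(\w+)", generated)
    return (
        subject.group(1) if subject else None,
        sentiment.group(1) if sentiment else None,
    )

completion = "prediction:\nsubject: bitcoin\nsentiment: positive"
print(parse_prediction(completion))  # -> ('bitcoin', 'positive')
```

Parsing the completion with a lenient regex, rather than exact string matching, is one way to implement the "rough metric" used for the accuracy numbers above.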
Cornell Notes
TinyLlama can be fine-tuned on a custom dataset for structured predictions (crypto news subject + sentiment) using LoRA on a single GPU. The workflow prepares a stratified train/validation/test split, formats each example with a consistent template that fits within TinyLlama’s 2048-token context window, and adds a padding token so training doesn’t produce repetition artifacts. LoRA trains only adapter parameters by injecting low-rank matrices into TinyLlama’s self-attention and MLP layers (rank r=128 in the example), keeping the base model frozen. After one epoch of fp16 training, merging the LoRA adapter back into the base model yields strong results: ~78.6% subject accuracy and just over 90% sentiment accuracy on a test set of ~1,242 examples. This matters because it enables private-data fine-tuning without the cost of larger models.
Why choose TinyLlama (a “tiny” model) instead of a larger LLM for this task?
What dataset preparation steps most affect fine-tuning quality in this workflow?
How does the setup ensure the model doesn’t exceed its context window?
Why add a padding token and resize embeddings when fine-tuning?
What exactly does LoRA change, and why does it make single-GPU fine-tuning feasible?
How is the loss computed so the model learns the “prediction” portion rather than the whole prompt?
Review Questions
- What are the consequences of incorrect padding configuration during fine-tuning, and how does the workflow prevent them?
- How does stratified splitting influence evaluation reliability for multi-class subject and sentiment labels?
- Why does masking labels to -100 outside the prediction template matter for training behavior and output structure?
Key Points
1. Fine-tune TinyLlama with LoRA to update only adapter weights, making training practical on a single GPU like a T4.
2. Use a stratified train/validation/test split to preserve label distributions across both subject classes and sentiment classes.
3. Format each example with a consistent template and verify token counts stay well within TinyLlama’s 2048-token context window.
4. Add and configure a padding token (including padding side) and resize token embeddings to avoid repetition artifacts.
5. Target LoRA injection points in self-attention and MLP linear layers; choose LoRA rank (r=128 in the example) and LoRA alpha to scale learning rate appropriately.
6. Train with fp16 and a completion-only loss strategy by masking non-target tokens (labels set to -100) so the model learns the structured output fields.
7. Merge the trained LoRA adapter back into the base model for simple inference via a standard text-generation pipeline.
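The LoRA injection points and rank mentioned above can be expressed as a peft configuration fragment. The rank r=128 comes from the example; the remaining hyperparameter values are illustrative assumptions:

```python
from peft import LoraConfig, get_peft_model

# Hyperparameters other than r=128 are illustrative assumptions.
lora_config = LoraConfig(
    r=128,            # rank of the low-rank update matrices
    lora_alpha=128,   # scaling factor, commonly set relative to r
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Llama-style self-attention and MLP linear layers:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# model = get_peft_model(base_model, lora_config)  # wraps the frozen base model
```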