Fine-Tuning LLM on Your Data using Single GPU | Sentiment Analysis for Cryptocurrency Tweets
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Fine-tuning Qwen 3 on ~2,000 labeled crypto tweets using a single 16GB T4 can raise sentiment-and-ticker extraction accuracy from about 46% to 66.5% on a 200-example test set.
Briefing
Fine-tuning Qwen 3 on a small, sentiment-labeled cryptocurrency tweet dataset can deliver a sizable accuracy jump, even when training runs on a single, modest 16GB GPU (a T4). The core result: after supervised fine-tuning with LoRA-style adapters on roughly 2,000 training examples, sentiment-and-ticker extraction accuracy rises from about 46% (baseline, untrained) to about 66.5% on the full 200-example test set, an improvement of roughly 20 percentage points (described in the video as roughly 20–25% overall).
The workflow starts by turning a tabular dataset into a Hugging Face training set built for chat-style instruction tuning. The source data comes from a financial tweets crypto dataset by Stephan Akkerman, hosted on Hugging Face. The dataset includes tweet text plus ticker labels and sentiment, but it also contains URLs and sometimes embedded images. To keep the task focused on text, tweets containing images are removed, tweets shorter than 100 characters are filtered out, and URLs are stripped with a regular expression. The remaining examples are split into parquet training and test files (about 2,000 train rows and ~200 test rows), with deduplication applied to reduce leakage between the splits.
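A minimal sketch of that preprocessing, assuming hypothetical column names (`tweet_text`, `has_media`) and placeholder file paths; the notebook's actual schema and split logic may differ:

```python
import re

import pandas as pd

# Rough sketch only: "tweet_text" and "has_media" are assumed column names,
# and the file names are placeholders, not the notebook's actual paths.
URL_RE = re.compile(r"https?://\S+")

df = pd.read_parquet("financial_tweets_crypto.parquet")

df = df[~df["has_media"]]                                   # drop tweets with embedded images
df["tweet_text"] = df["tweet_text"].str.replace(URL_RE, "", regex=True).str.strip()
df = df[df["tweet_text"].str.len() >= 100]                  # keep only longer tweets
df = df.drop_duplicates(subset="tweet_text")                # deduplicate to reduce leakage

test_df = df.sample(n=200, random_state=42)                 # fixed test split
train_df = df.drop(test_df.index).head(2000)

train_df.to_parquet("train.parquet", index=False)
test_df.to_parquet("test.parquet", index=False)
```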
A key twist is how the labels are generated and used. Tweet text is run through Gemini 2.5 Flash to extract (1) the sentiment class (bullish/neutral/bearish), (2) the relevant cryptocurrency tickers, and (3) a short “sentiment reasoning” sentence explaining why those labels were chosen. That reasoning sentence becomes part of the training target, not just the final classification. The dataset therefore includes both the structured JSON output (sentiment + tickers) and a natural-language justification.
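For reference, one chat-formatted training example might look roughly like the following; the field names, system prompt, and sample tweet are illustrative assumptions, not the notebook's exact format:

```python
# Illustrative only: field names, the system prompt, and the sample tweet are
# assumptions; the Gemini-generated labels follow the structure described above.
example = {
    "messages": [
        {
            "role": "system",
            "content": "Extract the sentiment and cryptocurrency tickers from the tweet. Reply with JSON.",
        },
        {
            "role": "user",
            "content": "ETH just broke $4k on record volume, alts are following.",
        },
        {
            "role": "assistant",
            # The reasoning sentence is part of the training target, not just the JSON.
            "content": (
                "The tweet highlights a price breakout on strong volume, which signals "
                "bullish sentiment for ETH.\n"
                '{"sentiment": "bullish", "tickers": ["ETH"]}'
            ),
        },
    ]
}
```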
Before training, the notebook checks label balance and token lengths. The sentiment distribution is roughly balanced between neutral and bullish, while bearish is underrepresented, a potential red flag, though no rebalancing is performed in this run. Tokenization is done using the model's chat template, and sequence lengths are measured; most examples fall well under 512 tokens, keeping training fast.
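A sketch of that length check, assuming a Qwen3-0.6B tokenizer (the exact base model checkpoint is an assumption here) and the `messages` format shown above:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed base model and local parquet file; adjust to the actual notebook setup.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
train_ds = load_dataset("parquet", data_files={"train": "train.parquet"})["train"]

def count_tokens(row):
    # Render the chat messages with the model's chat template, then count tokens.
    text = tokenizer.apply_chat_template(row["messages"], tokenize=False)
    return {"n_tokens": len(tokenizer(text).input_ids)}

lengths = train_ds.map(count_tokens)["n_tokens"]
print("max length:", max(lengths))
print("examples over 512 tokens:", sum(n > 512 for n in lengths))
```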
The baseline evaluation highlights why fine-tuning is needed. On a small sample of 50 test prompts, the untrained model produces parseable output 43 times but reaches only about 46.5% accuracy. Failures include outputs that get stuck in long “thinking” loops and cases where both sentiment and tickers are wrong.
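One plausible way to score each example, consistent with the notebook's emphasis on both parsing reliability and correctness (the exact scoring code is not shown in these notes, so treat this as an assumption):

```python
import json
import re

def parse_prediction(generated_text):
    """Pull the first JSON object out of the model's output; None means a parse failure."""
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def is_correct(pred, label):
    """Require both the sentiment class and the ticker set to match the reference label."""
    return (
        pred is not None
        and pred.get("sentiment") == label["sentiment"]
        and sorted(pred.get("tickers", [])) == sorted(label["tickers"])
    )
```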
Training uses 4-bit quantization (bitsandbytes) to fit the model on the T4, then applies supervised fine-tuning (SFT) with LoRA adapters. Adapter training targets only a small fraction of parameters (about 20 million trainable, roughly 3.3% of the 600M model), so degradation is described as negligible. Two epochs are used; the second epoch improves test accuracy by about 6–7%. Training runs for about 31 minutes, with warmup and a linear learning-rate schedule.
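A minimal sketch of that setup with Hugging Face transformers, peft, and trl; the model checkpoint, LoRA targets, and hyperparameter values below are plausible defaults, not the notebook's exact configuration:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization so the base model fits comfortably in 16GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # the T4 has no bfloat16 support
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters on the attention projections; only these tens of millions
# of parameters are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,          # chat-formatted dataset from the earlier sketch
    peft_config=lora_config,
    args=SFTConfig(
        output_dir="qwen3-crypto-sentiment-lora",  # placeholder output path
        num_train_epochs=2,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_ratio=0.03,
    ),
)
trainer.train()
```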
After merging the adapter back into the base model, inference becomes faster because the fine-tuned model’s “thinking” output is much shorter. On the full 200-example test set, accuracy reaches 66.5%, with the notebook also noting that JSON parsing succeeds more reliably than in the baseline evaluation. Experiments around “completion-only” training are discussed: excluding the reasoning tokens can preserve sentiment/ticker quality but hurts the model’s ability to generate the reasoning that the evaluation expects. Finally, a practical mitigation for rambling is introduced via a “max thinking tokens” processor inspired by Zach Mueller’s approach—forcing an end-of-thinking token once a budget is exceeded.
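A sketch of such a processor; the `</think>` token lookup and the budget value are assumptions, and the notebook's exact implementation of Zach Mueller's approach may differ:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class MaxThinkingTokensProcessor(LogitsProcessor):
    """Force the end-of-thinking token once the thinking budget is exhausted."""

    def __init__(self, end_think_token_id, max_thinking_tokens, prompt_length):
        self.end_think_token_id = end_think_token_id
        self.max_thinking_tokens = max_thinking_tokens
        self.prompt_length = prompt_length
        self.closed = False  # set once the end-of-thinking token has been emitted

    def __call__(self, input_ids, scores):
        if self.closed:
            return scores
        if (input_ids[:, -1] == self.end_think_token_id).all():
            self.closed = True  # thinking ended on its own; stop intervening
            return scores
        generated = input_ids.shape[1] - self.prompt_length
        if generated >= self.max_thinking_tokens:
            # Budget spent: mask everything except the end-of-thinking token.
            forced = torch.full_like(scores, float("-inf"))
            forced[:, self.end_think_token_id] = 0.0
            return forced
        return scores

# Usage sketch (token string and budget are assumptions):
# processor = MaxThinkingTokensProcessor(
#     end_think_token_id=tokenizer.convert_tokens_to_ids("</think>"),
#     max_thinking_tokens=128,
#     prompt_length=inputs["input_ids"].shape[1],
# )
# outputs = model.generate(**inputs, logits_processor=LogitsProcessorList([processor]))
```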
Overall, the run demonstrates that careful prompt formatting, label preprocessing, and adapter-based fine-tuning can materially improve structured sentiment extraction on crypto tweets, even under tight hardware constraints.
Cornell Notes
The notebook fine-tunes Qwen 3 (a ~600M instruction model) for cryptocurrency tweet sentiment and ticker extraction using only a single 16GB T4 GPU. Labels come from Gemini 2.5 Flash, including bullish/neutral/bearish sentiment, a list of tickers, and a short “sentiment reasoning” sentence that becomes part of the training target. After supervised fine-tuning with 4-bit quantization and LoRA-style adapters (about 20M trainable parameters), test accuracy rises from roughly 46% (untrained baseline on 50 examples) to 66.5% on 200 test examples. The improvement is attributed to structured chat prompting plus training on the reasoning tokens, which also reduces long “thinking” loops during inference.
How were the tweet dataset and labels prepared so the model could learn sentiment + tickers reliably?
Why does the notebook include “sentiment reasoning” in training instead of only training on the final sentiment/ticker labels?
What baseline behavior motivated fine-tuning, and how was baseline performance measured?
How does the training fit within a single T4 GPU, and what parts of the model are actually updated?
What evaluation result shows the fine-tuning worked, and what changed in inference behavior?
What practical technique is suggested to prevent excessive “thinking” during generation?
Review Questions
- What preprocessing steps were applied to the raw tweets (images, length, URLs), and how might each step affect sentiment/ticker learning?
- Why might training with reasoning tokens (instead of completion-only) improve structured sentiment extraction and output formatting?
- How do 4-bit quantization and LoRA adapters together make fine-tuning feasible on a 16GB T4, and what fraction of parameters are updated?
Key Points
1. Fine-tuning Qwen 3 on ~2,000 labeled crypto tweets using a single 16GB T4 can raise sentiment-and-ticker extraction accuracy from about 46% to 66.5% on a 200-example test set.
2. Gemini 2.5 Flash labels include not only sentiment and tickers but also a short “sentiment reasoning” sentence, and that reasoning is included in the training target.
3. Dataset hygiene matters: tweets with images are removed, URLs are stripped, short tweets are filtered out, and deduplication plus a fixed train/test split helps reduce leakage.
4. Baseline inference can fail in two ways (low accuracy and long “thinking” loops), so evaluation must consider both correctness and output-parsing reliability.
5. The training strategy uses 4-bit quantization (bitsandbytes) plus LoRA-style adapters, updating ~20M parameters (~3.3% of the 600M model) rather than full fine-tuning.
6. Two epochs of SFT improved test accuracy by roughly 6–7%, while sequence lengths were kept well under 512 tokens to maintain fast training.
7. A max-thinking-token cutoff processor can force an end-of-thinking token to prevent rambling during generation, improving practical usability.