
Fine-Tuning LLM on Your Data using Single GPU | Sentiment Analysis for Cryptocurrency Tweets

Venelin Valkov · 6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Fine-tuning Qwen 3 on ~2,000 labeled crypto tweets using a single 16GB T4 can raise sentiment-and-ticker extraction accuracy from about 46% to 66.5% on a 200-example test set.

Briefing

Fine-tuning Qwen 3 on a small, sentiment-labeled cryptocurrency tweet dataset can deliver a sizable accuracy jump, even when training runs on a single, modest 16GB GPU (a T4). The core result: after supervised fine-tuning with LoRA-style adapters on roughly 2,000 training examples, sentiment-and-ticker extraction accuracy rises from about 46% (baseline, untrained) to about 66.5% on the full 200-example test set, an improvement of roughly 20 percentage points (described in the video as roughly 20–25%).

The workflow starts by turning a tabular dataset into a Hugging Face training set built for chat-style instruction tuning. The source data comes from a financial tweets crypto dataset by Stephan Akkerman, hosted on Hugging Face. The dataset includes tweet text plus ticker labels and sentiment, but it also contains URLs and sometimes embedded images. To keep the task focused on text, tweets containing images are removed, tweets shorter than 100 characters are filtered out, and URLs are stripped using a regular expression. The remaining examples are split into parquet training and test files (about 2,000 train rows and ~200 test rows), with deduplication efforts to reduce leakage.
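A minimal sketch of the text-cleanup step described above (the regular expression and the helper name are illustrative, not the notebook's actual code):

```python
import re

# Matches http(s) URLs up to the next whitespace (an assumption; the
# notebook's exact pattern is not shown in the summary).
URL_RE = re.compile(r"https?://\S+")

def clean_tweet(text: str):
    """Strip URLs, then drop tweets shorter than 100 characters.
    Returns the cleaned text, or None if the tweet should be filtered out."""
    cleaned = URL_RE.sub("", text).strip()
    return cleaned if len(cleaned) >= 100 else None
```

A kept example can then be written to the parquet train or test split; tweets that come back as `None` are dropped.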

A key twist is how the labels are generated and used. Tweet text is run through Gemini 2.5 Flash to extract (1) the sentiment class (bullish/neutral/bearish), (2) the relevant cryptocurrency tickers, and (3) a short “sentiment reasoning” sentence explaining why those labels were chosen. That reasoning sentence becomes part of the training target, not just the final classification. The dataset therefore includes both the structured JSON output (sentiment + tickers) and a natural-language justification.
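For concreteness, one training target might look like the following (field names and example values are assumptions based on the description, not the notebook's exact schema):

```python
import json

# Hypothetical label record: the sentiment class, the tickers mentioned,
# and the Gemini-generated reasoning sentence that is also trained on.
label = {
    "sentiment": "bullish",
    "tickers": ["BTC"],
    "sentiment_reasoning": "The tweet anticipates a price breakout for Bitcoin.",
}
target_text = json.dumps(label)
```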

Before training, the notebook checks label balance and token lengths. The sentiment distribution is roughly balanced between neutral and bullish, while bearish is underrepresented (a potential red flag, though no rebalancing is performed in this run). Tokenization is done using the model’s chat template, and sequence lengths are measured; most examples fall well under 512 tokens, keeping training fast.
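The length check can be sketched as follows, assuming `tokenizer` is the model's Hugging Face tokenizer and each example stores its chat turns under a `messages` key (both assumptions):

```python
# Tokenize each example with the model's chat template and count how many
# exceed the target context length.
def sequence_length_report(examples, tokenizer, max_len=512):
    lengths = [
        len(tokenizer.apply_chat_template(ex["messages"], tokenize=True))
        for ex in examples
    ]
    too_long = sum(1 for n in lengths if n > max_len)
    return lengths, too_long
```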

The baseline evaluation highlights why fine-tuning is needed. On a small sample of 50 test prompts, the untrained model produces parseable output in 43 cases, but reaches only about 46.5% accuracy. Failures include outputs that get stuck in long “thinking” loops and cases where sentiment and tickers are both wrong.
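A hedged sketch of how such an evaluation can separate parse failures from wrong answers, assuming the model wraps its reasoning in `<think>...</think>` tags as Qwen-style models do:

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_prediction(reply: str):
    """Strip the thinking block, then try to parse the remaining JSON.
    Returns a dict on success, or None on a parse failure."""
    body = THINK_RE.sub("", reply).strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Accuracy can then be computed only over the replies that parse, while the parse-failure count is tracked separately.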

Training uses 4-bit quantization (bitsandbytes) to fit the model on the T4, then applies supervised fine-tuning (SFT) with LoRA adapters. The adapters cover only a small fraction of the network (about 20 million trainable parameters, roughly 3.3% of the ~600M-parameter model), and the degradation from quantization is described as negligible. Two epochs are used; the second epoch improves test accuracy by about 6–7%. Training runs about 31 minutes, with warmup and a linear learning-rate schedule.
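With transformers, peft, and trl, the setup described above might look roughly like this (the model id, LoRA rank, warmup ratio, and output directory are assumptions; the notebook's exact hyperparameters are not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

model_id = "Qwen/Qwen3-0.6B"  # assumed id for the ~600M Qwen 3 variant

# 4-bit NF4 quantization with bfloat16 compute, as described in the summary.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters; the rank here is a guess aimed at the ~20M trainable
# parameters the summary mentions.
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Two epochs with warmup and a linear learning-rate schedule.
training_args = SFTConfig(
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    output_dir="qwen3-crypto-sft",
)
```

These objects would then be handed to trl's `SFTTrainer` together with the train split.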

After merging the adapter back into the base model, inference becomes faster because the fine-tuned model’s “thinking” output is much shorter. On the full 200-example test set, accuracy reaches 66.5%, with the notebook also noting that JSON parsing succeeds more reliably than in the baseline evaluation. Experiments around “completion-only” training are discussed: excluding the reasoning tokens can preserve sentiment/ticker quality but hurts the model’s ability to generate the reasoning that the evaluation expects. Finally, a practical mitigation for rambling is introduced via a “max thinking tokens” processor inspired by Zach Mueller’s approach—forcing an end-of-thinking token once a budget is exceeded.

Overall, the run demonstrates that careful prompt formatting, label preprocessing, and adapter-based fine-tuning can materially improve structured sentiment extraction on crypto tweets, even under tight hardware constraints.

Cornell Notes

The notebook fine-tunes Qwen 3 (a ~600M-parameter instruction model) for cryptocurrency tweet sentiment and ticker extraction using only a single 16GB T4 GPU. Labels come from Gemini 2.5 Flash, including bullish/neutral/bearish sentiment, a list of tickers, and a short “sentiment reasoning” sentence that becomes part of the training target. After supervised fine-tuning with 4-bit quantization and LoRA-style adapters (about 20M trainable parameters), test accuracy rises from roughly 46% (untrained baseline on 50 examples) to 66.5% on 200 test examples. The improvement is attributed to structured chat prompting plus training on the reasoning tokens, which also reduces long “thinking” loops during inference.

How were the tweet dataset and labels prepared so the model could learn sentiment + tickers reliably?

The workflow starts from a financial tweets crypto dataset by Stephan Akkerman on Hugging Face. Tweets containing images are removed, tweets shorter than 100 characters are filtered out, and URLs are stripped with a regular expression. The remaining examples are split into parquet training and test sets (about 2,000 train rows and ~200 test rows) with deduplication efforts to reduce leakage. Gemini 2.5 Flash is used to extract (1) sentiment (bullish/neutral/bearish), (2) the cryptocurrency tickers mentioned, and (3) a one-sentence “sentiment reasoning” justification. That reasoning sentence is included in the training target alongside the structured JSON output.

Why does the notebook include “sentiment reasoning” in training instead of only training on the final sentiment/ticker labels?

The reasoning sentence is treated as part of the model’s output during supervised fine-tuning, so the model learns not just what labels to output but also how to justify them in the expected format. The notebook reports that experiments using “completion-only” (excluding the thinking/reasoning tokens from loss) can preserve sentiment/ticker quality while degrading the model’s ability to transfer the reasoning behavior. In other words, removing reasoning tokens can reduce the model’s alignment with the evaluation format that expects a justification.
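Conceptually, completion-only training excludes certain token positions from the loss; a toy sketch of that masking, using -100 as the standard ignore index that Hugging Face loss functions skip:

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def mask_labels(input_ids, keep_from):
    """Copy input_ids into labels, ignoring everything before `keep_from`.
    Setting keep_from past the reasoning span reproduces the
    "completion-only" variant discussed above."""
    return [IGNORE_INDEX] * keep_from + list(input_ids[keep_from:])
```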

What baseline behavior motivated fine-tuning, and how was baseline performance measured?

Before training, a text-generation pipeline is used to prompt the untrained Qwen 3 model with XML-formatted instructions and a JSON response template. In a small evaluation of about 50 examples, the model produces parseable output in 43 cases, but accuracy is only about 46.5%. Failures include getting stuck in long “thinking” loops (thousands of tokens) and cases where sentiment and tickers are both wrong. This baseline establishes both correctness and formatting reliability issues that fine-tuning aims to fix.
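A hypothetical reconstruction of that prompt shape (the tag names and JSON template are illustrative, not copied from the notebook):

```python
# XML-tagged instructions plus a JSON template the model is asked to fill in.
PROMPT_TEMPLATE = """<task>
Classify the sentiment of the tweet and list the cryptocurrency tickers it mentions.
</task>
<tweet>
{tweet}
</tweet>
Respond with JSON only:
{{"sentiment": "bullish|neutral|bearish", "tickers": [], "sentiment_reasoning": ""}}
"""

prompt = PROMPT_TEMPLATE.format(tweet="$BTC just broke key resistance!")
```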

How does the training fit within a single T4 GPU, and what parts of the model are actually updated?

The model is loaded in 4-bit quantized form using bitsandbytes, with compute type set to bfloat16 and device mapping set to auto so it lands on the GPU. Instead of full fine-tuning, LoRA-style adapters are trained: the notebook sets a LoRA rank high enough that about 20 million parameters (roughly 3.3% of the 600M model) become trainable. The adapter training uses an SFT configuration with a linear learning-rate schedule and two epochs, then merges the adapter back into the base model for evaluation.
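The merge step can be sketched with peft's `merge_and_unload` (the model id is assumed and the paths are placeholders, not the notebook's actual directories):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach the trained adapter, and bake the LoRA
# weights into the base weights so inference needs no peft wrapper.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```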

What evaluation result shows the fine-tuning worked, and what changed in inference behavior?

On the full 200-example test set, the fine-tuned model reaches 66.5% accuracy, compared with about 46% on the earlier 50-example baseline—an improvement of roughly 20 percentage points. Inference also changes qualitatively: the fine-tuned model’s “thinking” output becomes much shorter, reducing the chance of long rambling and speeding up evaluation (reported as ~7 minutes for all 200 examples, versus much longer for the baseline).

What practical technique is suggested to prevent excessive “thinking” during generation?

A “max thinking tokens” processor is introduced, inspired by a Zach Mueller approach that uses a slider-like thinking budget. The processor takes a max thinking token parameter and, if the model hasn’t stopped after that budget, forces an end-of-thinking token to cut off further generation. The notebook notes this can be applied to further reduce rambling in the fine-tuned model.
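A minimal sketch of such a processor, duck-typed to Hugging Face's `LogitsProcessor` call signature (the end-of-thinking token id is model-specific, e.g. the id of `</think>` for Qwen-style tokenizers; this is not Zach Mueller's exact implementation):

```python
import torch

class MaxThinkingTokens:
    """Once the thinking budget is spent and no end-of-thinking token has
    been generated yet, force that token to be the only possible choice."""

    def __init__(self, max_thinking_tokens: int, end_think_id: int):
        self.max_thinking_tokens = max_thinking_tokens
        self.end_think_id = end_think_id
        self.steps = 0

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        self.steps += 1
        still_thinking = (input_ids == self.end_think_id).sum() == 0
        if still_thinking and self.steps >= self.max_thinking_tokens:
            scores[:] = float("-inf")          # rule out every other token
            scores[:, self.end_think_id] = 0.0  # leave only end-of-thinking
        return scores
```

In practice an instance would be passed to `model.generate(...)` inside a `transformers.LogitsProcessorList`.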

Review Questions

  1. What preprocessing steps were applied to the raw tweets (images, length, URLs), and how might each step affect sentiment/ticker learning?
  2. Why might training with reasoning tokens (instead of completion-only) improve structured sentiment extraction and output formatting?
  3. How do 4-bit quantization and LoRA adapters together make fine-tuning feasible on a 16GB T4, and what fraction of parameters are updated?

Key Points

  1. Fine-tuning Qwen 3 on ~2,000 labeled crypto tweets using a single 16GB T4 can raise sentiment-and-ticker extraction accuracy from about 46% to 66.5% on a 200-example test set.

  2. Gemini 2.5 Flash labels include not only sentiment and tickers but also a short “sentiment reasoning” sentence, and that reasoning is included in the training target.

  3. Dataset hygiene matters: tweets with images are removed, URLs are stripped, short tweets are filtered, and deduplication plus a fixed train/test split helps reduce leakage.

  4. Baseline inference can fail in two ways (low accuracy and long “thinking” loops), so evaluation must consider both correctness and output parsing reliability.

  5. The training strategy uses 4-bit quantization (bitsandbytes) plus LoRA-style adapters, updating ~20M parameters (~3.3% of the 600M model) rather than full fine-tuning.

  6. Two epochs of SFT improved test accuracy by roughly 6–7%, while sequence lengths were kept well under 512 tokens to maintain fast training.

  7. A max-thinking-token cutoff processor can force an end-of-thinking token to prevent rambling during generation, improving practical usability.

Highlights

Accuracy jumps from ~46% (untrained baseline on 50 examples) to 66.5% on 200 test examples after adapter fine-tuning on a single T4.
Training includes “sentiment reasoning” as supervised output; excluding it via completion-only can preserve sentiment/tickers but harms reasoning transfer.
The untrained model sometimes gets stuck generating long “thinking” sequences, motivating both fine-tuning and generation-time thinking limits.
LoRA adapters train only ~20M parameters (~3.3% of the 600M model) on top of a 4-bit quantized backbone, making the run feasible on 16GB VRAM.
A max-thinking-token processor is presented as a practical fix to cap reasoning length by injecting an end-of-thinking token.

Mentioned

  • Stephan Akkerman
  • Zach Mueller
  • LLM
  • GPU
  • VRAM
  • SFT
  • LoRA
  • JSON
  • T4
  • 4-bit