
Fine-tuning Llama 3.2 on Your Data with a single GPU | Training LLM for Sentiment Analysis

Venelin Valkov · 6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use TorchTune for LoRA-style single-device fine-tuning, with configurations that target attention modules.

Briefing

Fine-tuning Llama 3.2 (1B) for sentiment classification on a custom mental-health dataset can jump accuracy from roughly 30% to nearly 85% using a single GPU. The workflow builds a labeled dataset from Hugging Face, filters it to fit GPU memory constraints, converts each example into an instruction-style prompt that forces the model to output exactly one category, then applies LoRA-style fine-tuning via TorchTune with quantization-aware options available. After training for one epoch, the tuned model’s predictions stay within the allowed class list and the classification report improves across categories.

The process starts with TorchTune—described as a PyTorch-team library with ready configurations for many LLMs, including Llama variants, and support for single-device fine-tuning. The dataset contains about 53k rows from a Hugging Face repository focused on sentiment analysis for mental health. Each row pairs a text “statement” with a single label (“status”) drawn from seven categories, with a highly skewed class distribution. To reduce memory pressure during tokenization and training, the pipeline approximates word counts by counting spaces and drops long examples—keeping 99.6% of the data after removing statements over 1,000 words. It then caps examples per class (at most 5,000 per status) to rebalance the dataset, producing a smaller training set roughly half the original size.
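The filtering and rebalancing steps above can be sketched with pandas. The column names (“statement”, “status”) follow the summary’s description; the toy rows and the exact pandas calls are illustrative, not the video’s code:

```python
# Sketch of the filtering pipeline described above, assuming a pandas
# DataFrame with "statement" and "status" columns. Toy data for illustration.
import pandas as pd

df = pd.DataFrame({
    "statement": ["I feel fine today", "word " * 1200, "Cannot sleep at all"],
    "status": ["Normal", "Anxiety", "Anxiety"],
})

# Approximate word count by counting spaces, as the pipeline does.
df["num_words"] = df["statement"].str.count(" ") + 1

# Drop very long statements to keep tokenization/training within GPU memory.
df = df[df["num_words"] <= 1000]

# Rebalance: keep at most 5,000 examples per label.
df = df.groupby("status").head(5000)
```

Counting spaces is a cheap proxy for tokenized length; it avoids running the tokenizer over all 53k rows just to decide what to keep.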

For modeling, the approach uses an instruction-tuned Llama 3.2 1B model rather than the base model. The instruction prompt is explicit: “Classify this text for one of the categories… Choose from one of the category only… Reply only with the category,” followed by the statement and the list of valid labels. TorchTune requires input/output columns in a particular format, so the script generates JSON files for training and testing where the input is the prompt and the output is the normalized category label.
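A record-building step of this shape might look like the sketch below. The prompt wording is paraphrased from the summary, and the full seven-label list and file layout are assumptions (the summary only names “depression” and “personality disorder” explicitly):

```python
# Illustrative sketch: turn each (statement, status) row into an
# input/output record for TorchTune. Prompt text and label list are
# assumptions based on the summary, not the exact strings from the video.
import json

CLASSES = ["Normal", "Depression", "Suicidal", "Anxiety",
           "Stress", "Bipolar", "Personality disorder"]

def build_record(statement: str, status: str) -> dict:
    prompt = (
        "Classify this text into one of the categories below. "
        "Choose one category only and reply only with the category.\n\n"
        f"Text: {statement}\n\n"
        f"Categories: {', '.join(CLASSES)}"
    )
    return {"input": prompt, "output": status}

rows = [("I can't stop worrying about everything", "Anxiety")]
records = [build_record(text, label) for text, label in rows]

with open("train.json", "w") as f:
    json.dump(records, f, indent=2)
```

Keeping the output field as the bare, normalized label is what later lets evaluation compare generations against the class list with a simple string match.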

Before fine-tuning, the base instruct model is evaluated using the same prompt template and generation settings (including a token limit). Predictions often include formatting artifacts or labels outside the allowed class names, so the evaluation cleans outputs by stripping anything not in the class list. The resulting classification report is weak—around 30% accuracy—reflecting both the difficulty of the task (even humans can disagree on label assignment) and the model’s limited instruction-following for this specific label set.

Fine-tuning then uses a TorchTune configuration adapted from the library’s LoRA single-device recipe, targeting attention modules with LoRA parameters such as rank and alpha (with zero dropout in the described config). Training runs on a single device (an NVIDIA L4 with 24GB VRAM is mentioned, with T4 as a possible alternative if batch size is reduced). The batch size is set to 4, and optimization uses AdamW with a cosine learning-rate schedule and warmup of 100 steps. Training takes about 35 minutes for roughly 20,000 examples on the L4, and metrics show loss behavior consistent with learning.
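TorchTune runs of this shape are driven by a YAML config. The fragment below is a hedged sketch of the knobs the summary mentions (LoRA on attention modules, zero dropout, batch size 4, AdamW, cosine schedule with 100 warmup steps); key names follow common TorchTune LoRA configs but may differ between versions, and the rank/alpha/learning-rate values are assumptions:

```yaml
# Illustrative config fragment only, not the exact file from the video.
model:
  lora_attn_modules: ["q_proj", "k_proj", "v_proj", "output_proj"]
  lora_rank: 32          # rank/alpha values here are assumptions
  lora_alpha: 64
  lora_dropout: 0.0      # zero dropout, as described

batch_size: 4
epochs: 1

optimizer:
  _component_: torch.optim.AdamW
  lr: 3e-4               # illustrative learning rate

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
```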

Afterward, the tuned checkpoint is packaged into a Hugging Face repository and evaluated with the same prompt and generation pipeline. The tuned model’s outputs remain within the valid categories, and accuracy rises to nearly 85%. The classification report improves across categories, with notable gains in precision for depression and personality disorder, and overall improvements in precision/recall/F1 scores for every label. The takeaway is that when the task is tightly defined (single-label output from a fixed set) and the dataset is curated to fit hardware limits, even a small 1B model can deliver large gains from lightweight fine-tuning on one GPU.

Cornell Notes

A single-GPU LoRA fine-tuning run can turn Llama 3.2 (1B) into a much more reliable mental-health sentiment classifier. The workflow builds a labeled dataset (~53k rows), filters out long texts (>1,000 words), and caps examples per class to reduce skew and memory load. Each example becomes an instruction prompt that forces the model to “reply only with the category,” and TorchTune trains the instruct model using a LoRA configuration targeting attention modules. Baseline evaluation lands near 30% accuracy after cleaning invalid outputs, while the fine-tuned model reaches nearly 85% accuracy and improves precision/recall/F1 across categories. The result highlights how prompt constraints plus curated data and LoRA training can dramatically improve label adherence and classification quality.

Why does the pipeline filter the dataset by word count and cap examples per class?

Tokenization and training cost scale with sequence length, so very long statements can trigger out-of-memory errors even if the model can technically accept long inputs. The pipeline counts spaces in each statement as a proxy for word count, then removes samples with more than 1,000 words—keeping about 99.6% of examples. Because the dataset is also highly skewed across seven labels, it then takes at most 5,000 examples per status to rebalance the distribution and reduce training burden, producing a dataset roughly half the size of the original.

What makes the instruction prompt important for classification accuracy?

The prompt is designed to constrain output format: it lists the allowed categories and instructs the model to choose exactly one and “reply only with the category.” This matters because baseline generations can include explanations, punctuation, or labels not present in the class list. During evaluation, the pipeline cleans predictions by removing anything outside the allowed categories; the fine-tuned model largely avoids invalid outputs, which improves measured accuracy and classification metrics.

Why fine-tune the instruct model instead of the base model?

The base model is harder to get to follow the strict instruction format (“reply only with the category”). The pipeline aims for a fair comparison between base and fine-tuned performance under the same prompting rules, so it keeps the base model unchanged for baseline testing while fine-tuning the instruction-tuned variant to better follow the label-selection task.

What training configuration choices enable single-GPU fine-tuning?

The described setup uses TorchTune with a LoRA-style method applied to attention modules, with rank/alpha parameters and zero dropout. It runs on a single device (NVIDIA L4 is cited; T4 is possible with smaller batch size). Batch size is set to 4, and optimization uses AdamW with a cosine learning-rate schedule and warmup of 100 steps. Training is limited to one epoch for speed, taking about 35 minutes on the L4 for roughly 20,000 examples.

How is accuracy measured given that the model may output invalid categories?

Predictions are compared to true labels after cleaning. The pipeline strips any generated text that doesn’t match one of the seven class names, preventing spurious outputs from being counted as correct. This cleaning step is crucial for the baseline model, which otherwise produces formatting artifacts or extra text; the fine-tuned model improves because its outputs stay within the valid class set.
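Scoring cleaned predictions is then a standard classification-report computation. The sketch below uses scikit-learn as an assumption (the summary doesn’t name the scoring library), with made-up labels and predictions; invalid generations become empty strings that never match a true label:

```python
# Hedged scoring sketch using scikit-learn; tooling and data are assumed.
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Anxiety", "Depression", "Normal", "Anxiety"]
# Cleaned predictions: an invalid generation becomes "", counted as wrong.
y_pred = ["Anxiety", "Depression", "Normal", ""]

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(classification_report(y_true, y_pred, zero_division=0))
```

`zero_division=0` suppresses warnings for labels (like the empty string) that appear only on one side of the comparison.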

What performance change validates that fine-tuning worked?

Baseline classification lands around 30% accuracy with weak precision/recall across categories. After fine-tuning, accuracy rises to nearly 85%, and the classification report improves across every category, including stronger precision for depression and personality disorder. The tuned model also stops generating categories outside the allowed list, indicating better instruction adherence.

Review Questions

  1. How do filtering long texts and capping per-class examples jointly affect GPU memory usage and model learning for this single-label classification task?
  2. What role does the “reply only with the category” instruction play in both baseline evaluation and fine-tuned performance?
  3. Why does cleaning predictions to allowed class names matter when computing accuracy and classification reports?

Key Points

  1. Use TorchTune for LoRA-style single-device fine-tuning, with configurations that target attention modules.

  2. Build an instruction-style dataset where each training example forces the model to output exactly one label from a fixed set.

  3. Filter out very long statements (>1,000 words) and cap examples per class (e.g., 5,000) to avoid out-of-memory errors and reduce class skew.

  4. Evaluate baseline and fine-tuned models using the same prompt template and generation constraints, then clean outputs to allowed class names before scoring.

  5. Fine-tune the instruct model (not the base model) when strict output formatting is required for fair comparison.

  6. Run LoRA training with a practical batch size (e.g., 4 on an NVIDIA L4) and a cosine schedule with warmup (100 steps) for stable learning.

  7. Expect large gains when the task is well-defined and the dataset is curated; the reported jump is from ~30% to ~85% accuracy.

Highlights

A curated mental-health dataset plus LoRA fine-tuning on a single GPU can raise Llama 3.2 (1B) accuracy from about 30% to nearly 85%.
Explicit prompt constraints (“reply only with the category”) reduce invalid outputs and make accuracy measurement more meaningful.
Filtering statements over 1,000 words and capping examples per label prevents out-of-memory failures and helps with skewed class distributions.
On an NVIDIA L4 (24GB VRAM), the described one-epoch LoRA run takes roughly 35 minutes with batch size 4 for around 20,000 examples.
After fine-tuning, predictions stay within the seven allowed categories, and precision/recall/F1 improve across all labels.
