
Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Fine-tune Meta's Llama 3 8B Instruct for a financial RAG-style Q&A task on a single GPU: convert question–context–answer records into the model's chat-template format, filter by token length, train a LoRA adapter on a 4-bit-quantized base with completion-only loss masking, then merge and publish the result.

Briefing

Fine-tuning Meta's Llama 3 8B Instruct on a domain-specific Q&A dataset can be done on a single GPU by combining 4-bit quantization with a LoRA adapter (which the transcript's captions garble as "war adapter"). The practical payoff is a model that answers financial questions in a format closer to the target dataset and behaves more reliably than the untouched base model, without the cost of updating all 8B parameters.

The workflow starts by building a training dataset from question–context–answer pairs. The example dataset is “Financial Q&A 10K” from Hugging Face, with about 7,000 records and fields including question, context, answer, plus extra columns like filing and ticker that get ignored for this run. Each record is converted into a chat-style prompt using the tokenizer’s chat template, then tokenized to measure length. To keep training efficient and stable, examples exceeding 512 tokens are dropped; the remaining data is sampled down and split into roughly 4,000 training examples, 500 validation, and 400 test.
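
A minimal sketch of this formatting step, assuming the Hugging Face `datasets` and `transformers` APIs; the repo ids below are assumptions based on the description (the transcript does not spell out exact identifiers), and `format_example` is a hypothetical helper:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed repo ids based on the description above; adjust to the actual ones.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
dataset = load_dataset("virattt/financial-qa-10K", split="train")

def format_example(row):
    # Question plus retrieved context as the user turn, ground truth as the assistant turn.
    messages = [
        {"role": "user", "content": f"{row['question']}\n\nContext: {row['context']}"},
        {"role": "assistant", "content": row["answer"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    # Token count is kept so overlong examples can be filtered out later.
    return {"text": text, "n_tokens": len(tokenizer(text).input_ids)}

dataset = dataset.map(format_example, remove_columns=["filing", "ticker"])
```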

On the modeling side, the base system is the Llama 3 8B Instruct model loaded in 4-bit using the nf4 quantization scheme. Because the tokenizer for this model lacks a padding token in the transcript's setup, a new "<|pad|>" token is added and the embedding matrix is resized accordingly. The transcript highlights a failure mode seen during earlier attempts: without a padding token, both the base and the fine-tuned model could produce repetitive output or generation that ran on indefinitely.
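
A sketch of the loading step using `transformers` and `bitsandbytes`; the model id is an assumption, as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repo id

# 4-bit nf4 quantization, per the transcript's description.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# The stock tokenizer has no pad token; add one and grow the embedding matrix.
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
```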

Instead of full fine-tuning, training updates only a small fraction of parameters via LoRA. The adapter targets linear layers across the attention and MLP components (including query/key/value projections and other linear modules), with a rank of 32 and alpha of 16. With this approach, only about 1.34% (roughly 84 million) of the model’s ~8B parameters are trained, making the run feasible on consumer hardware. The transcript reports training on a single NVIDIA T4, with the adapter training taking about two hours per run.
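
A sketch of the adapter configuration with `peft`; the exact `target_modules` list is an assumption consistent with "linear layers across attention and MLP" (Llama-style module names), and `lora_dropout` is a common default the transcript does not confirm:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    # Linear layers in the attention and MLP blocks, per the description above.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```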

Before and after training, the model is evaluated using a generation pipeline that removes the ground-truth answer from the prompt and asks the model to produce up to 128 new tokens. A key implementation detail improves both speed and correctness of the loss: the training collator masks labels for the prompt portion (setting them to -100) so the loss is computed only on the completion tokens, not the already-provided input text.
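
The generation side of this evaluation might look like the sketch below (the collator is shown later, after the corresponding Q&A); it assumes the `model` and `tokenizer` objects from the earlier steps and a hypothetical `answer` helper:

```python
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def answer(question, context):
    # The ground-truth answer is omitted from the prompt; the model must generate it.
    messages = [{"role": "user", "content": f"{question}\n\nContext: {context}"}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = generator(prompt, max_new_tokens=128, return_full_text=False)
    return out[0]["generated_text"]
```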

After training, the adapter is merged back into the base model and the resulting model is pushed to Hugging Face Hub as “Llama 3 8B Instruct Finance R” with sharded weights capped at 5 GB. Qualitative comparisons on sampled test prompts suggest the fine-tuned model produces more compact, dataset-aligned answers and avoids some of the verbose or off-format behavior seen in the untrained baseline. The overall message is that domain fine-tuning for RAG-style Q&A can be practical on one GPU when quantization, LoRA, careful token-length filtering, and completion-only loss masking are combined.
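
A hedged sketch of the merge-and-publish step with `peft`; the adapter path and Hub repo name are placeholders, not the transcript's exact values:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the base model in full precision before merging.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float16
)
base.resize_token_embeddings(len(tokenizer))  # match the vocab grown by <|pad|>

merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()

# Shard the weights so each file stays under 5 GB, as described above.
merged.push_to_hub("your-username/Llama-3-8B-Instruct-Finance-R", max_shard_size="5GB")
tokenizer.push_to_hub("your-username/Llama-3-8B-Instruct-Finance-R")
```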

Cornell Notes

The transcript lays out a single-GPU recipe for adapting Llama 3 8B Instruct to a financial RAG-style Q&A task. It builds a chat-formatted dataset from Financial Q&A 10K, tokenizes each example, drops sequences longer than 512 tokens, and splits into train/validation/test. The base model is loaded in 4-bit (nf4) and a padding token is added to prevent repetitive generation. Instead of updating all weights, LoRA trains only ~1.34% of parameters by targeting linear layers in attention and MLP blocks. Training uses a completion-only loss mask (labels set to -100 for the prompt) and the adapter is merged into the base model before uploading to Hugging Face Hub.

Why add a padding token and resize embeddings when fine-tuning Llama 3 8B Instruct?

In the transcript's setup, the tokenizer for Llama 3 8B Instruct does not include a pad token. Adding a "<|pad|>" token and resizing the model's embedding matrix prevents generation issues observed during earlier runs, most notably repetitive output and, in some cases, generation that appears to continue indefinitely. The padding token also stabilizes batching behavior when training with multiple examples per batch.

What makes LoRA/adapter fine-tuning workable on a single GPU here?

Full fine-tuning of an 8B model is typically too expensive for a single GPU. The transcript instead freezes the base model and trains only LoRA adapter weights. With rank r=32 and alpha=16, training updates about 1.34% of parameters (~84M out of ~8B), targeting linear layers across attention (including query/key/value projections) and the MLP blocks. This keeps memory and compute within the T4's limits.
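
The ~1.34% figure can be sanity-checked directly on the PEFT-wrapped model from the earlier sketch:

```python
# PEFT's built-in report of trainable vs. total parameters.
model.print_trainable_parameters()

# Equivalent manual count; the transcript cites ~84M of ~8B (~1.34%).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```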

How does the training code ensure loss is computed only on the model’s answer, not the prompt?

A completion-only language modeling approach is used. The collator masks prompt tokens by setting their labels to -100, so the loss is calculated only for tokens after the prompt’s final token (the transcript refers to using the final header ID as the boundary). This avoids penalizing the model for reproducing the input text and can speed up training.
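
One way to implement this is TRL's completion-only collator; a sketch assuming the `trl` library, with the Llama 3 assistant header used as the response boundary (our assumption about the exact template string, matching the transcript's "final header ID" remark):

```python
from trl import DataCollatorForCompletionOnlyLM

# Labels up to and including this header are set to -100,
# so the loss covers only the assistant's answer tokens.
response_template = "<|start_header_id|>assistant<|end_header_id|>"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer
)
```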

Why filter out examples longer than 512 tokens?

The transcript tokenizes each formatted chat example and builds a token-count histogram. The dataset is heavily skewed toward shorter sequences, but a few examples exceed 512 tokens. Those longer examples are removed to keep training efficient and to avoid pushing sequences beyond the configured max length (max_seq_length=512), which helps prevent instability and reduces wasted compute.
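
A short sketch of the filter, assuming a `datasets` Dataset carrying the `n_tokens` column computed during formatting (a hypothetical column name from the earlier sketch):

```python
MAX_SEQ_LENGTH = 512

# Inspect the skew described above, then drop the long tail.
lengths = dataset["n_tokens"]
print(f"max length: {max(lengths)}, over limit: "
      f"{sum(1 for n in lengths if n > MAX_SEQ_LENGTH)}")

dataset = dataset.filter(lambda row: row["n_tokens"] <= MAX_SEQ_LENGTH)
```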

What evaluation method is used to compare the base model vs the fine-tuned model?

Both models are run through a text-generation pipeline with the ground-truth answer removed from the prompt. The pipeline uses generation with a cap of 128 new tokens. The transcript reports qualitative comparisons on sampled test prompts, noting that the fine-tuned model’s outputs are more aligned with the dataset’s answer style and less prone to the base model’s verbosity or formatting drift.

What happens after training, and why merge the adapter?

After training, the adapter is saved and then merged into the base model to produce a single consolidated model. The transcript notes the merge was done on a P100 because of the T4's memory constraints. The merged model and tokenizer are then uploaded to Hugging Face Hub as "Llama 3 8B Instruct Finance R," with weights sharded so each file stays under 5 GB.

Review Questions

  1. What specific generation failure mode was observed when fine-tuning without adding a padding token, and how does adding <|pad|> address it?
  2. How does masking labels to -100 for prompt tokens change the training objective compared with standard next-token loss over the entire sequence?
  3. Which LoRA target modules (attention and MLP linear layers) are selected in the transcript, and how do rank/alpha settings influence the number of trainable parameters?

Key Points

  1. Build a domain dataset by converting question–context–answer records into the model’s chat-template format, then tokenize to measure length.

  2. Load Llama 3 8B Instruct in 4-bit (nf4) to fit on a single GPU, and add a padding token if the tokenizer lacks one to prevent repetitive generation.

  3. Use LoRA/adapter fine-tuning to train only a small fraction of parameters (about 1.34% in this setup) by targeting linear layers in attention and MLP blocks.

  4. Filter training examples by token length (drop sequences >512 tokens) to keep training stable and efficient.

  5. Compute loss only on completion tokens by masking prompt labels to -100, so the model isn’t penalized for reproducing the input.

  6. Evaluate by removing the ground-truth answer from the prompt and generating up to a fixed number of new tokens (128 here), then compare outputs qualitatively.

  7. Merge the trained adapter into the base model and upload the consolidated model to Hugging Face Hub for easier deployment in RAG pipelines.

Highlights

A single-GPU fine-tuning path combines 4-bit nf4 quantization with LoRA adapters, training only ~84M of ~8B parameters.
Adding a missing padding token is treated as essential to stop repetitive or runaway generation behavior during fine-tuning.
Completion-only loss masking (labels set to -100 for prompt tokens) focuses training on the answer portion and can speed up learning.
Token-length filtering (dropping examples over 512 tokens) keeps the training run efficient and avoids sequence-length problems.
After training, merging the adapter into the base model produces a deployable single model uploaded to Hugging Face Hub as “Llama 3 8B Instruct Finance R.”
