Fine-tuning LLM with QLoRA on Single GPU: Training Falcon-7b on ChatBot Support FAQ Dataset

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

QLoRA enables Falcon 7B fine-tuning on a single GPU by freezing the base model and training only low-rank adapter weights (about 0.13% of parameters in this run).

Briefing

Fine-tuning Falcon 7B on a single GPU is practical even with a tiny, FAQ-style dataset—using QLoRA to train only a small slice of parameters. The workflow takes only minutes end-to-end on a Google Cloud Tesla T4 setup (the training run itself is reported at about 7.5 minutes) and produces noticeably more on-dataset, support-like answers after training. The key takeaway is that parameter-efficient fine-tuning can steer a general-purpose open model toward a specific chatbot domain without needing large clusters or massive datasets.

The process starts by setting up the environment for 4-bit QLoRA training: bitsandbytes with PyTorch 2.0.1, Hugging Face Transformers, Accelerate, and a QLoRA configuration that enables double quantization while computing in 16-bit precision. Falcon 7B weights are loaded in 4-bit, and a tokenizer is initialized from the Falcon 7B checkpoint, with the padding token aligned to the EOS token. Instead of updating the full model, QLoRA freezes the base model and trains low-rank adapter matrices (LoRA) on top—so only about 0.13% of parameters become trainable. Gradient checkpointing is enabled to trade compute for reduced GPU memory usage.
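A minimal sketch of that setup, assuming the tiiuae/falcon-7b checkpoint and illustrative QLoRA hyperparameters (the rank, alpha, and target modules below are assumptions, not values taken from the video):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "tiiuae/falcon-7b"  # assumed checkpoint name

# Load the base model in 4-bit with double quantization, computing in 16-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # align padding token with EOS

# Trade compute for memory, then prepare the quantized model for training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# Train only low-rank adapters on top of the frozen 4-bit base model
lora_config = LoraConfig(
    r=16,                                # assumed adapter rank
    lora_alpha=32,                       # assumed scaling factor
    target_modules=["query_key_value"],  # assumed Falcon attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```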

Training data comes from an e-commerce FAQ chatbot dataset on Hugging Face, credited to Muhammad Mahmud. The example used here is intentionally small: 79 question–answer pairs (about 20 KB) extracted from a JSON file and flattened into a simple question/answer structure. Each example is formatted into an instruction-style prompt using a “Human:” / “Assistant:” template, then tokenized with truncation and padding. The dataset is shuffled and mapped into model-ready features (input IDs, attention masks, and related fields).
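Continuing the sketch above, the data preparation might look roughly like this; the local file name, nested JSON layout, and max_length are assumptions made for illustration:

```python
import json
from datasets import Dataset

with open("ecommerce_faq_chatbot.json") as f:   # assumed local file name
    raw = json.load(f)

# Flatten the nested JSON into root-level question/answer pairs (79 rows here)
rows = [{"question": q["question"], "answer": q["answer"]}
        for q in raw["questions"]]              # assumed JSON layout

def generate_prompt(example):
    # Instruction-style template used at both training and inference time
    return f"Human: {example['question']}\nAssistant: {example['answer']}"

def tokenize(example):
    # Tokenize the formatted prompt with truncation and padding
    return tokenizer(generate_prompt(example),
                     truncation=True, padding="max_length", max_length=256)

dataset = Dataset.from_list(rows).shuffle(seed=42).map(tokenize)
```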

The fine-tuning run uses a single epoch with batch size 1, a cosine learning-rate scheduler, and a paged AdamW optimizer in 8-bit—an efficiency-focused choice integrated through the Transformers stack. Training is performed with a language-modeling objective that predicts next tokens (no masked language modeling), and TensorBoard tracks loss and learning-rate behavior. Loss decreases smoothly and the learning rate decays toward zero, with the run reported at about 7.5 minutes for this small dataset.
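A sketch of how that run could be wired up with the Transformers Trainer; the batch size, epoch count, scheduler, optimizer choice, and causal-LM collator follow the description above, while the learning rate, warm-up ratio, and output directory are assumptions:

```python
import transformers

training_args = transformers.TrainingArguments(
    output_dir="falcon-7b-qlora-faq",   # assumed output directory
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=2e-4,                 # assumed value
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                  # assumed warm-up setting
    optim="paged_adamw_8bit",           # paged AdamW optimizer in 8-bit
    logging_steps=1,
    report_to="tensorboard",
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    # Plain causal language modeling: predict next tokens, no masked LM
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # recommended while gradient checkpointing is on
trainer.train()
```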

Before training, the base Falcon 7B tends to fall into a repetitive, nonsensical loop for the “create an account” prompt, repeatedly asking for an email and password. After applying the saved QLoRA adapter (stored as a small ~19 MB adapter artifact plus configuration), the same prompt yields a more coherent, support-style response: directing the user to the signup page and describing the registration/confirmation-email flow. Additional FAQ prompts—tracking orders, return policy, and clearance/final-sale returns—also produce more aligned answers, often referencing typical policy constraints (e.g., return windows, condition requirements) and escalating to customer support when appropriate.

Overall, the experiment demonstrates that QLoRA can turn Falcon 7B into a more reliable FAQ chatbot on a single GPU, provided the prompt format matches the training data and the adapter is applied correctly at inference time.

Cornell Notes

QLoRA fine-tuning makes it feasible to adapt Falcon 7B on a single GPU using a very small FAQ dataset. The setup loads Falcon 7B in 4-bit with double quantization and trains only low-rank adapter weights on top of a frozen base model, updating about 0.13% of parameters. A dataset of 79 e-commerce chatbot question–answer pairs is flattened, formatted as “Human:”/“Assistant:” prompts, tokenized, and trained for one epoch with a cosine scheduler and paged AdamW (8-bit). After training, the model’s responses shift from repetitive, off-target behavior to more coherent, support-like answers aligned with the dataset. This matters because it shows domain steering without large compute or massive datasets.

Why does QLoRA make single-GPU fine-tuning realistic for Falcon 7B?

QLoRA freezes the original Falcon 7B weights and trains only small low-rank adapter matrices (LoRA) inserted into the model. In this run, the trainable portion is about 0.13% of all parameters. Combined with 4-bit loading (bitsandbytes) and double quantization, the method reduces memory enough to fit training on a Tesla T4-class environment while still allowing learning through the adapter weights.
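One way to confirm how small the trainable slice is on the PEFT-wrapped model from the setup sketch above (PEFT also offers model.print_trainable_parameters() for the same purpose); the printed numbers here are illustrative:

```python
def count_trainable(model):
    # Equivalent in spirit to PEFT's model.print_trainable_parameters()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

count_trainable(model)
# e.g. "trainable: 4,718,592 / 3,613,463,424 (0.13%)"  (illustrative numbers)
```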

What dataset structure and prompt format are used to teach the chatbot behavior?

The training data is 79 question–answer pairs from the e-commerce FAQ chatbot dataset described above. The JSON is flattened so each row has a root-level “question” and “answer.” Each example is formatted into a prompt template with “Human:” followed by the question and “Assistant:” followed by the answer, then tokenized with padding/truncation for training.

How is the model evaluated before and after fine-tuning, and what changes?

Inference uses the same “Human:”/“Assistant:” prompt style and generation settings (max new tokens, temperature, top_p). Before fine-tuning, the “create an account” prompt produces a repetitive loop asking for email/password repeatedly. After applying the trained QLoRA adapter, the response becomes more coherent and policy-like—directing the user to the signup page and describing registration and confirmation email steps. Similar improvements appear for order tracking and return-policy questions.
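A sketch of that generation step, reusing the model and tokenizer from the earlier setup; the concrete values for max_new_tokens, temperature, and top_p are assumptions, since the source only notes that these settings are used:

```python
import torch

prompt = "Human: How can I create an account?\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,   # assumed
        temperature=0.7,      # assumed
        top_p=0.7,            # assumed
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```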

Which training choices help the run converge efficiently on limited hardware?

The run uses gradient checkpointing to reduce memory, batch size 1, one epoch, and a cosine learning-rate scheduler with warm-up. It also uses paged AdamW in 8-bit (via Transformers integration) to cut optimizer memory/compute overhead. Training loss decreases steadily and the learning rate decays toward near-zero, indicating stable convergence even with the tiny dataset.

What exactly gets saved and reused for inference after training?

After training, the base model remains unchanged; the saved artifact is the QLoRA adapter (reported as about 19 MB) plus a JSON config containing QLoRA hyperparameters like alpha, rank, and target modules. At inference, the adapter is loaded and applied on top of the Falcon 7B base model (with the same quantization configuration), then generation is run with the prompt template.
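A sketch of saving the adapter and re-applying it at inference time; the output path is an assumption, and bnb_config refers to the quantization configuration from the setup sketch:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the adapter weights plus adapter_config.json (~19 MB reported)
trainer.model.save_pretrained("trained-model")

# Later, at inference: reload the 4-bit base model and attach the adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",              # assumed base checkpoint
    quantization_config=bnb_config,  # same 4-bit config used for training
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "trained-model")
```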

Review Questions

  1. How does QLoRA reduce the number of trainable parameters compared with full fine-tuning, and what percentage was trainable in this run?
  2. What role do prompt formatting (“Human:”/“Assistant:”) and tokenization choices play in aligning outputs with the FAQ dataset?
  3. Why might generation-time behavior differ from training-time behavior even when the same model and adapter are used?

Key Points

  1. QLoRA enables Falcon 7B fine-tuning on a single GPU by freezing the base model and training only low-rank adapter weights (about 0.13% of parameters in this run).

  2. Loading Falcon 7B in 4-bit with double quantization and computing in 16-bit precision reduces memory enough for practical training on a Tesla T4-class setup.

  3. A tiny FAQ dataset (79 question–answer pairs) can still steer outputs toward support-like responses when prompts are formatted consistently with training.

  4. Training stability came from gradient checkpointing, a cosine learning-rate schedule with warm-up, and paged AdamW in 8-bit.

  5. After training, applying the saved QLoRA adapter (about 19 MB) changes responses from repetitive loops to more coherent, dataset-aligned answers for account creation, returns, and order tracking.

  6. Inference uses the same “Human:”/“Assistant:” prompt template and generation settings, but generation is slower than training because it involves iterative token generation and sampling/search.

Highlights

  • Only ~0.13% of Falcon 7B parameters were updated thanks to QLoRA adapters, making single-GPU training feasible.
  • A base model that looped on “create an account” produced a coherent signup-page response after the adapter was applied.
  • Paged AdamW (8-bit) plus a cosine scheduler helped the loss decrease smoothly even with just 79 training examples.
  • The saved output is primarily the adapter (~19 MB) and its config; the base Falcon 7B weights are reused unchanged.

Topics

  • QLoRA Fine-Tuning
  • Falcon 7B
  • Single GPU Training
  • Chatbot FAQ Dataset
  • Paged AdamW

Mentioned

  • LLM
  • QLoRA
  • GPU
  • LoRA
  • EOS
  • T4
  • JSON
  • FAQ
  • AWS
  • MLX
  • MLExpert Pro
  • Kaggle
  • Vora
  • paged AdamW
  • 8-bit
  • 16-bit
  • TensorBoard