Faster LLM Inference: Speeding up Falcon 7b (with QLoRA adapter) Prediction Time

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Quantization strategy drives the largest latency gains: Transformers 8-bit loading reduced average inference time to about 3.04 seconds versus ~13.3 seconds baseline for Falcon 7B + QLoRA.

Briefing

Fine-tuned Falcon 7B inference speed can be cut dramatically by changing how the model is loaded—especially by running the quantized model in 8-bit. In a timed test using a single FAQ-style prompt (“What is your return policy?”) with a QLoRA adapter, the baseline setup averaged about 13.3 seconds per generation (with max_new_tokens set to 20). Switching from the original quantization configuration to Transformers’ built-in 4-bit loading reduced that to roughly 9.28 seconds, and moving to 8-bit dropped latency further to about 3.04 seconds while keeping the responses broadly similar.

The experiment starts with a Falcon 7B model plus a QLoRA adapter, loaded on a GPU instance with about 16GB VRAM (the model plus adapter occupy roughly 5.7GB). The setup uses Transformers’ causal language modeling stack, tokenization with left padding (to support batching later), and generation settings designed for speed and repeatability: max_new_tokens = 20 and temperature = 0 to avoid randomness. Each configuration is timed over five runs under inference mode, with caching enabled to speed up subsequent token generation.
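To make the setup concrete, here is a minimal sketch of the baseline loading and generation configuration, assuming a QLoRA adapter saved at a hypothetical local path ("falcon-7b-qlora-faq"); the identifiers are illustrative, not the author's exact code.

```python
import torch
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
)

BASE_MODEL = "tiiuae/falcon-7b"       # base checkpoint (assumed)
ADAPTER_DIR = "falcon-7b-qlora-faq"   # fine-tuned QLoRA adapter (hypothetical path)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"       # left padding so batched prompts line up for generation

# Baseline: QLoRA-paper-style 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # float16 as a safe default on a T4-class GPU
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER_DIR)  # attach the QLoRA adapter
model.eval()

generation_config = GenerationConfig(
    max_new_tokens=20,                 # fixed output length so runs stay comparable
    do_sample=False,                   # greedy decoding, matching temperature = 0 in the video
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,                    # reuse attention key/value states between tokens
)
```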

Quantization is the first major lever. The baseline uses a bitsandbytes configuration aligned with the QLoRA paper-style approach, then compares it against Transformers’ native 4-bit quantization path. That change yields a clear improvement, but the biggest gain comes from loading the same fine-tuned model in 8-bit—even though the adapter was trained in 4-bit. The 8-bit run is not only faster; it also produces outputs that remain “pretty much the same” as the 4-bit approach in the example shown.
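The two alternative loading paths compared above look roughly like this (a sketch, reusing BASE_MODEL from the snippet above; the adapter is attached afterwards with PeftModel.from_pretrained exactly as before):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Transformers' built-in 4-bit path (~9.28 s average in this test)
model_4bit = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)

# 8-bit loading (~3.04 s average), even though the adapter was trained in 4-bit
model_8bit = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    trust_remote_code=True,
)
```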

A second lever—batch inference—targets throughput when multiple prompts are available. By tokenizing several prompts together with padding and truncation, then generating in a single forward pass over a batch, the measured latency stays roughly comparable to single-prompt runs in this test. That suggests batching can speed up real workflows (more prompts processed per unit time) without sacrificing response quality, though the exact batch size depends on available GPU memory.
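A batched run along those lines might look like the sketch below, reusing the tokenizer, model, and generation settings from the earlier snippets; the prompt list and its size are illustrative.

```python
import torch

prompts = [
    "What is your return policy?",
    "Do you ship internationally?",
    "How long does delivery take?",
    "Can I change my order after it is placed?",
]

# Pad/truncate the prompts to a common length and move the batch to the GPU
inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, generation_config=generation_config)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```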

Other optimization attempts underperform in this specific setup. torch.compile, tested in combination with the model wrapper used here, fails to improve latency and even increases inference time across the same small number of trials. A “bonus” approach using Lit-Parrot, an ecosystem built around efficient GPT-style implementations with support for adapters, also doesn’t beat the Transformers-based approach for this case. Lit-Parrot’s inference time is about 3.87 seconds, model loading takes much longer (around 145 seconds), and the example response differs because that model isn’t trained on the FAQ dataset used for the earlier tests.
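For reference, the torch.compile attempt looks roughly like the sketch below (PyTorch 2.x, reusing the model and settings from the earlier snippets). Note that wrapping the whole model and then calling .generate() does not necessarily route decoding through the compiled forward pass in every Transformers/PyTorch combination, which is one reason such attempts can fail to pay off.

```python
import torch

compiled_model = torch.compile(model)  # requires PyTorch 2.x

inputs = tokenizer("What is your return policy?", return_tensors="pt").to(model.device)
with torch.inference_mode():
    _ = compiled_model.generate(**inputs, generation_config=generation_config)
```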

Overall, the most reliable speedup in these experiments comes from using Transformers to load Falcon 7B + QLoRA in 8-bit, then using batch inference when multiple prompts are ready. The results also highlight that not every “speed” technique generalizes: compile-time optimizations and alternative inference libraries can help in other architectures or wrappers, but they don’t automatically deliver gains here.

Cornell Notes

Falcon 7B fine-tuned with a QLoRA adapter can generate much faster when the model is loaded in 8-bit rather than the original quantization setup. In timed tests on a GPU instance, baseline inference averaged about 13.3 seconds per prompt (max_new_tokens = 20, temperature = 0). Transformers’ native 4-bit loading improved latency to about 9.28 seconds, while 8-bit loading cut it to roughly 3.04 seconds with broadly similar outputs. Batching multiple prompts together with padding/truncation kept latency roughly steady per run, improving throughput for multi-prompt workloads. Attempts like torch.compile (with the wrapper used here) and Lit-Parrot did not outperform the Transformers 8-bit approach in this experiment.

Why does max_new_tokens matter for measuring inference speed in these experiments?

Generation time scales strongly with how many new tokens the model must produce. The tests fix max_new_tokens = 20 so comparisons across quantization modes (baseline, 4-bit, 8-bit) reflect speed differences from loading/compute rather than wildly different output lengths. With a larger max_new_tokens, the latency gap between configurations could widen or change because decoding dominates runtime.

How did quantization changes affect latency for Falcon 7B + QLoRA?

Using the original bitsandbytes configuration (baseline) averaged about 13.3 seconds per prompt. Switching to Transformers’ built-in 4-bit loading reduced it to about 9.28 seconds. Loading the same fine-tuned model in 8-bit cut latency further to about 3.04 seconds, while the example responses stayed broadly similar to the 4-bit case.

What role does caching play during generation in these timing runs?

The generation loop enables cache usage to speed up subsequent token generation steps. With caching, the model avoids recomputing earlier attention states for each new token, reducing per-token overhead. The timing procedure runs generation multiple times (five runs) under inference mode to capture stable averages while still benefiting from caching.
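A timing loop along those lines, reusing the model, tokenizer, and generation_config from the earlier sketches, could look like this (the prompt and run count mirror the description above):

```python
import time
import torch

prompt = "What is your return policy?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

timings = []
for _ in range(5):                       # five timed runs, as in the experiment
    start = time.perf_counter()
    with torch.inference_mode():
        model.generate(**inputs, generation_config=generation_config)
    torch.cuda.synchronize()             # wait for GPU work to finish before stopping the clock
    timings.append(time.perf_counter() - start)

print(f"average generation time: {sum(timings) / len(timings):.2f} s")
```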

What does batch inference change, and why might it speed up real workloads even if per-run latency looks similar?

Batch inference packs multiple prompts into one batch using padding and truncation, then generates outputs together. In this test, latency for batched prompts looked roughly similar to single-prompt runs, but the throughput improves because several prompts are processed in one go. The feasible batch size depends on GPU memory (the experiment suggests four prompts could fit, possibly more).

Why didn’t torch.compile and Lit-Parrot deliver better speed here?

torch.compile was tried with the model wrapper used in the setup and produced worse latency in the small set of trials. Lit-Parrot required heavy setup and long model loading (about 145 seconds for initialization in the shown run) and delivered inference around 3.87 seconds, slower than the Transformers 8-bit result (~3.04 seconds). Additionally, the Lit-Parrot example response differed because that model wasn’t trained on the FAQ dataset used for the earlier QLoRA evaluation.

What practical recipe emerges from the results?

For Falcon 7B with a QLoRA adapter, load with Transformers in 8-bit for the biggest latency reduction, keep generation settings tight (e.g., max_new_tokens = 20, temperature = 0 for deterministic speed tests), and use batch inference when multiple prompts are available. Then verify output quality, since speedups can come with changes in quantization behavior.

Review Questions

  1. If max_new_tokens were doubled, which part of the pipeline would likely dominate runtime and why?
  2. What evidence in the timing results supports choosing 8-bit over 4-bit for this Falcon 7B + QLoRA setup?
  3. How would you decide whether batch inference is beneficial for your own workload given GPU memory constraints?

Key Points

  1. Quantization strategy drives the largest latency gains: Transformers 8-bit loading reduced average inference time to about 3.04 seconds versus ~13.3 seconds baseline for Falcon 7B + QLoRA.
  2. Transformers’ native 4-bit loading improved latency too, bringing it down to roughly 9.28 seconds, but 8-bit delivered the biggest jump.
  3. Speed tests were controlled with max_new_tokens = 20 and temperature = 0, making token generation length and randomness consistent across runs.
  4. Enabling generation caching helps reduce per-token overhead during decoding, supporting faster repeated generation steps.
  5. Batch inference can improve throughput by processing multiple prompts together; per-run latency may stay similar while total work per unit time increases.
  6. torch.compile did not help in this specific wrapper-based setup and increased inference time in the tested trials.
  7. Lit-Parrot underperformed the Transformers 8-bit approach here, with slower inference (~3.87 seconds) and much longer initialization time (~145 seconds).

Highlights

Loading Falcon 7B + QLoRA in 8-bit via Transformers cut average inference time from ~13.3 seconds to ~3.04 seconds for a 20-token generation.
Transformers’ built-in 4-bit loading improved speed (to ~9.28 seconds) but still lagged behind 8-bit.
Batch inference kept latency roughly steady while enabling multiple prompts to be handled together, improving throughput.
torch.compile and Lit-Parrot didn’t beat the Transformers 8-bit result in this experiment, with compile-time changes increasing latency and Lit-Parrot showing slower inference plus heavy initialization.
Output quality should be checked after quantization changes, since speedups can alter response details even when answers remain broadly similar.

Topics

Mentioned

  • LLM
  • QLoRA
  • VRAM
  • GPU
  • FAQ
  • GPU memory
  • T4