Faster LLM Inference: Speeding up Falcon 7b (with QLoRA adapter) Prediction Time
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Fine-tuned Falcon 7B inference speed can be cut dramatically by changing how the model is loaded—especially by running the quantized model in 8-bit. In a timed test using a single FAQ-style prompt (“What is your return policy?”) with a QLoRA adapter, the baseline setup averaged about 13.3 seconds per generation (with max_new_tokens set to 20). Switching from the original quantization configuration to Transformers’ built-in 4-bit loading reduced that to roughly 9.28 seconds, and moving to 8-bit dropped latency further to about 3.04 seconds while keeping the responses broadly similar.
The experiment starts with a Falcon 7B model plus a QLoRA adapter, loaded on a GPU instance with about 16GB VRAM (5.7GB for the model plus adapter). The setup uses Transformers’ causal language modeling stack, tokenization with left padding (to support batching later), and generation settings designed for speed and repeatability: max_new_tokens = 20 and temperature = 0 to avoid randomness. Each configuration is timed multiple times (five runs) under inference mode, with caching enabled to speed up subsequent generation.
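The timing protocol above can be sketched with a small stdlib helper; `generate_fn` here is an illustrative stand-in for the real `model.generate(...)` call, which the experiments run under `torch.inference_mode()`:

```python
import statistics
import time

def time_generation(generate_fn, n_runs=5):
    """Average wall-clock seconds over n_runs calls, mirroring the five
    timed runs per configuration described above."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()  # in the experiments: model.generate(...) under torch.inference_mode()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

# Stand-in workload so the sketch runs anywhere; swap in the real generation call.
avg_seconds = time_generation(lambda: sum(range(100_000)))
print(f"average over {5} runs: {avg_seconds:.4f}s")
```

Averaging several runs smooths out first-call warm-up effects (CUDA kernel launches, cache population) that would otherwise distort a single measurement.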
Quantization is the first major lever. The baseline uses a bitsandbytes configuration aligned with the QLoRA paper-style approach, then compares it against Transformers’ native 4-bit quantization path. That change yields a clear improvement, but the biggest gain comes from loading the same fine-tuned model in 8-bit—even though the adapter was trained in 4-bit. The 8-bit run is not only faster; it also produces outputs that remain “pretty much the same” as the 4-bit approach in the example shown.
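The three loading paths can be sketched as differences in loading configuration. This is a sketch of the Hugging Face Transformers/bitsandbytes API, not the exact code from the video; the base model ID is assumed to be `tiiuae/falcon-7b`, and in the full setup a QLoRA adapter would be applied on top (e.g. via PEFT):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "tiiuae/falcon-7b"  # assumed base model; the QLoRA adapter sits on top

# Baseline: QLoRA-paper-style 4-bit config (~13.3 s average in the test).
qlora_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=qlora_config,
    device_map="auto", trust_remote_code=True,
)

# Transformers' native 4-bit loading (~9.28 s in the test):
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, load_in_4bit=True, device_map="auto", trust_remote_code=True,
)

# 8-bit loading, the fastest option measured (~3.04 s), even though the
# adapter itself was trained in 4-bit:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, load_in_8bit=True, device_map="auto", trust_remote_code=True,
)
```

Each variant loads the same weights; only the quantization path changes, which is why the outputs stay broadly similar while latency differs.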
A second lever—batch inference—targets throughput when multiple prompts are available. By tokenizing several prompts together with padding and truncation, then generating in a single forward pass over a batch, the measured latency stays roughly comparable to single-prompt runs in this test. That suggests batching can speed up real workflows (more prompts processed per unit time) without sacrificing response quality, though the exact batch size depends on available GPU memory.
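A minimal sketch of the batching step, assuming a `model` and `tokenizer` loaded as in the setup above. The `chunk` helper and the prompt strings are illustrative additions, not from the original; `generate_batch` shows the padded single-pass generation the text describes:

```python
def chunk(prompts, batch_size):
    """Split prompts into batches; pick batch_size to fit GPU memory."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

def generate_batch(model, tokenizer, prompts, max_new_tokens=20):
    """One forward pass over a whole batch of prompts (hypothetical helper)."""
    import torch  # deferred so the chunking sketch runs without a GPU stack
    # Left padding (set on the tokenizer) keeps the generated tokens contiguous
    # with the end of each prompt in the batch.
    inputs = tokenizer(prompts, padding=True, truncation=True,
                       return_tensors="pt").to(model.device)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                use_cache=True)
    return tokenizer.batch_decode(output, skip_special_tokens=True)

batches = chunk(["What is your return policy?", "Do you ship overseas?",
                 "How do I track my order?"], batch_size=2)
```

If per-batch latency stays close to single-prompt latency, throughput scales roughly with batch size until GPU memory becomes the limit.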
Other optimization attempts underperform in this specific setup. torch.compile, tested with the model wrapper used here, fails to improve latency and actually increases inference time across the same small number of trials. A bonus approach using Lit-Parrot—an ecosystem built around efficient GPT-style implementations with adapter support—also doesn’t beat the Transformers-based approach for this case: its inference time is about 3.87 seconds, model loading takes far longer (around 145 seconds), and the example response differs because that model isn’t trained on the FAQ dataset used for the earlier tests.
Overall, the most reliable speedup in these experiments comes from using Transformers to load Falcon 7B + QLoRA in 8-bit, then using batch inference when multiple prompts are ready. The results also highlight that not every “speed” technique generalizes: compile-time optimizations and alternative inference libraries can help in other architectures or wrappers, but they don’t automatically deliver gains here.
Cornell Notes
Falcon 7B fine-tuned with a QLoRA adapter can generate much faster when the model is loaded in 8-bit rather than the original quantization setup. In timed tests on a GPU instance, baseline inference averaged about 13.3 seconds per prompt (max_new_tokens = 20, temperature = 0). Transformers’ native 4-bit loading improved latency to about 9.28 seconds, while 8-bit loading cut it to roughly 3.04 seconds with broadly similar outputs. Batching multiple prompts together with padding/truncation kept latency roughly steady per run, improving throughput for multi-prompt workloads. Alternatives such as torch.compile (with the wrapper used here) and Lit-Parrot did not outperform the Transformers 8-bit approach in this experiment.
Why does max_new_tokens matter for measuring inference speed in these experiments?
How did quantization changes affect latency for Falcon 7B + QLoRA?
What role does caching play during generation in these timing runs?
What does batch inference change, and why might it speed up real workloads even if per-run latency looks similar?
Why didn’t torch.compile and Lit-Parrot deliver better speed here?
What practical recipe emerges from the results?
Review Questions
- If max_new_tokens were doubled, which part of the pipeline would likely dominate runtime and why?
- What evidence in the timing results supports choosing 8-bit over 4-bit for this Falcon 7B + QLoRA setup?
- How would you decide whether batch inference is beneficial for your own workload given GPU memory constraints?
Key Points
1. Quantization strategy drives the largest latency gains: Transformers 8-bit loading reduced average inference time to about 3.04 seconds versus ~13.3 seconds baseline for Falcon 7B + QLoRA.
2. Transformers’ native 4-bit loading improved latency too, bringing it down to roughly 9.28 seconds, but 8-bit delivered the biggest jump.
3. Speed tests were controlled with max_new_tokens = 20 and temperature = 0, making token generation length and randomness consistent across runs.
4. Enabling generation caching helps reduce per-token overhead during decoding, supporting faster repeated generation steps.
5. Batch inference can improve throughput by processing multiple prompts together; per-run latency may stay similar while total work per unit time increases.
6. torch.compile did not help in this specific wrapper-based setup and increased inference time in the tested trials.
7. Lit-Parrot underperformed the Transformers 8-bit approach here, with slower inference (~3.87 seconds) and much longer initialization time (~145 seconds).