Falcon Soars to the Top: The New 40B LLM Rises Above the Rest
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Falcon has arrived as a new, from-scratch large language model family, anchored by a 40B parameter model, and it is already topping Hugging Face's Open LLM Leaderboard. The immediate significance is practical: benchmark placement suggests strong general capability, while architectural and inference-focused optimizations (including FlashAttention and other speedups) aim to make that capability usable in real deployments.
Falcon comes in two main sizes: a 40 billion parameter model and a 7 billion parameter model. The 40B release is notable because models of that scale are relatively rare in the open ecosystem, with only a few comparable entries mentioned (such as LLaMA 65B). Even without a paper yet, Hugging Face benchmarking results place Falcon at the top across multiple tasks, including ARC-style reasoning and HellaSwag. On HellaSwag, the transcript cites GPT-4 reaching 95 in a few-shot setup and GPT-3.5 landing around 85.5; Falcon's reported score of about 85.3 is close enough to be competitive with GPT-3.5, and its aggregate across tasks is what secures the leaderboard lead.
The 7B model also performs well within its size class, with the transcript comparing it against MPT-7B and reporting Falcon 7B base scores around 78.1 versus MPT-7B's 76.1 (as presented in the cited leaderboard view). The takeaway is that Falcon's base models look strong for their parameter budgets, though the creator repeatedly cautions that benchmarks depend on use case and that users should test directly.
Where the release gets contentious is licensing. Falcon is described as not fully free: if a company makes more than $1 million per year, a 10% royalty is expected. That has triggered debate online, and the transcript raises a strategic concern: whether businesses could restructure around the threshold by routing usage through an API provider that stays under the cap.
On Hugging Face, multiple variants are listed: Falcon 40B base, Falcon 40B instruct, Falcon 7B base, and Falcon 7B instruct. The instruct variants are central to the next story thread: early hands-on tests suggest the base models’ benchmark strength doesn’t automatically translate into better instruction-following. In the transcript’s code experiments, the 40B instruct model often produces refusal-style answers (e.g., declining to write an email to Sam Altman) and shows weak performance on certain reasoning and formatting tasks (including a miscalculation in an apples problem and difficulty with a haiku-in-a-tweet prompt). The 7B instruct model behaves differently—sometimes producing coherent, persuasive text where the 40B instruct model refuses—but it still struggles on the same kind of arithmetic reasoning.
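For readers who want to reproduce that kind of hands-on test, the sketch below shows the standard transformers loading path. The tiiuae/falcon-7b-instruct model ID and the trust_remote_code flag reflect how the models were published on Hugging Face at release; the prompt and sampling settings are illustrative assumptions, not the transcript's exact experiments.

```python
# Minimal sketch: querying a Falcon instruct variant with Hugging Face transformers.
# Requires transformers and accelerate; swap in tiiuae/falcon-40b-instruct only
# if you have the hardware discussed later in this summary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory versus fp32
    trust_remote_code=True,      # Falcon shipped custom modeling code at release
    device_map="auto",           # place layers across available GPUs
)

prompt = "Write a short, polite email requesting a product demo."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # Falcon's tokenizer defines no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```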
Overall, the release lands as a mixed picture: Falcon's base models look like serious contenders on public benchmarks, but the instruct-tuned versions appear to have been fine-tuned on less effective instruction-following data. The transcript's practical conclusion is that starting from Falcon 7B for further instruction fine-tuning, potentially using better curated datasets, may be the most productive path, while serving the 40B model remains hardware-intensive until quantization (such as 4-bit) becomes straightforward enough for broader use.
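On the quantization caveat: the sketch below shows the generic 4-bit loading path in transformers plus bitsandbytes. Whether it worked cleanly for Falcon 40B at the time is exactly what the transcript flags as unresolved, so treat this as the general recipe, with the bnb_4bit settings as common defaults rather than Falcon-specific guidance.

```python
# Sketch: the generic 4-bit loading path (transformers + bitsandbytes).
# The config values are common defaults, not Falcon-specific recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=bnb_config,
    trust_remote_code=True,  # required for Falcon's custom code at release
    device_map="auto",
)
# Rough memory arithmetic: 40B weights at 4 bits each is ~20GB, versus
# ~40GB in 8-bit and ~80GB in fp16, before activations and KV cache.
```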
Cornell Notes
Falcon is a newly released, from-scratch large language model family with two headline sizes: 40B and 7B parameters. Hugging Face leaderboard results place Falcon at or near the top across multiple tasks, including HellaSwag and reasoning-style evaluations, suggesting strong base-model capability. However, early instruction-tuned behavior (Falcon 40B instruct and Falcon 7B instruct) appears uneven in hands-on tests: the 40B instruct variant often refuses prompts and can miss arithmetic/formatting tasks, while the 7B instruct variant is more willing but still falters on some reasoning. Licensing adds a commercial constraint: a 10% royalty applies only after annual revenue exceeds $1 million. The practical implication is that Falcon’s base models look promising, but instruction fine-tuning strategy and dataset quality likely matter as much as raw parameter count.
- What makes Falcon different from “fine-tuned” releases, and why does that matter for performance expectations?
- How does Falcon’s reported benchmark performance compare to well-known models on HellaSwag?
- Why is the licensing model likely to affect real-world adoption?
- What pattern emerges when comparing Falcon base models versus Falcon instruct models in the transcript’s tests?
- What hardware and implementation hurdles are mentioned for running the Falcon 40B model?
- What does the transcript suggest as the best next step for users who want an instruction model?
Review Questions
- Which benchmark results in the transcript support the claim that Falcon’s base models are strong, and how close are they to GPT-3.5 on HellaSwag?
- How do the transcript’s hands-on examples differentiate Falcon 40B instruct behavior from Falcon 7B instruct behavior?
- What licensing threshold triggers the 10% royalty, and what operational strategy does the transcript speculate companies might use to manage it?
Key Points
1. Falcon is presented as a from-scratch pre-trained model family with 40B and 7B parameter variants, not a simple fine-tune.
2. Hugging Face leaderboard results place Falcon at the top across multiple tasks, with HellaSwag scores reported around 85.3 for Falcon in the cited few-shot setup.
3. The 7B model is reported to outperform MPT-7B within its size class in the transcript’s referenced leaderboard comparison.
4. Falcon’s license includes a 10% royalty once annual revenue exceeds $1 million, creating potential friction for commercial deployment.
5. Early instruction-tuned testing suggests the instruct variants can underperform expectations relative to the base models, especially on arithmetic and some formatting prompts.
6. Serving Falcon 40B is described as hardware-intensive: four GPUs and ~50GB of VRAM even in 8-bit, which is plausible given that 40B parameters at one byte each already account for roughly 40GB of weights, with implementation details further complicating quantized inference.
7. The transcript’s practical recommendation is to consider Falcon 7B as a starting point for further instruction fine-tuning using better curated datasets; a sketch of that workflow follows below.
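Since the closing recommendation is to instruction-tune Falcon 7B on better curated data, here is a hedged sketch of what that typically looks like with the peft library's LoRA adapters. The hyperparameters are illustrative placeholders; the query_key_value target reflects Falcon's fused attention projection, and the choice of training data is left to the reader.

```python
# Sketch: LoRA instruction fine-tuning of Falcon 7B with peft + transformers.
# Hyperparameters are illustrative placeholders, not recommendations from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Falcon shipped custom modeling code at release
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                # low-rank adapter dimension
    lora_alpha=32,                       # adapter scaling factor
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 7B total

# From here, train with a standard transformers Trainer loop on a curated
# instruction dataset; only the adapter weights are updated, which keeps the
# memory and compute cost far below full fine-tuning.
```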