Falcon Soars to the Top: The New 40B LLM Rises Above the Rest
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Falcon has arrived as a new, from-scratch large language model family, anchored by a 40B parameter model, and it is already topping Hugging Face's Open LLM Leaderboard. The immediate significance is practical: benchmark placement suggests strong general capability, while architectural and inference-focused optimizations (including FlashAttention and other speedups) aim to make that capability usable in real deployments.
Falcon comes in two main sizes: a 40 billion parameter model and a 7 billion parameter model. The 40B release is notable because models of that scale are relatively rare in the open ecosystem, with only a few comparable entries mentioned (such as LLaMA 65B). Even without a paper yet, Hugging Face benchmarking results place Falcon at the top across multiple tasks, including ARC-style reasoning and HellaSwag. On HellaSwag, the transcript cites GPT-4 reaching 95 in a few-shot setup and GPT-3.5 landing around 85.5; Falcon's reported score of about 85.3 is close enough to be competitive with GPT-3.5, and its aggregate across tasks is what secures the leaderboard lead.
The 7B model also performs well within its size class, with the transcript comparing it against MPT-7B and reporting Falcon 7B base scores around 78.1 versus MPT-7B's 76.1 (as presented in the cited leaderboard view). The takeaway is that Falcon's base models look strong for their parameter budgets, though the creator repeatedly cautions that benchmarks depend on use case and that users should test directly.
Where the release gets contentious is licensing. Falcon is described as not fully free: if a company makes more than $1 million per year, a 10% royalty is expected. That has triggered debate online, and the transcript raises a strategic concern: whether businesses could restructure around the threshold by routing usage through an API provider that stays under the cap.
On Hugging Face, multiple variants are listed: Falcon 40B base, Falcon 40B instruct, Falcon 7B base, and Falcon 7B instruct. The instruct variants are central to the next story thread: early hands-on tests suggest the base models’ benchmark strength doesn’t automatically translate into better instruction-following. In the transcript’s code experiments, the 40B instruct model often produces refusal-style answers (e.g., declining to write an email to Sam Altman) and shows weak performance on certain reasoning and formatting tasks (including a miscalculation in an apples problem and difficulty with a haiku-in-a-tweet prompt). The 7B instruct model behaves differently—sometimes producing coherent, persuasive text where the 40B instruct model refuses—but it still struggles on the same kind of arithmetic reasoning.
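For readers who want to reproduce that kind of hands-on test, the sketch below shows the standard transformers loading path. The tiiuae/falcon-7b-instruct model ID and the trust_remote_code flag reflect how the models were published on Hugging Face at release; the prompt and sampling settings are illustrative assumptions, not the transcript's exact experiments.

```python
# Minimal sketch: querying a Falcon instruct variant with Hugging Face transformers.
# Requires transformers and accelerate; swap in tiiuae/falcon-40b-instruct only
# if you have the hardware discussed later in this summary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory versus fp32
    trust_remote_code=True,      # Falcon shipped custom modeling code at release
    device_map="auto",           # place layers across available GPUs
)

prompt = "Write a short, polite email requesting a product demo."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # Falcon's tokenizer defines no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```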
Overall, the release lands as a mixed picture: Falcon's base models look like serious contenders on public benchmarks, but the instruct-tuned versions appear to have been fine-tuned on less effective instruction-following data. The transcript's practical conclusion is that starting from Falcon 7B for further instruction fine-tuning, potentially using better curated datasets, may be the most productive path, while serving the 40B model remains hardware-intensive until quantization (such as 4-bit) becomes straightforward enough for broader use.
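On the quantization caveat: the sketch below shows the generic 4-bit loading path in transformers plus bitsandbytes. Whether it worked cleanly for Falcon 40B at the time is exactly what the transcript flags as unresolved, so treat this as the general recipe, with the bnb_4bit settings as common defaults rather than Falcon-specific guidance.

```python
# Sketch: the generic 4-bit loading path (transformers + bitsandbytes).
# The config values are common defaults, not Falcon-specific recommendations.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=bnb_config,
    trust_remote_code=True,  # required for Falcon's custom code at release
    device_map="auto",
)
# Rough memory arithmetic: 40B weights at 4 bits each is ~20GB, versus
# ~40GB in 8-bit and ~80GB in fp16, before activations and KV cache.
```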
Cornell Notes
Falcon is a newly released, from-scratch large language model family with two headline sizes: 40B and 7B parameters. Hugging Face leaderboard results place Falcon at or near the top across multiple tasks, including HellaSwag and reasoning-style evaluations, suggesting strong base-model capability. However, early instruction-tuned behavior (Falcon 40B instruct and Falcon 7B instruct) appears uneven in hands-on tests: the 40B instruct variant often refuses prompts and can miss arithmetic/formatting tasks, while the 7B instruct variant is more willing but still falters on some reasoning. Licensing adds a commercial constraint: a 10% royalty applies only after annual revenue exceeds $1 million. The practical implication is that Falcon’s base models look promising, but instruction fine-tuning strategy and dataset quality likely matter as much as raw parameter count.
- What makes Falcon different from “fine-tuned” releases, and why does that matter for performance expectations?
- How does Falcon’s reported benchmark performance compare to well-known models on HellaSwag?
- Why is the licensing model likely to affect real-world adoption?
- What pattern emerges when comparing Falcon base models versus Falcon instruct models in the transcript’s tests?
- What hardware and implementation hurdles are mentioned for running the Falcon 40B model?
- What does the transcript suggest as the best next step for users who want an instruction model?
Review Questions
- Which benchmark results in the transcript support the claim that Falcon’s base models are strong, and how close are they to GPT-3.5 on HellaSwag?
- How do the transcript’s hands-on examples differentiate Falcon 40B instruct behavior from Falcon 7B instruct behavior?
- What licensing threshold triggers the 10% royalty, and what operational strategy does the transcript speculate companies might use to manage it?
Key Points
1. Falcon is presented as a from-scratch pre-trained model family with 40B and 7B parameter variants, not a simple fine-tune.
2. Hugging Face leaderboard results place Falcon at the top across multiple tasks, with HellaSwag scores reported around 85.3 for Falcon in the cited few-shot setup.
3. The 7B model is reported to outperform MPT-7B within its size class in the transcript’s referenced leaderboard comparison.
4. Falcon’s license includes a 10% royalty once annual revenue exceeds $1 million, creating potential friction for commercial deployment.
5. Early instruction-tuned testing suggests the instruct variants can underperform expectations relative to the base models, especially on arithmetic and some formatting prompts.
6. Serving Falcon 40B is described as hardware-intensive: four GPUs and ~50GB of VRAM even in 8-bit, which is plausible given that 40B parameters at one byte each already account for roughly 40GB of weights, with implementation details further complicating quantized inference.
7. The transcript’s practical recommendation is to consider Falcon 7B as a starting point for further instruction fine-tuning using better curated datasets; a sketch of that workflow follows below.
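Since the closing recommendation is to instruction-tune Falcon 7B on better curated data, here is a hedged sketch of what that typically looks like with the peft library's LoRA adapters. The hyperparameters are illustrative placeholders; the query_key_value target reflects Falcon's fused attention projection, and the choice of training data is left to the reader.

```python
# Sketch: LoRA instruction fine-tuning of Falcon 7B with peft + transformers.
# Hyperparameters are illustrative placeholders, not recommendations from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Falcon shipped custom modeling code at release
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                # low-rank adapter dimension
    lora_alpha=32,                       # adapter scaling factor
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 7B total

# From here, train with a standard transformers Trainer loop on a curated
# instruction dataset; only the adapter weights are updated, which keeps the
# memory and compute cost far below full fine-tuning.
```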