
Mistral 7B - better than Llama 2? | Getting started, Prompt template & Comparison with Llama 2

Venelin Valkov·
5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Mistral 7B Instruct is claimed to outperform larger Llama 2 variants on standard benchmarks despite having about 7–7.3B parameters.

Briefing

Mistral 7B Instruct is positioned as a smaller model that can outperform larger Llama 2–class competitors, and hands-on tests in a Google Colab notebook largely back that up—especially for coding and multi-step reasoning—while Llama 2 still looks stronger for writing-style tasks and summarization.

The core performance claim comes straight from the Mistral 7B paper: despite having roughly 7.3 billion parameters, Mistral 7B Instruct is reported to surpass the larger 13B Llama 2 variant across the standard benchmarks the authors evaluated (the paper also cites wins over the even larger 34B Llama 1 on many of them). The paper attributes the gap to architectural choices aimed at efficiency and longer context handling. Two techniques get the spotlight: Grouped Query Attention, which is described as improving inference speed and reducing memory requirements so larger batch sizes are feasible during inference; and Sliding Window Attention, which helps the model work with longer sequences by restricting each token's attention to a fixed window rather than the full sequence. The transcript also notes that attention to middle portions of long text can be less reliable than attention to the beginning or end, and sliding-window methods are presented as a way to mitigate that.
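To make the sliding-window idea concrete, here is a minimal, framework-free sketch of the masking pattern (an illustration only, not Mistral's actual implementation; the function name and window size are made up for this example):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal attention mask where token i attends only to the
    previous `window` tokens (including itself)."""
    # mask[i][j] is True when token i may attend to token j
    return [
        [max(0, i - window + 1) <= j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
# Token 4 attends to tokens 2, 3, 4 only; earlier tokens are still reached
# indirectly through stacked layers, whose receptive field grows per layer.
```

Because each token looks at a fixed-size window instead of the full prefix, attention cost grows linearly with sequence length rather than quadratically, which is the efficiency argument the transcript summarizes.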

Beyond architecture, the model’s practical accessibility matters. Mistral 7B is released under the Apache 2.0 license and is available on Hugging Face from the original authors at Mistral AI. For deployment, the notebook uses Transformers plus supporting libraries (accelerate and bitsandbytes) and loads the instruct model with an automatic device map to place it on a T4 GPU. The setup quantizes most of the model to 8-bit to fit hardware constraints, keeping certain layers in 16-bit, which yields a workable memory footprint (about 5.5 GB RAM and roughly 8.1 GB VRAM on the referenced GPU).
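A minimal sketch of that loading setup, assuming the standard Transformers API and the `mistralai/Mistral-7B-Instruct-v0.1` model id (the notebook itself is not reproduced here, so treat the exact model id and settings as assumptions):

```python
def load_mistral_8bit(model_name: str = "mistralai/Mistral-7B-Instruct-v0.1"):
    """Load the instruct model in 8-bit with an automatic device map."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",  # let accelerate place layers on the available GPU
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,  # non-quantized layers stay in 16-bit
    )
    return tokenizer, model
```

Calling `load_mistral_8bit()` downloads the weights and places them on whatever GPU is available; the memory figures quoted above apply only to comparable T4-class hardware.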

In the comparison tasks, Mistral 7B Instruct produces generally strong outputs but doesn’t dominate every category. When asked to impersonate Dwight Schrute from The Office and draft an email, Mistral’s response is described as more generic and less convincing, while Llama 2 is judged to better match the character’s voice. For finance-themed reasoning, Mistral gives a plausible-sounding but non-specific answer.

Coding and reasoning show the clearest edge for Mistral. In Python function tasks—like computing the square of a sum—Mistral generates working code, while Llama 2’s version is reported to be wrong in a quick test. For list-splitting into three equal parts, both models appear to work, though Mistral is still preferred. For reading comprehension and table-based math, Mistral is reported to get the correct values and produce a correct percentage-increase calculation, while Llama 2’s outputs are described as incorrect or incomplete and Mistral’s arithmetic is framed as more reliable.
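The square-of-a-sum task is simple enough to state directly; the exact prompt wording is not given in the transcript, so this is just the kind of function being asked for:

```python
def square_of_sum(a: float, b: float) -> float:
    """Return the square of the sum of two numbers: (a + b) ** 2."""
    return (a + b) ** 2

print(square_of_sum(2, 3))  # (2 + 3) ** 2 = 25
```

A quick spot check like this is exactly how the transcript distinguishes the two models' outputs on the task.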

Overall, the transcript’s takeaway is a tradeoff: Mistral 7B Instruct looks better for coding and multi-step reasoning, while Llama 2 remains more convincing for summarization and writing-like tasks. The inference speed is described as broadly comparable, not dramatically faster or slower.

Cornell Notes

Mistral 7B Instruct is a ~7.3B parameter model that claims benchmark wins over the larger Llama 2 13B (with the paper also citing wins over the 34B Llama 1) using efficiency-focused attention methods. The paper highlights Grouped Query Attention for faster inference and lower memory use, plus Sliding Window Attention to better handle longer sequences. In a Colab-based setup, the model is loaded with Transformers and quantized to 8-bit using bitsandbytes, with some layers kept at 16-bit, running on a T4 GPU. Side-by-side tests suggest Mistral performs better on coding and multi-step reasoning tasks, while Llama 2 looks stronger for writing and summarization-style outputs. The comparison also includes table-driven reading comprehension and percentage calculations, where Mistral is reported to be more accurate.

What two attention mechanisms are credited for Mistral 7B’s efficiency and long-context behavior?

The transcript points to Grouped Query Attention and Sliding Window Attention. Grouped Query Attention is described as increasing inference speed and reducing memory requirements, enabling higher batch sizes during inference. Sliding Window Attention is presented as a way to support longer sequences; it also addresses a common issue where attention to the middle of long text can be less reliable than attention to the beginning or end, by limiting attention to windows rather than treating the full sequence uniformly.

How does the notebook make Mistral 7B practical on limited GPU memory?

It loads the model with Transformers using an automatic device map (device_map='auto') so it can place the model on the available T4 GPU. It then quantizes most of the model to 8-bit using bitsandbytes (with accelerate handling device placement), keeping certain layers in 16-bit, targeting about 5.5 GB RAM and about 8.1 GB VRAM on the referenced GPU.

Where does Mistral 7B Instruct outperform Llama 2 in the transcript’s task comparisons?

Mistral is favored on coding and reasoning. For example, when asked to write a Python function to compute the square of a sum of two numbers, Mistral’s function is described as correct on a quick test, while Llama 2’s function produces an incorrect result. In table-based reading comprehension and percentage calculations, Mistral is reported to compute correct values and a correct percentage increase, while Llama 2’s responses are described as incorrect or incomplete.
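The transcript does not give the exact table values, so the numbers below are placeholders; the percentage-increase formula being tested, however, is the standard one:

```python
def percentage_increase(old: float, new: float) -> float:
    """Percent change from old to new: (new - old) / old * 100."""
    return (new - old) / old * 100

print(percentage_increase(50, 62))  # 24.0 (% increase)
```

Checking a model's answer against this formula is the kind of arithmetic verification the comparison relies on.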

Where does Llama 2 look stronger than Mistral 7B in these tests?

Llama 2 is judged better for writing and summarization-style tasks. In an impersonation email task (Dwight Schrute), Mistral’s response is described as generic and less convincing, while Llama 2’s impersonation is considered more accurate. For summarizing benefits from a passage, the transcript says Llama 2 produces a clearer, more readable structured list than Mistral.

What role do benchmark claims and real-world tests play in the transcript’s conclusion?

The transcript starts with the paper’s claim that Mistral 7B Instruct outperforms larger Llama 2 models across standard benchmarks despite having fewer parameters. It then uses hands-on notebook tasks to test specific behaviors. The final conclusion blends both: benchmark positioning supports the idea of strong capability per parameter, while the task results refine where that capability shows up most clearly (coding/reasoning vs writing/summarization).

Review Questions

  1. Which architectural features in the Mistral 7B paper are linked to faster inference and better long-context handling, and how are they described?
  2. In the notebook setup, what quantization approach is used and what hardware footprint is reported for RAM and VRAM?
  3. Compare the transcript’s examples of where Mistral 7B is preferred versus where Llama 2 is preferred, citing at least one coding/reasoning task and one writing/summarization task.

Key Points

  1. Mistral 7B Instruct is claimed to outperform larger Llama 2 variants on standard benchmarks despite having about 7–7.3B parameters.

  2. Grouped Query Attention is described as improving inference speed and lowering memory use, enabling larger batch sizes during inference.

  3. Sliding Window Attention is presented as a mechanism for handling longer sequences and reducing reliance on less relevant middle text.

  4. The Colab deployment uses Transformers with an automatic device map and 8-bit quantization via bitsandbytes (with some layers kept in 16-bit).

  5. The transcript reports Mistral 7B doing better on coding correctness and multi-step table/math reasoning than Llama 2.

  6. Llama 2 is described as stronger for writing-style outputs such as impersonation and structured summarization.

  7. Inference speed is described as broadly comparable between the two models in these tests, not dramatically different.

Highlights

Mistral 7B’s paper claims wins over the larger 13B Llama 2 (and even the 34B Llama 1) using efficiency-focused attention methods rather than sheer parameter count.
The practical setup quantizes Mistral 7B to 8-bit to fit on a T4 GPU, targeting about 8.1 GB VRAM.
In the transcript’s coding tests, Mistral’s Python function for square-of-a-sum is described as working, while Llama 2’s version fails on a quick check.
For table-driven reading comprehension and percentage calculations, Mistral is reported to produce correct arithmetic where Llama 2’s outputs are described as wrong or incomplete.
Llama 2 is repeatedly favored for writing and summarization tasks, including a Dwight Schrute impersonation email.

Topics

  • Mistral 7B Instruct
  • Llama 2 Comparison
  • Grouped Query Attention
  • Sliding Window Attention
  • 8-bit Quantization

Mentioned

  • GPU
  • VRAM
  • RAM
  • T4
  • 8-bit
  • 16-bit