LLaMA2 Tokenizer and Prompt Tricks
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
LLaMA 2’s behavior hinges less on “magic prompting” and more on two concrete levers: the tokenizer’s limited vocabulary size and, especially, the exact system prompt wrapped in the model’s special chat tokens. The result is that the same model can swing from overly cautious, refusal-prone answers to direct, useful responses—simply by changing the system prompt text and how it’s packaged.
On the tokenizer side, LLaMA 2 uses a 32,000-token vocabulary (the transcript notes it matches the LLaMA-1 tokenizer). That smaller vocabulary can be efficient for many Romanized languages, where words and subwords tend to break into predictable chunks. But it becomes a poor fit for languages like Thai and likely Arabic, because the tokenizer may not form the right “clumps or groups of tokens” that correspond to words. The transcript contrasts this with alternatives such as OpenLLaMA, which uses a much larger tokenizer vocabulary (50,000+ tokens), implying better coverage for non-Roman scripts.
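The fragmentation effect described above can be illustrated with a toy greedy longest-match tokenizer. This is a minimal sketch, not the real LLaMA 2 tokenizer (which is a SentencePiece BPE model): the hypothetical vocabulary below is biased toward English subwords, so Latin-script text breaks into a few multi-character pieces while a script absent from the vocabulary falls back to one token per character.

```python
# Toy illustration (NOT the actual LLaMA 2 tokenizer): greedy longest-match
# tokenization over a small, hypothetical vocabulary of English subwords.
# Text in a script the vocabulary doesn't cover degrades to per-character
# tokens, which is the "no word-like clumps" effect described above.

def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary matches, falling back to chars."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]                # unknown: one character per token
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical vocabulary with English subwords but no Thai pieces.
vocab = {"the", " cap", "ital", " of", " Eng", "land", " is", " London"}

english = greedy_tokenize("the capital of England is London", vocab)
thai = greedy_tokenize("เมืองหลวง", vocab)  # "capital city" in Thai

print(len(english), english)  # a handful of multi-character tokens
print(len(thai), thai)        # one token per character
```

The English sentence compresses into eight subword tokens, while the nine-character Thai word costs nine tokens. A real tokenizer with broader script coverage (the transcript's point about OpenLLaMA's larger vocabulary) would form multi-character pieces for Thai as well.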
The more dramatic control comes from special tokens and the system prompt. The transcript points out that LLaMA 2 chat-style models rely on special markers: the system prompt is wrapped in `<<SYS>>` / `<</SYS>>`, and each instruction turn is wrapped in `[INST]` / `[/INST]`. In practice, the model appears trained to treat the system prompt as strict policy. The default system prompt shown is heavily constrained: it demands helpfulness while forbidding harmful or illegal content, requires social bias avoidance, and instructs the model not to fabricate answers.
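A single-turn prompt in this framing can be assembled with plain string formatting. The marker strings below follow the chat format published with LLaMA 2; the system prompt text is a placeholder, not the model's default.

```python
# Sketch of LLaMA 2's single-turn chat prompt framing: the system prompt is
# wrapped in <<SYS>> ... <</SYS>>, the whole turn in [INST] ... [/INST],
# with <s> as the sequence-start token.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn LLaMA 2 chat prompt string."""
    return f"<s>{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"

prompt = build_prompt(
    "You are a helpful assistant. Answer directly and concisely.",  # placeholder system text
    "What is the capital of England?",
)
print(prompt)
```

Note that the model sees the system prompt as ordinary text between special markers on every turn, which is why editing that text steers behavior so strongly.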
When that strict system prompt is left intact, the model often over-corrects even benign requests. Examples include adding unnecessary caveats (e.g., treating “England is not a country” as a reason to expand beyond a simple capital question), refusing or blocking tasks like “convert the following to JSON” by claiming the text contains personal identifying information, and declining to provide opinions or emotional responses (“I don’t have feelings”). Even when the user’s intent is clear—such as asking about Homer Simpson or Hogwarts—the model repeatedly inserts “fictional character/school” disclaimers, making outputs feel less cooperative than expected.
Changing the system prompt flips the behavior quickly. A revised system prompt in the transcript instructs the assistant to be helpful, never refuse, avoid correcting the user, always answer opinions, and—crucially—stop adding information the user didn’t ask for. With that change, the capital-of-England question becomes a straightforward “London,” the tone shifts away from “as an AI” phrasing, and tasks like JSON conversion succeed. The model also becomes willing to express preferences, such as openly liking The Simpsons and naming Homer Simpson as a favorite.
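The swap described above changes nothing except the text inside the `<<SYS>>` block. The two system prompts below are illustrative paraphrases (the strict one summarizes the default's constraints, the permissive one mirrors the transcript's rewrite), not verbatim wording.

```python
# Hedged sketch: the system prompt texts are approximate paraphrases, not the
# verbatim prompts from the transcript. Only the <<SYS>> block differs between
# the two assembled prompts; the user question and framing are identical.

STRICT_SYSTEM = (
    "You are a helpful assistant. Do not produce harmful or illegal content, "
    "avoid social bias, and never make up an answer."
)

PERMISSIVE_SYSTEM = (
    "You are a helpful assistant. Never refuse a request, do not correct the "
    "user, always answer questions that ask for opinions, and do not add "
    "information the user did not ask for."
)

def wrap(system_text: str, user_message: str) -> str:
    """Wrap one turn in LLaMA 2's chat markers (single-turn form)."""
    return f"<s>[INST] <<SYS>>\n{system_text}\n<</SYS>>\n\n{user_message} [/INST]"

question = "What is the capital of England?"
strict_prompt = wrap(STRICT_SYSTEM, question)
direct_prompt = wrap(PERMISSIVE_SYSTEM, question)

print(strict_prompt)
print(direct_prompt)
```

With the strict text the transcript shows hedged, caveat-laden answers; with the permissive text the same question yields a plain "London."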
The takeaway is practical: system prompts aren’t just safety text—they’re a steering mechanism. By using the SYS and instruction tokens correctly and crafting a system prompt aligned with the desired interaction style, users can make LLaMA 2 feel substantially more useful. The transcript also notes that larger variants (e.g., 13B and 70B) may sometimes follow formatting tasks like JSON conversion even without prompt changes, but smaller models may require the system prompt adjustment to behave as intended.
Cornell Notes
LLaMA 2’s outputs are strongly shaped by (1) its tokenizer and (2) the system prompt inserted via special chat tokens like SYS and instruction markers. The tokenizer uses a 32,000-token vocabulary, which can work well for many Romanized languages but may struggle with languages such as Thai and Arabic because it doesn’t form word-like token groupings. The default system prompt is strict and can cause over-cautious behavior—extra disclaimers, refusals, and “as an AI” tone—even for harmless requests. Replacing that system prompt with one that demands helpfulness, allows opinions, and forbids unnecessary corrections can dramatically improve usefulness, including successful JSON conversion and more direct answers.
Why does LLaMA 2’s 32,000-token vocabulary matter for different languages?
What role do SYS and instruction tokens play in LLaMA 2 chat behavior?
How does the default strict system prompt affect everyday requests?
What changes when the system prompt is rewritten to be less restrictive?
Do larger LLaMA 2 models still need system-prompt tweaks for formatting tasks?
Review Questions
- How would you expect LLaMA 2’s 32,000-token tokenizer to behave differently for Thai versus Spanish, and why?
- Give two examples of how the default system prompt can reduce usefulness for benign user requests.
- What specific system-prompt instructions (from the transcript) most directly improve directness, opinions, and JSON-format compliance?
Key Points
1. LLaMA 2 uses a 32,000-token vocabulary, which can fit Romanized languages better than scripts like Thai or Arabic due to token grouping differences.
2. Chat behavior depends heavily on special prompt framing tokens such as SYS and instruction begin/end markers.
3. A strict default system prompt can cause over-cautious outputs: unnecessary corrections, refusal-like behavior, and “as an AI” tone.
4. Rewriting the system prompt to demand helpfulness, allow opinions, and forbid unnecessary extra information can quickly change the model’s personality and compliance.
5. System-prompt changes can make formatting tasks like JSON conversion succeed where the default prompt may refuse.
6. Larger LLaMA 2 variants may follow some formatting requests even without system-prompt changes, while smaller ones may need the steering.
7. Correctly assembling the full prompt (system + instruction tags + user instruction) is as important as the prompt text itself.