LLaMA2 Tokenizer and Prompt Tricks
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
LLaMA 2’s behavior hinges less on “magic prompting” and more on two concrete levers: the tokenizer’s limited vocabulary size and, especially, the exact system prompt wrapped in the model’s special chat tokens. The result is that the same model can swing from overly cautious, refusal-prone answers to direct, useful responses—simply by changing the system prompt text and how it’s packaged.
On the tokenizer side, LLaMA 2 uses a 32,000-token vocabulary (the transcript notes it matches the LLaMA-1 tokenizer). That smaller vocabulary can be efficient for many Romanized languages, where words and subwords tend to break into predictable chunks. But it becomes a poor fit for languages like Thai and likely Arabic, because the tokenizer may not form the right “clumps or groups of tokens” that correspond to words. The transcript contrasts this with alternatives such as OpenLLaMA, which uses a much larger tokenizer vocabulary (50,000+ tokens), implying better coverage for non-Roman scripts.
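The fragmentation effect described above can be illustrated with a toy greedy longest-match tokenizer. This is a minimal sketch, not the real LLaMA 2 tokenizer (which is a SentencePiece BPE model): the hypothetical vocabulary below is biased toward English subwords, so Latin-script text breaks into a few multi-character pieces while a script absent from the vocabulary falls back to one token per character.

```python
# Toy illustration (NOT the actual LLaMA 2 tokenizer): greedy longest-match
# tokenization over a small, hypothetical vocabulary of English subwords.
# Text in a script the vocabulary doesn't cover degrades to per-character
# tokens, which is the "no word-like clumps" effect described above.

def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary matches, falling back to chars."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]                # unknown: one character per token
        tokens.append(match)
        i += len(match)
    return tokens

# Hypothetical vocabulary with English subwords but no Thai pieces.
vocab = {"the", " cap", "ital", " of", " Eng", "land", " is", " London"}

english = greedy_tokenize("the capital of England is London", vocab)
thai = greedy_tokenize("เมืองหลวง", vocab)  # "capital city" in Thai

print(len(english), english)  # a handful of multi-character tokens
print(len(thai), thai)        # one token per character
```

The English sentence compresses into eight subword tokens, while the nine-character Thai word costs nine tokens. A real tokenizer with broader script coverage (the transcript's point about OpenLLaMA's larger vocabulary) would form multi-character pieces for Thai as well.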
The more dramatic control comes from special tokens and the system prompt. The transcript points out that LLaMA 2 chat-style models rely on special markers: the system prompt is wrapped in `<<SYS>>` / `<</SYS>>`, and each instruction turn is wrapped in `[INST]` / `[/INST]`. In practice, the model appears trained to treat the system prompt as strict policy. The default system prompt shown is heavily constrained: it demands helpfulness while forbidding harmful or illegal content, requires social bias avoidance, and instructs the model not to fabricate answers.
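A single-turn prompt in this framing can be assembled with plain string formatting. The marker strings below follow the chat format published with LLaMA 2; the system prompt text is a placeholder, not the model's default.

```python
# Sketch of LLaMA 2's single-turn chat prompt framing: the system prompt is
# wrapped in <<SYS>> ... <</SYS>>, the whole turn in [INST] ... [/INST],
# with <s> as the sequence-start token.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a single-turn LLaMA 2 chat prompt string."""
    return f"<s>{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"

prompt = build_prompt(
    "You are a helpful assistant. Answer directly and concisely.",  # placeholder system text
    "What is the capital of England?",
)
print(prompt)
```

Note that the model sees the system prompt as ordinary text between special markers on every turn, which is why editing that text steers behavior so strongly.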
When that strict system prompt is left intact, the model often over-corrects even benign requests. Examples include adding unnecessary caveats (e.g., treating “England is not a country” as a reason to expand beyond a simple capital question), refusing or blocking tasks like “convert the following to JSON” by claiming the text contains personal identifying information, and declining to provide opinions or emotional responses (“I don’t have feelings”). Even when the user’s intent is clear—such as asking about Homer Simpson or Hogwarts—the model repeatedly inserts “fictional character/school” disclaimers, making outputs feel less cooperative than expected.
Changing the system prompt flips the behavior quickly. A revised system prompt in the transcript instructs the assistant to be helpful, never refuse, avoid correcting the user, always answer opinions, and—crucially—stop adding information the user didn’t ask for. With that change, the capital-of-England question becomes a straightforward “London,” the tone shifts away from “as an AI” phrasing, and tasks like JSON conversion succeed. The model also becomes willing to express preferences, such as openly liking The Simpsons and naming Homer Simpson as a favorite.
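The swap described above changes nothing except the text inside the `<<SYS>>` block. The two system prompts below are illustrative paraphrases (the strict one summarizes the default's constraints, the permissive one mirrors the transcript's rewrite), not verbatim wording.

```python
# Hedged sketch: the system prompt texts are approximate paraphrases, not the
# verbatim prompts from the transcript. Only the <<SYS>> block differs between
# the two assembled prompts; the user question and framing are identical.

STRICT_SYSTEM = (
    "You are a helpful assistant. Do not produce harmful or illegal content, "
    "avoid social bias, and never make up an answer."
)

PERMISSIVE_SYSTEM = (
    "You are a helpful assistant. Never refuse a request, do not correct the "
    "user, always answer questions that ask for opinions, and do not add "
    "information the user did not ask for."
)

def wrap(system_text: str, user_message: str) -> str:
    """Wrap one turn in LLaMA 2's chat markers (single-turn form)."""
    return f"<s>[INST] <<SYS>>\n{system_text}\n<</SYS>>\n\n{user_message} [/INST]"

question = "What is the capital of England?"
strict_prompt = wrap(STRICT_SYSTEM, question)
direct_prompt = wrap(PERMISSIVE_SYSTEM, question)

print(strict_prompt)
print(direct_prompt)
```

With the strict text the transcript shows hedged, caveat-laden answers; with the permissive text the same question yields a plain "London."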
The takeaway is practical: system prompts aren’t just safety text—they’re a steering mechanism. By using the SYS and instruction tokens correctly and crafting a system prompt aligned with the desired interaction style, users can make LLaMA 2 feel substantially more useful. The transcript also notes that larger variants (e.g., 13B and 70B) may sometimes follow formatting tasks like JSON conversion even without prompt changes, but smaller models may require the system prompt adjustment to behave as intended.
Cornell Notes
LLaMA 2’s outputs are strongly shaped by (1) its tokenizer and (2) the system prompt inserted via special chat tokens like SYS and instruction markers. The tokenizer uses a 32,000-token vocabulary, which can work well for many Romanized languages but may struggle with languages such as Thai and Arabic because it doesn’t form word-like token groupings. The default system prompt is strict and can cause over-cautious behavior—extra disclaimers, refusals, and “as an AI” tone—even for harmless requests. Replacing that system prompt with one that demands helpfulness, allows opinions, and forbids unnecessary corrections can dramatically improve usefulness, including successful JSON conversion and more direct answers.
Why does LLaMA 2’s 32,000-token vocabulary matter for different languages?
What role do SYS and instruction tokens play in LLaMA 2 chat behavior?
How does the default strict system prompt affect everyday requests?
What changes when the system prompt is rewritten to be less restrictive?
Do larger LLaMA 2 models still need system-prompt tweaks for formatting tasks?
Review Questions
- How would you expect LLaMA 2’s 32,000-token tokenizer to behave differently for Thai versus Spanish, and why?
- Give two examples of how the default system prompt can reduce usefulness for benign user requests.
- What specific system-prompt instructions (from the transcript) most directly improve directness, opinions, and JSON-format compliance?
Key Points
1. LLaMA 2 uses a 32,000-token vocabulary, which can fit Romanized languages better than scripts like Thai or Arabic due to token grouping differences.
2. Chat behavior depends heavily on special prompt framing tokens such as SYS and instruction begin/end markers.
3. A strict default system prompt can cause over-cautious outputs: unnecessary corrections, refusal-like behavior, and “as an AI” tone.
4. Rewriting the system prompt to demand helpfulness, allow opinions, and forbid unnecessary extra information can quickly change the model’s personality and compliance.
5. System-prompt changes can make formatting tasks like JSON conversion succeed where the default prompt may refuse.
6. Larger LLaMA 2 variants may follow some formatting requests even without system-prompt changes, while smaller ones may need the steering.
7. Correctly assembling the full prompt (system + instruction tags + user instruction) is as important as the prompt text itself.