OpenLLaMA: Open-Source Reproduction of Meta AI's LLaMA for Commercial Use. Run in Google Colab.

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenLLaMA provides open checkpoints (including a 300B-token version) trained on RedPajama, and reproducing behavior depends on using the same checkpoint scale.

Briefing

OpenLLaMA (a 7B-parameter, open-source LLaMA-style model) can be run in Google Colab using Hugging Face Transformers, but getting usable text depends heavily on how generation is implemented. The model’s released checkpoints—trained on RedPajama—are available at least at 200B and 300B token training scales, and the walkthrough focuses on the 300B-token checkpoint. While the repository provides the weights and training details, the practical takeaway is that straightforward Transformers “generate” prompting produced garbled output on a T4-class GPU (16GB VRAM), forcing a switch to a custom decoding/sampling approach.

On the evaluation side, the OpenLLaMA release includes comparisons against Meta’s original LLaMA across common benchmarks (e.g., CB/F1 and accuracy-style metrics). The reported results show OpenLLaMA is often competitive—sometimes outperforming LLaMA on specific measures—though it can also lag on certain tasks. The broader point is that OpenLLaMA is not just a weight dump: it comes with training context (RedPajama data, training to hundreds of billions of tokens) and benchmark-style evidence that the model is functioning.

For local execution, the workflow starts by cloning the OpenLLaMA repository and then downloading the Hugging Face-compatible weights from the included folder structure (weights aren’t usable just by pointing at the repo root). Dependencies are installed, the model is loaded for causal language modeling, and the tokenizer is taken from the Transformers library. The run is configured for CUDA (GPU) and uses float16 loading; attempts to load in 8-bit failed in this setup, suggesting compatibility or tooling constraints.
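
As a rough illustration of that loading step, a minimal sketch in Transformers might look like the following. The model path is a placeholder for wherever the downloaded Hugging Face-compatible weights actually live, and the exact tokenizer class may differ from what the walkthrough uses.

```python
# Minimal loading sketch, assuming the HF-compatible OpenLLaMA weights have
# already been downloaded locally. MODEL_PATH is a placeholder, not the repo root.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/open_llama_7b_300bt_weights"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)  # tokenizer choice may differ in the walkthrough
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,  # float16 worked on the T4 in this setup
).to("cuda")
model.eval()
```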

The first generation attempt uses standard generation settings (e.g., max new tokens and temperature) with a typical prompt format. Token IDs are produced from the tokenizer, attention masks are passed to the model, and inference runs in PyTorch inference mode. Even though the model generates text, the decoded output is described as “not working as expected,” indicating that default generation/prompting choices weren’t aligned with how this model behaves.
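
A sketch of that first attempt is below; the prompt, max_new_tokens, and temperature values are assumptions, not the walkthrough's exact settings.

```python
# Default generate() attempt: tokenize the prompt, pass input_ids and
# attention_mask, and decode whatever comes back.
prompt = "What is the tallest building in the world?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=128,
        temperature=0.7,   # illustrative values, not the walkthrough's
        do_sample=True,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```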

To fix that, the walkthrough borrows a chat-oriented implementation from a separate OpenLLaMA 7B hands-on repository by Tom Misawa. The key change is custom top-k sampling (k=10): at each step, the code restricts candidate next tokens to the top 10 by probability, applies softmax over that subset, and samples from the resulting distribution. This manual token-by-token loop uses cached model states on the first pass and then feeds only the latest token on subsequent passes, stopping on the end-of-sequence token or when the max token budget is reached.
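
A minimal sketch of that per-step sampling choice is shown below; the function name and tensor shapes are illustrative rather than taken from the referenced repository.

```python
# Top-k sampling step: keep the 10 most likely next tokens, renormalize with
# softmax over that subset, and sample one of them. `logits` is the model's
# output for the last position, shape (vocab_size,).
import torch

def sample_top_k(logits: torch.Tensor, k: int = 10) -> int:
    top_values, top_indices = torch.topk(logits, k)   # top-k logits and their token ids
    probs = torch.softmax(top_values, dim=-1)         # softmax over the k candidates only
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top-k subset
    return top_indices[choice].item()                 # map back to a vocabulary id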

With top-k sampling, one factual prompt about the world’s tallest building returns a plausible answer (Burj Khalifa). But other prompts—like asking for a favorite phrase in a “microG scroll from the office” style format, or open-ended investment advice—produce incoherent or seemingly copied responses. The overall conclusion is pragmatic: OpenLLaMA is genuinely free and open-source to run, but reliable outputs require careful decoding (and likely better prompt formatting), not just a plug-and-play call to Transformers’ default generation.

Cornell Notes

OpenLLaMA is a 7B-parameter, open-source LLaMA-style model trained on RedPajama, with checkpoints at 200B and 300B tokens. In a Google Colab + Hugging Face Transformers setup, loading the model on a CUDA GPU (T4, 16GB VRAM) works with float16, but default Transformers generation produced poor, garbled text. Switching to a custom decoding loop—specifically top-k sampling with k=10—improved results, including a factual answer about the world’s tallest building. Even then, some prompts still yield nonsensical or seemingly copied outputs, suggesting that both sampling strategy and prompt formatting matter for this model.

What OpenLLaMA checkpoint and training data does the walkthrough rely on, and why does it matter for reproduction?

The setup uses the 300B-token checkpoint from the OpenLLaMA repository. The model is trained on the RedPajama dataset, which is open and large-scale. Using the same checkpoint matters because different training lengths (the repo also mentions a 200B-token checkpoint) can change behavior, quality, and how the model responds under the same inference code.

Why did the initial Hugging Face Transformers generation produce “not working as expected” output?

The walkthrough reports that a standard generate-based approach—tokenizing a prompt, passing input_ids and attention_mask, and using typical generation settings—returned text that was clearly incorrect or garbled. The likely issue is a mismatch between default generation/prompting assumptions and what this specific model/checkpoint expects in practice.

What decoding change improved results, and how does top-k sampling work here?

The code switches to top-k sampling with k=10. At each generation step, it takes the top 10 candidate next tokens by probability, applies softmax over those candidates, then samples one token using a multinomial draw. This is implemented in a token-by-token loop that stops on the end-of-sequence token or when max new tokens is reached.

How does the custom generation loop reduce compute across tokens?

On the first pass, the model consumes the full input sequence and uses cached state. On subsequent passes, it feeds only the most recent token ID (rather than re-sending the entire prompt), while sampling the next token from the model’s logits. This is the typical “cached decoding” pattern for faster autoregressive generation.
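
A sketch of that cached decoding pattern, reusing the sample_top_k helper from the earlier snippet, is below. Variable names and defaults are illustrative, not copied from the referenced repository.

```python
# Cached autoregressive decoding: process the full prompt once, then feed only
# the newest token together with past_key_values on each later step.
import torch

@torch.inference_mode()
def generate_with_cache(model, tokenizer, prompt: str, max_new_tokens: int = 128, k: int = 10) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    generated = []

    for _ in range(max_new_tokens):
        outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values             # reuse cached states next step
        next_token = sample_top_k(outputs.logits[0, -1], k)   # top-k sample from the last position
        if next_token == tokenizer.eos_token_id:              # stop on end-of-sequence
            break
        generated.append(next_token)
        input_ids = torch.tensor([[next_token]], device=model.device)  # feed only the newest token

    return tokenizer.decode(generated, skip_special_tokens=True)
```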

What kinds of prompts still failed even after top-k sampling?

Some prompts remained incoherent. Examples include asking for a “favorite phrase” using a “microG scroll from the office” style prompt, and investment advice prompts that returned nonsensical or seemingly copied text. Another prompt about the “best team make and model” of a manual gearbox produced a plausible-sounding but inconsistent answer (e.g., GMC C10 / GMC Sierra variants), reinforcing that output quality varies by prompt.

What hardware and model-loading constraints appear in this setup?

The run targets CUDA (GPU) and uses a T4 with 16GB VRAM. The model is loaded with float16. Attempts to load in 8-bit failed in this environment, suggesting either the checkpoint isn’t compatible with that quantization path or the tooling configuration wasn’t set up for it.
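
For context, an 8-bit load in Transformers is typically attempted with a call like the sketch below (it depends on the bitsandbytes and accelerate packages being installed); this is the kind of call that failed in the walkthrough's environment.

```python
# Attempted 8-bit load (did not work in this setup); shown only for reference.
model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,          # same placeholder path as above
    load_in_8bit=True,   # requires bitsandbytes
    device_map="auto",   # requires accelerate
)
```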

Review Questions

  1. How would you modify the decoding strategy if a causal LLM produces garbled text under default Transformers generation?
  2. What role do end-of-sequence (EOS) tokens and cached decoding play in the custom top-k sampling loop?
  3. Why might two prompts—one factual and one conversational—produce very different quality from the same OpenLLaMA checkpoint?

Key Points

  1. OpenLLaMA provides open checkpoints (including a 300B-token version) trained on RedPajama, and reproducing behavior depends on using the same checkpoint scale.

  2. Hugging Face Transformers can load the model for causal language modeling on CUDA using float16, but 8-bit loading may fail depending on compatibility.

  3. Default Transformers “generate” settings produced unusable output in this setup, indicating that generation configuration and/or prompt formatting can be critical.

  4. Custom top-k sampling (k=10) improved output quality, including a correct-seeming factual response about the world’s tallest building.

  5. Token-by-token generation with cached states speeds inference by avoiding reprocessing the full prompt each step.

  6. Even with improved sampling, some prompts still yield nonsensical or copied-like text, so prompt design remains a major variable.

Highlights

The 300B-token OpenLLaMA checkpoint runs in Colab with Hugging Face Transformers, but default generation produced garbled text on a T4/16GB setup.
Switching to a custom top-k sampling loop (k=10) turned at least one factual query into a plausible answer (Burj Khalifa).
Top-k sampling helped, but it didn’t guarantee coherent responses across conversational and advice-style prompts.
