Dolly 2.0: Free ChatGPT-like Model for Commercial Use
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Dolly 2.0 is being released as a genuinely commercial-friendly, open instruction-tuned language model—complete with training code, dataset, and model weights—aimed at giving developers a ChatGPT-like option without paying for a closed API. Databricks positions it as an “open instruction following” model fine-tuned on human-generated instruction data, built for research and commercial use rather than as a purely academic artifact.
At the core is a 12 billion parameter model based on a Pythia-style foundation, trained on The Pile (described as 800GB+ of diverse text). The instruction tuning relies on the Dolly 15K dataset, created by Databricks employees through a structured labeling contest. Labelers tackled seven task types: open question answering, closed question answering, information extraction from provided text, summarization, brainstorming, classification, and creative writing such as poems and roleplay-style outputs. The dataset is described as containing long, high-quality answers, with examples provided in the release materials.
Databricks also sets expectations: Dolly 2.0 is not presented as state-of-the-art compared with top closed models like GPT-3, GPT-4, or similar systems. Instead, the release is framed as a “seed” for future work—an open dataset and model that can bootstrap follow-on instruction-tuned systems.
To make the model usable, the release points to a Hugging Face repository that includes the model and an instruction text generation pipeline. The pipeline wraps each query in a prompt template with an instruction field and a response field, then post-processes the generated text with regular expressions and an end marker to extract just the model's response. The practical setup in the walkthrough uses a Google Colab notebook, but it requires Google Colab Pro because the model's footprint is roughly 24GB, too large for typical free tiers. The setup installs Accelerate and Transformers, loads the tokenizer and model from the Hugging Face repo, and uses bfloat16 and an automatic device map to run on GPU when available.
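The template-plus-regex approach can be sketched in plain Python. The exact marker strings and template wording below are assumptions paraphrased from the release's description of the instruction pipeline, not a copy of the repository's code:

```python
import re

# Markers delimiting the model's answer in Dolly-style prompts
# (assumed names, following the pipeline described in the release).
RESPONSE_KEY = "### Response:"
END_KEY = "### End"

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    f"{RESPONSE_KEY}\n"
)

def build_prompt(instruction: str) -> str:
    """Fill the instruction field of the prompt template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

def extract_response(generated: str) -> str:
    """Pull out the text between the response marker and the end marker."""
    match = re.search(
        rf"{re.escape(RESPONSE_KEY)}\s*(.+?)\s*{re.escape(END_KEY)}",
        generated,
        flags=re.DOTALL,  # the answer may span multiple lines
    )
    return match.group(1).strip() if match else ""

# Simulate a raw model output: the prompt followed by the model's completion.
raw = build_prompt("What is Dolly 2.0?") + "An open instruction-tuned model.\n### End"
print(extract_response(raw))  # An open instruction-tuned model.
```

The end marker matters because causal language models keep generating after the answer; without it, the regex would have no right boundary to cut at.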
In live prompt tests, Dolly 2.0 produces coherent, sometimes quirky answers, with mixed success at constraints like "no more than three sentences" or "single sentence." For example, when asked "What is the meaning of life," Dolly returns a philosophical response that can exceed the requested sentence limit, while the free ChatGPT response stays closer to the format. In a pop-culture prompt ("Dwight Schrute… from The Office"), Dolly delivers a single-sentence style answer, whereas ChatGPT's output is more elaborate and more directly tied to the prompt's framing.
The comparison also highlights safety and refusal behavior: when asked to pick the “sexiest” person between Andrew and Pam (a prompt that veers into sexual content), Dolly provides a name, while ChatGPT declines and later offers a template-like refusal. The overall takeaway is that Dolly 2.0 offers a workable, commercial-usable open alternative with strong instruction-following potential, but with format adherence and capability gaps versus top closed models—and with real deployment constraints driven by its large memory requirements.
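The ~24GB figure mentioned above follows directly from the parameter count. This back-of-the-envelope check counts weight storage only, so it is a lower bound; activations, the KV cache, and generation buffers add to it at inference time:

```python
# Rough memory estimate for a 12B-parameter model stored in bfloat16.
params = 12_000_000_000
bytes_per_param = 2  # bfloat16 is 2 bytes per parameter (vs 4 for float32)
weight_bytes = params * bytes_per_param
print(f"{weight_bytes / 1e9:.0f} GB")  # 24 GB for weights alone
```

This is also why the walkthrough loads the model in bfloat16 rather than float32: full precision would double the footprint to roughly 48GB, beyond even most paid single-GPU tiers.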
Cornell Notes
Dolly 2.0 is a 12B-parameter, open instruction-tuned language model released with the full package—training code, Dolly 15K dataset, and model weights—explicitly suitable for commercial use. It is fine-tuned on human-generated instruction data spanning question answering, extraction, summarization, brainstorming, classification, and creative writing, and is built on a Pythia-style foundation trained on The Pile (800GB+). Databricks positions Dolly 2.0 as a starting point rather than state-of-the-art versus GPT-3/GPT-4-class systems. Running it requires substantial hardware (about 24GB of GPU memory), so the walkthrough uses Google Colab Pro and loads the model with bfloat16 plus an automatic device map. Prompt tests show Dolly can follow instructions and produce creative outputs, though it may miss strict sentence limits and can differ sharply from ChatGPT on sensitive prompts.
- What makes Dolly 2.0 “commercially usable,” and what exactly is released?
- How was the Dolly 15K instruction dataset created, and what task types were included?
- What hardware and software setup is needed to run Dolly 2.0 in the walkthrough?
- How does the instruction pipeline format prompts and extract responses?
- Where does Dolly 2.0 match or diverge from ChatGPT in the prompt comparisons?
- How do the models behave on a sensitive prompt about choosing between Andrew and Pam?
Review Questions
- What role does the Dolly 15K dataset play in Dolly 2.0’s instruction-following behavior, and which task categories were used to build it?
- Why does running Dolly 2.0 require Google Colab Pro (or equivalent hardware), and how do bfloat16 and device mapping help?
- In the prompt tests, what specific instruction-following differences appear between Dolly 2.0 and ChatGPT (e.g., sentence limits and sensitive-content handling)?
Key Points
1. Dolly 2.0 is released as an open instruction-tuned 12B model with weights, training code, and the Dolly 15K dataset explicitly positioned for commercial use.
2. Dolly 15K is built from human-generated instruction data across seven task types, including extraction, summarization, classification, and creative writing.
3. Databricks frames Dolly 2.0 as a foundation/seed for future work rather than a direct replacement for GPT-3/GPT-4-level systems.
4. Running Dolly 2.0 in practice requires large memory (about 24GB), making GPU-backed environments like Google Colab Pro a common path.
5. The Hugging Face instruction pipeline uses a structured prompt template and regex-based post-processing to extract the response field.
6. Prompt comparisons show Dolly can be creative and instruction-aware, but it may miss strict formatting constraints and can differ on sensitive prompts where ChatGPT refuses.