
Vicuna - 90% of ChatGPT quality by using a new dataset?

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Vicuna is described as a LLaMa 13B fine-tune trained on ShareGPT conversation transcripts.

Briefing

Vicuna is being positioned as an open-source-style chat model that delivers roughly “90% of ChatGPT quality” by fine-tuning a LLaMa base model on conversation data scraped from ShareGPT—yet the dataset controversy around that source may keep the model from being usable commercially.

At the core of the claim is how Vicuna was built and how it was benchmarked. The model is essentially a fine-tuned LLaMa variant: it starts from LLaMa 13B and then trains on instruction-style conversational examples drawn from ShareGPT, a site where users post ChatGPT-like dialogue transcripts. The training set is larger and structured differently from the data used in earlier instruction-tuning efforts. Alpaca, for instance, is described as trained on 52,000 “self-instruct” samples with a sequence length of 512. Vicuna, by contrast, uses about 70,000 conversation samples and expands the training context from 512 up to 2048 tokens by packing multi-turn back-and-forth exchanges into a single training span.
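To make the multi-turn packing idea concrete, here is a minimal sketch of how several exchanges might be concatenated into one training span. The role labels, the stand-in tokenizer, and the helper function are illustrative assumptions rather than Vicuna’s actual preprocessing pipeline; only the 2048-token budget comes from the description above.

```python
# Illustrative sketch of multi-turn conversation packing (not Vicuna's real code).
# A conversation is flattened into one token sequence until the 2048-token
# context budget described above is exhausted.

MAX_LEN = 2048

def pack_conversation(turns, tokenize, max_len=MAX_LEN):
    """Concatenate alternating human/assistant turns into a single training span."""
    token_ids = []
    for role, text in turns:
        piece = tokenize(f"{role}: {text}\n")
        if len(token_ids) + len(piece) > max_len:
            break  # stop once the context window is full
        token_ids.extend(piece)
    return token_ids

# Example: a question, an answer, and a follow-up all share one context window.
conversation = [
    ("Human", "What is fine-tuning?"),
    ("Assistant", "Adapting a pretrained model to new data."),
    ("Human", "Roughly how much data does that take?"),
    ("Assistant", "Usually far less than pretraining."),
]
# A byte-level stand-in tokenizer; a real setup would use the model's tokenizer.
ids = pack_conversation(conversation, tokenize=lambda s: list(s.encode("utf-8")))
```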

The “90% quality” framing comes from an evaluation method that feeds outputs from multiple models—LLaMa, Alpaca, Bard, and Vicuna—into GPT-4 with a scoring prompt. In that setup, raw LLaMa scores lowest because it isn’t meaningfully instruction-tuned. Alpaca performs better, while Vicuna lands much closer to Bard in the reported comparisons. ChatGPT is treated as a near-ideal reference point because the scoring setup is designed to reward the target style.
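The judging step can be pictured as building a comparison prompt and handing it to the scoring model. The sketch below is a rough approximation: `ask_judge` is a hypothetical callable standing in for whatever GPT-4 API wrapper is used, and the prompt wording and 1-to-10 scale are assumptions rather than the exact rubric from the Vicuna evaluation.

```python
# Rough sketch of an LLM-as-judge comparison; prompt wording and scale are assumed.

def build_judge_prompt(question, answer_a, answer_b):
    return (
        "You are comparing two assistant responses to the same question.\n"
        f"Question: {question}\n\n"
        f"Response A: {answer_a}\n\n"
        f"Response B: {answer_b}\n\n"
        "Rate each response from 1 to 10 for helpfulness, relevance, and detail, "
        "then reply in the form 'A: <score>, B: <score>' with a short justification."
    )

def judge_pair(question, answer_a, answer_b, ask_judge):
    """ask_judge: a caller-supplied function that sends a prompt string to the
    judging model (e.g. a thin wrapper around a GPT-4 API call) and returns text."""
    prompt = build_judge_prompt(question, answer_a, answer_b)
    return ask_judge(prompt)
```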

Still, the benchmarking approach is explicitly treated as imperfect. The transcript notes that chatbot evaluation remains an open research problem: different models can excel under different prompting styles, so “best model” results can shift depending on how prompts are crafted. That uncertainty is why the comparisons also include qualitative checks—such as observing that Vicuna tends to generate longer responses than Alpaca—and why the discussion points to alternative evaluation ideas borrowed from speech research, like mean opinion score (MOS), where humans or structured judgments decide which output is more human-like.
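For comparison, an MOS-style evaluation comes down to averaging absolute ratings from multiple judges per system. The numbers below are made up purely to show the arithmetic, not reported results.

```python
# Mean opinion score (MOS): each rater gives an absolute score (commonly 1-5),
# and the per-system mean is reported. Ratings here are invented for illustration.
from statistics import mean

ratings = {
    "model_a": [4, 5, 4, 3, 4],
    "model_b": [3, 3, 4, 2, 3],
}
mos = {system: mean(scores) for system, scores in ratings.items()}
print(mos)  # {'model_a': 4.0, 'model_b': 3.0}
```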

The biggest practical constraint comes from the ShareGPT data controversy. ShareGPT reportedly removed its “explore” page after alleging that Google used the site’s data to train Bard. The dispute escalated when reports emerged that a Google researcher, Jacob Devlin (first author of the BERT paper), quit after raising concerns with senior leadership. Google later denied that Bard was trained on ShareGPT/ChatGPT data, but the fallout still matters for Vicuna: since Vicuna was trained on that ShareGPT-derived material, the resulting model is described as not commercially usable. Add the separate licensing limits around LLaMa itself, and the transcript paints a picture of models that are fine to test but hard to deploy.

Finally, the discussion touches on adjacent efforts at large labs, including DeepMind’s involvement in Gemini and references to DeepMind’s earlier internal system “Sparrow,” which aimed to support features like citations. For now, Vicuna’s weights and training data are not released; code for training, serving, and evaluation is planned, and weights are expected to follow in some form, but the dataset itself is not expected to be shared. The result is a model that can be tried online, performs competitively on many prompts, and yet sits behind legal and data-access barriers that limit real-world adoption.

Cornell Notes

Vicuna is presented as a LLaMa-based chat model fine-tuned on ShareGPT conversation transcripts, aiming to reach “90% of ChatGPT quality.” The reported benchmark uses GPT-4 as a judge: it scores responses generated by LLaMa, Alpaca, Vicuna, and Bard under the same prompting framework, with Vicuna scoring much closer to Bard than Alpaca does. The evaluation method is acknowledged as non-rigorous because model performance can depend heavily on prompt style, making “quality” hard to measure consistently. A major complication is the ShareGPT data controversy, which—along with LLaMa and ChatGPT data licensing restrictions—limits commercial use. Vicuna can be tested via an online interface, but the training dataset is not expected to be released.

How does Vicuna get its performance, and what training data is it based on?

Vicuna is described as fine-tuning a LLaMa model (specifically the 13B variant). The fine-tuning data comes from ShareGPT, a site where users post conversation transcripts resembling ChatGPT dialogue. The transcript contrasts this with Alpaca’s smaller “self instruct” dataset and notes that Vicuna uses more conversation samples and a longer effective context window.

What does the “90% of ChatGPT quality” claim rely on?

The transcript says the comparison uses GPT-4 to score outputs. Responses are generated from LLaMa, Alpaca, Bard, and Vicuna, then sent to GPT-4 with a prompt that produces a score. LLaMa scores lowest because it isn’t strongly instruction-tuned; Alpaca improves; Vicuna scores much closer to Bard; ChatGPT is treated as the top reference because the scoring target aligns with ChatGPT-like behavior.

Why is the benchmarking approach considered limited?

Because chatbot evaluation is sensitive to prompting. A model can look strong under one prompting strategy and weak under another, so “best model” conclusions can change when prompt styles change. The transcript also notes that building robust evaluation systems for chatbots remains an open research question, unlike more standardized evaluation in some other domains.

What training changes distinguish Vicuna from Alpaca in the transcript?

Vicuna is trained on about 70,000 conversation samples and expands sequence length from 512 (associated with Alpaca) up to 2048. The expansion is achieved by using multi-round conversations packed into a single training span—e.g., a question, an answer, follow-ups, and subsequent answers within the same context window.

What controversy affects Vicuna’s usability beyond technical performance?

ShareGPT reportedly removed its explore page after alleging Google used ShareGPT data to train Bard. Reporting tied the dispute to Google researcher Jacob Devlin, who allegedly quit after raising concerns with senior leadership. Even though Google denied training Bard on ShareGPT/ChatGPT data, the transcript emphasizes that Vicuna’s reliance on that data makes the model not usable for commercial purposes, especially when combined with LLaMa’s licensing limits.

What is and isn’t released with Vicuna?

The transcript says release materials include code for training, serving, and evaluating, with plans to release weights “in some way,” but no plan to release the dataset. That means the key ShareGPT-derived training data is not expected to be available for others to reproduce the results.

Review Questions

  1. What specific fine-tuning and context-length changes does the transcript claim differentiate Vicuna from Alpaca?
  2. How does the GPT-4-as-judge scoring method work in the described benchmark, and what bias might it introduce?
  3. Why does the ShareGPT controversy matter for commercial deployment even if the model performs well in tests?

Key Points

  1. Vicuna is described as a LLaMa 13B fine-tune trained on ShareGPT conversation transcripts.
  2. The “90% of ChatGPT quality” claim is based on GPT-4 scoring of responses from LLaMa, Alpaca, Vicuna, and Bard.
  3. Chatbot benchmarking is treated as non-rigorous because results can vary with prompting strategies.
  4. Vicuna’s training is portrayed as using longer contexts (up to 2048 tokens) via multi-turn conversation packing.
  5. ShareGPT’s data-access controversy is linked to allegations about Bard training and may restrict commercial use of models trained on that data.
  6. Vicuna’s release includes training/serving/evaluation code, but the training dataset is not expected to be released.
  7. The transcript suggests future evaluation could borrow ideas like MOS-style judgments to reduce reliance on fragile prompt-dependent comparisons.

Highlights

Vicuna’s performance claim hinges on GPT-4 acting as a scoring judge for outputs from multiple models, with Vicuna landing much closer to Bard than Alpaca does.
Training context length is a key differentiator: Alpaca is associated with 512 tokens, while Vicuna expands to 2048 by packing multi-turn dialogues into one span.
Even with strong benchmark results, ShareGPT-derived training data and LLaMa/ChatGPT licensing limits are presented as major blockers for commercial deployment.
The evaluation method is acknowledged as incomplete because “quality” can shift dramatically with how prompts are written.

Topics

  • Vicuna
  • LLaMa Fine-Tuning
  • ShareGPT Dataset
  • Chatbot Benchmarking
  • Bard Controversy
