Vicuna - 90% of ChatGPT quality by using a new dataset?
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Vicuna is described as a LLaMa 13B fine-tune trained on ShareGPT conversation transcripts.
Briefing
Vicuna is positioned as an open-source-style chat model that delivers roughly “90% of ChatGPT quality” by fine-tuning a LLaMa base model on conversation data scraped from ShareGPT, yet the controversy around that data source may keep the model from being usable commercially.
At the core of the claim is how Vicuna was built and how it was benchmarked. The model is essentially a fine-tuned LLaMa variant: it starts from LLaMa 13B and then trains on instruction-style conversational examples drawn from ShareGPT, a site where users post their ChatGPT conversations. The training set is larger and structured differently from earlier instruction-tuned efforts. Alpaca, for instance, is described as trained on 52,000 “self-instruct” samples with a sequence length of 512. Vicuna, by contrast, uses about 70,000 conversation samples and expands the training context from 512 to 2048 tokens by packing multi-turn exchanges into a single training span.
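To make the context-packing idea concrete, here is a minimal sketch, assuming a generic tokenizer with an `encode` method and a hypothetical list-of-turns format for ShareGPT conversations; it illustrates the described approach, not Vicuna's actual training code.

```python
# Hypothetical illustration of multi-turn packing, not Vicuna's real pipeline.
MAX_LEN = 2048  # target training context described above (vs. 512 for Alpaca)


def pack_conversation(conversation, tokenizer, max_len=MAX_LEN):
    """Concatenate the turns of one conversation into a single training span,
    stopping once the token budget would be exceeded."""
    token_ids = []
    for turn in conversation:  # e.g. [{"role": "human", "text": "..."}, ...]
        ids = tokenizer.encode(f"{turn['role'].upper()}: {turn['text']}\n")
        if len(token_ids) + len(ids) > max_len:
            break  # a real pipeline might instead start a new span here
        token_ids.extend(ids)
    return token_ids
```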
The “90% quality” framing comes from an evaluation method that feeds outputs from several models (LLaMa, Alpaca, Bard, and Vicuna) into GPT-4 with a scoring prompt. In that setup, raw LLaMa scores lowest because it isn’t meaningfully instruction-tuned, Alpaca performs better, and Vicuna lands much closer to Bard in the reported comparisons. ChatGPT serves as a near-ideal reference point, since the scoring prompt is effectively designed to reward its style of response.
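As a rough illustration of the GPT-4-as-judge setup, the sketch below builds a pairwise scoring prompt and sends it to GPT-4 through the OpenAI Python client (openai>=1.0); the prompt wording, helper name, and 1-to-10 scale are assumptions for illustration, not the exact prompt used in the Vicuna evaluation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judging prompt; the real Vicuna evaluation prompt may differ.
JUDGE_PROMPT = (
    "You are grading two assistant responses to the same question.\n"
    "Question: {question}\n\n"
    "Response A:\n{answer_a}\n\n"
    "Response B:\n{answer_b}\n\n"
    "Score each response from 1 to 10 for helpfulness, relevance, and accuracy, "
    "then briefly explain your reasoning."
)


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to score two model outputs side by side."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b),
        }],
    )
    return completion.choices[0].message.content
```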
Still, the benchmarking approach is explicitly treated as imperfect. The transcript notes that chatbot evaluation remains an open research problem: different models can excel under different prompting styles, so “best model” results can shift depending on how prompts are crafted. That uncertainty is why the comparisons also include qualitative checks, such as observing that Vicuna tends to generate longer responses than Alpaca, and why the discussion points to alternative evaluation ideas borrowed from speech research, such as mean opinion score (MOS), in which human raters, or some structured judging procedure, decide which output is more human-like.
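For reference, a mean opinion score is simply the average of human ratings on a fixed scale (commonly 1 to 5 in speech evaluation); the ratings below are made-up numbers used only to show the calculation.

```python
def mean_opinion_score(ratings):
    """MOS: the arithmetic mean of per-rater scores for one output."""
    return sum(ratings) / len(ratings)


print(mean_opinion_score([4, 5, 3, 4, 4]))  # -> 4.0
```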
The biggest practical constraint comes from the ShareGPT data controversy. ShareGPT reportedly removed its “explore” page after alleging that Google used the site’s data to train Bard. The dispute escalated with reports that a Google researcher, Jacob Devlin (first author of the BERT paper), quit after raising concerns with senior leadership. Google later denied that Bard was trained on ShareGPT/ChatGPT data, but the fallout still matters for Vicuna: because Vicuna was trained on that ShareGPT-derived material, the resulting model is described as not commercially usable. Add the separate licensing limits around LLaMa itself, and the transcript paints a picture of models that are fine to test but hard to deploy.
Finally, the discussion touches on adjacent efforts at large labs, including DeepMind’s involvement in Gemini and references to DeepMind’s earlier internal system “Sparrow,” which aimed at features like citations. For now, Vicuna’s weights and training data are not fully released, though code for training/serving/evaluating is planned; the dataset itself is not expected to be shared. The result is a model that can be tried online, performs competitively in many prompts, and yet sits behind legal and data-access barriers that limit real-world adoption.
Cornell Notes
Vicuna is presented as a LLaMa-based chat model fine-tuned on ShareGPT conversation transcripts, aiming to reach “90% of ChatGPT quality.” The reported benchmark uses GPT-4 as a judge: it scores responses generated by LLaMa, Alpaca, Vicuna, and Bard under the same prompting framework, with Vicuna scoring much closer to Bard than Alpaca does. The evaluation method is acknowledged as non-rigorous because model performance can depend heavily on prompt style, making “quality” hard to measure consistently. A major complication is the ShareGPT data controversy, which—along with LLaMa and ChatGPT data licensing restrictions—limits commercial use. Vicuna can be tested via an online interface, but the training dataset is not expected to be released.
- How does Vicuna get its performance, and what training data is it based on?
- What does the “90% of ChatGPT quality” claim rely on?
- Why is the benchmarking approach considered limited?
- What training changes distinguish Vicuna from Alpaca in the transcript?
- What controversy affects Vicuna’s usability beyond technical performance?
- What is and isn’t released with Vicuna?
Review Questions
- What specific fine-tuning and context-length changes does the transcript claim differentiate Vicuna from Alpaca?
- How does the GPT-4-as-judge scoring method work in the described benchmark, and what bias might it introduce?
- Why does the ShareGPT controversy matter for commercial deployment even if the model performs well in tests?
Key Points
1. Vicuna is described as a LLaMa 13B fine-tune trained on ShareGPT conversation transcripts.
2. The “90% of ChatGPT quality” claim is based on GPT-4 scoring of responses from LLaMa, Alpaca, Vicuna, and Bard.
3. Chatbot benchmarking is treated as non-rigorous because results can vary with prompting strategies.
4. Vicuna’s training is portrayed as using longer contexts (up to 2048 tokens) via multi-turn conversation packing.
5. ShareGPT’s data-access controversy is linked to allegations about Bard training and may restrict commercial use of models trained on that data.
6. Vicuna’s release includes training/serving/evaluation code, but the training dataset is not expected to be released.
7. The transcript suggests future evaluation could borrow ideas like MOS-style judgments to reduce reliance on fragile prompt-dependent comparisons.