Vicuna: An Open-Source Chatbot Comparable to ChatGPT and Google Bard
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Vicuna is a 13B-parameter chatbot fine-tuned from Meta’s LLaMA using ShareGPT conversation data.
Briefing
Vicuna is an open-source chatbot built to deliver ChatGPT-like quality as an alternative to OpenAI's closed-model approach. The project centers on a 13-billion-parameter model fine-tuned from Meta's LLaMA using conversation data drawn from ShareGPT (user prompts paired with ChatGPT responses). At the time of recording, the model weights weren't publicly available due to unresolved legal concerns, but a web demo was live, allowing direct side-by-side testing against ChatGPT.
Vicuna's headline claim is its reported quality: the authors say it reaches roughly 90% of ChatGPT's quality when judged by GPT-4. The evaluation method matters: GPT-4 acts as a judge, scoring responses on helpfulness, relevance, accuracy, and detail. In the reported results, Vicuna generally outperforms other open alternatives such as Alpaca and LLaMA, landing close to Google's Bard and slightly below ChatGPT in the GPT-4-based ranking. Alpaca appears noticeably weaker in these comparisons, while Vicuna is described as "overwhelmingly" better than earlier open-source efforts.
Training details reinforce the “practical open model” narrative. The fine-tuning uses 70,000 ShareGPT conversation examples, and the project provides code and scripts intended to reproduce the training setup. The transcript also highlights that training costs are relatively low compared with earlier large-scale efforts: the reported figure for Vicuna’s 13B fine-tuning is around $300, contrasted with higher costs cited for Alpaca and other baselines. The project also includes guidance for fine-tuning on 8 GPUs, including parameters like batch size, learning rate, epochs, and token counts.
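To make the hyperparameters above concrete, here is a minimal sketch of how a multi-GPU fine-tuning configuration (batch size, learning rate, epochs, sequence length) ties together. The specific values are illustrative placeholders, not Vicuna's published configuration; only the 8-GPU setup and the 70,000-example dataset size come from the transcript.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    # Illustrative placeholder values -- NOT Vicuna's published settings.
    num_gpus: int = 8                 # transcript mentions an 8-GPU run
    per_gpu_batch_size: int = 4
    gradient_accumulation: int = 4
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_seq_len: int = 2048           # tokens per training example
    num_examples: int = 70_000        # ShareGPT conversations used

    @property
    def effective_batch_size(self) -> int:
        """Global batch size seen by the optimizer each step."""
        return self.num_gpus * self.per_gpu_batch_size * self.gradient_accumulation

    def steps_per_epoch(self) -> int:
        """Optimizer steps needed to see the whole dataset once."""
        return self.num_examples // self.effective_batch_size

cfg = FinetuneConfig()
print(cfg.effective_batch_size)   # 8 * 4 * 4 = 128
print(cfg.steps_per_epoch())      # 70000 // 128 = 546
```

The effective batch size is what matters for the learning-rate choice: the same 70,000 examples take far fewer optimizer steps per epoch when spread across 8 GPUs with gradient accumulation than on a single device.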
A key nuance is how evaluation differs across tasks. GPT-4-based judging is described as consistent and capable of producing detailed scoring rationales, but it’s also said to struggle with coding and math assessments—an important limitation when interpreting “quality” scores.
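The judging setup described above can be sketched as two pieces: building a pairwise-comparison prompt and parsing the judge's scores. In this sketch, the rubric wording and output format are assumptions for illustration, not the project's actual evaluation prompt, and a real GPT-4 API call would replace the hard-coded judge reply.

```python
import re

CRITERIA = ["helpfulness", "relevance", "accuracy", "level of detail"]

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-comparison prompt for an LLM judge.

    The rubric mirrors the criteria named in the evaluation
    (helpfulness, relevance, accuracy, detail); the exact wording
    is illustrative, not the project's published prompt.
    """
    return (
        f"Question: {question}\n\n"
        f"Assistant A's answer: {answer_a}\n\n"
        f"Assistant B's answer: {answer_b}\n\n"
        f"Rate each assistant on {', '.join(CRITERIA)}.\n"
        "Output two scores from 1 to 10 on the first line as 'A B', "
        "then explain your reasoning."
    )

def parse_scores(judge_output: str) -> tuple[int, int]:
    """Pull the two leading integer scores out of the judge's reply."""
    match = re.match(r"\s*(\d+)\s+(\d+)", judge_output)
    if not match:
        raise ValueError("judge output did not start with two scores")
    return int(match.group(1)), int(match.group(2))

# A hypothetical judge reply scoring assistant A at 8 and B at 9.
a, b = parse_scores("8 9\nBoth answers are accurate; B is more detailed.")
print(round(a / b, 2))  # 0.89 -- a per-question "percent of B" ratio
```

Averaging such ratios over many questions is one way a "90% of ChatGPT" figure can arise, which also shows why the judge's blind spots (coding, math) propagate directly into the headline number.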
The live demo offers qualitative checks that mirror the benchmark story. For a philosophical prompt (“What is the meaning of life?”), both Vicuna and ChatGPT produce coherent, thoughtful answers. When prompted to roleplay as Dwight Schrute, Vicuna’s response is portrayed as more personal and character-specific, while ChatGPT’s answer is more direct. For a more sensitive question (“Who is hotter, Angela or Pam?”), both systems apply safety-style constraints: Vicuna provides a non-crude comparison and avoids forcing a binary choice, while ChatGPT either refuses or responds with a restrained, non-objectifying framing.
Overall, Vicuna positions itself as one of the strongest open-source chatbots in its class—especially among models derived from LLaMA—while still falling short of ChatGPT in the GPT-4 scoring hierarchy. The missing weights at the time of testing limit experimentation, but the promised release and the availability of training code keep the project firmly in the open-model spotlight.
Cornell Notes
Vicuna is a 13B-parameter open-source chatbot designed to approach ChatGPT quality by fine-tuning Meta’s LLaMA on conversation data from ShareGPT. The project reports that GPT-4 can score Vicuna responses at roughly 90% of ChatGPT quality, with Vicuna generally beating earlier open models like Alpaca and LLaMA, and landing close to Bard. Evaluation uses GPT-4 as a judge across helpfulness, relevance, accuracy, and detail, though GPT-4 is noted as weaker at judging coding and math. A web demo was available even before the model weights were released, and sample prompts show both strong general responses and safety-style refusals for sensitive requests.
- What data and base model does Vicuna use to reach ChatGPT-like behavior?
- How do the reported quality comparisons between Vicuna, ChatGPT, Bard, Alpaca, and LLaMA work?
- Why is GPT-4-as-judge both useful and limited?
- What does the transcript's qualitative demo suggest about Vicuna's style versus ChatGPT?
- What practical resources does the project provide even before weights are released?
Review Questions
- How does using GPT-4 as an evaluator shape confidence in Vicuna’s reported “90% of ChatGPT” quality claim?
- What evidence from the demo suggests Vicuna’s safety behavior differs from or matches ChatGPT’s handling of sensitive prompts?
- Why might benchmark results based on GPT-4 scoring be less trustworthy for coding and math tasks?
Key Points
1. Vicuna is a 13B-parameter chatbot fine-tuned from Meta's LLaMA using ShareGPT conversation data.
2. The project reports roughly "90% of ChatGPT" quality based on GPT-4 scoring across helpfulness, relevance, accuracy, and detail.
3. In GPT-4-based rankings, Vicuna generally beats open baselines like Alpaca and LLaMA and sits close to Bard, slightly below ChatGPT.
4. GPT-4-as-judge can be consistent and detailed, but it's also described as weak at evaluating coding and math tasks.
5. Training is presented as relatively affordable for the 13B fine-tuning stage (about $300 in the transcript) and includes reproducibility code.
6. At the time of testing, Vicuna model weights were not publicly available due to legal concerns, but a web demo enabled direct comparisons.
7. Demo prompts show strong general responses and safety-style constraints on sensitive, objectifying questions.