How to make a custom dataset like Alpaca 7B
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Start with a small set of human-written seed instructions, then expand them using GPT-3 to reach tens of thousands of examples.
Briefing
A practical path to building an “Alpaca-style” instruction dataset is to start with a small set of human-written seed tasks, then use GPT-3 to expand them into tens of thousands of instruction–response examples—while filtering out unwanted language and near-duplicates. The key takeaway is that dataset quality and usefulness depend less on writing thousands of prompts by hand and more on controlling what GPT generates, how it’s filtered, and how repetitive the results become.
The approach begins with a reference design: Stanford’s released dataset starts from 175 human-written instruction tasks spanning common writing and transformation behaviors (e.g., generate a list, generate a sentence, generate a story, rewrite a sentence, explain a concept). Those 175 seeds are then fed into GPT-3 to produce 52,000 examples, turning a manageable human effort (hours to a day, or faster with help) into a large training set.
To replicate the method, the workflow uses code that (1) formats prompts into the instruction/input/output structure, (2) calls GPT-3 to generate responses, and (3) applies filters. The transcript highlights a blacklist mechanism for excluding certain words or language patterns—useful when building a dataset for a specific topic or compliance needs. It also notes scoring similarity between generated instructions, which can help remove examples that are almost the same, reducing redundancy.
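The blacklist step of that workflow can be sketched in a few lines. This is an illustrative stand-in, not the exact Stanford/Alpaca code: the function name, word list, and sample candidates are all assumptions made for the example.

```python
# Illustrative blacklist filter for generated instructions.
# The word list is hypothetical; in practice it would hold whatever
# terms or task types you want to exclude (e.g., tasks a text-only
# model cannot perform, or words that violate a compliance policy).
BLACKLIST = {"image", "graph", "file", "map"}

def passes_blacklist(instruction: str, blacklist=BLACKLIST) -> bool:
    """Reject an instruction if it contains any blacklisted word."""
    words = instruction.lower().split()
    return not any(bad in words for bad in blacklist)

candidates = [
    "Write a short poem about autumn.",
    "Draw a graph of sales over time.",
]
kept = [c for c in candidates if passes_blacklist(c)]
# Only the poem task survives; the graph task is filtered out.
```

Simple word matching like this is crude (it misses inflections and multi-word phrases), but it mirrors the idea in the transcript: cheap rules applied after generation control what ends up in the dataset.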
The dataset format itself is straightforward: many entries contain an “instruction” and an “output” only, while others include an additional “input” field. Examples shown include tasks like proposing an ethical solution to data privacy (instruction plus output) and identifying an incorrect word (instruction plus input plus output). The purpose isn’t to embed deep domain knowledge; it’s to teach the model how to follow instructions and produce the right kind of response.
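The two example entries described above can be written in the instruction/input/output schema like this. The record text paraphrases the examples from the transcript; the exact wording of the outputs is an assumption. Entries without an input simply carry an empty string.

```python
# Illustrative records in the instruction/input/output schema used by
# Alpaca-style datasets. The "output" text here is paraphrased, not
# copied from the actual released data.
examples = [
    {
        "instruction": "Propose an ethical solution to the problem of data privacy.",
        "input": "",  # no extra context needed for this task
        "output": "One approach is to require explicit, informed consent before any personal data is collected or shared.",
    },
    {
        "instruction": "Identify the incorrect word in the sentence.",
        "input": "The quick brown fox jumped over the lazzy dog.",
        "output": "lazzy",
    },
]
```

Keeping the schema uniform, even when the input field is empty, makes it easy to template every record into a single prompt format at fine-tuning time.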
From there, the transcript shifts to a hands-on customization. A small new seed set is created for customer service use cases—refund policy explanations, troubleshooting password changes, lost package inquiries, and similar support questions. The process runs a “generate instruction” function that expands each seed into multiple new tasks, using batched, parallel calls to speed up generation. Even with a modest target, filtering kicks in: one run generated 32 candidate instructions but kept only 29, implying some were rejected by the blacklist or other constraints.
After generation, the results can be reviewed directly, and similarity metrics can be computed to understand how repetitive the dataset is. The transcript also demonstrates why filtering matters: some generated items drift away from the intended domain (e.g., a cover-letter request appears even though the goal is customer-service question answering). Similarity checks can help flag such out-of-scope examples.
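A minimal version of that similarity check can be built with the standard library. The Stanford pipeline scores candidates against existing instructions with an n-gram overlap metric; as a lightweight stand-in, `difflib.SequenceMatcher` gives a comparable 0–1 ratio. The instruction strings below are hypothetical.

```python
from difflib import SequenceMatcher

def max_similarity(candidate: str, accepted: list[str]) -> float:
    """Highest similarity between a candidate and any accepted instruction."""
    if not accepted:
        return 0.0
    return max(
        SequenceMatcher(None, candidate.lower(), a.lower()).ratio()
        for a in accepted
    )

accepted = ["Explain the refund policy for online orders."]
near_dup = "Explain the refund policy for online purchases."
distinct = "Write a cover letter for a marketing job."

# The near-duplicate scores much higher than the off-topic request,
# so a threshold on this score can reject redundant candidates and
# flag out-of-scope ones for review.
```

In a real pipeline you would reject any candidate whose maximum similarity to the already-accepted set exceeds a chosen threshold, then spot-check the low-similarity outliers for domain drift like the cover-letter example.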
Finally, the generated instruction–response pairs become training data for an instruction-tuned model in the “custom alpaca” style. The transcript emphasizes scaling: small experiments (like 29 examples) are mainly for validation, while higher-quality datasets would ramp toward 50,000 or 100,000 examples. It also raises a design choice for chat datasets—whether to include simple single-turn Q&A or multi-step back-and-forth—because that determines what conversational behavior the model learns.
Cornell Notes
The dataset-building method starts with a small set of human-written seed instructions (175 in the reference design) and uses GPT-3 to expand them into a much larger instruction–response dataset (52,000 examples). A key part of the workflow is filtering: blacklist-based rejection of unwanted language and similarity scoring to reduce near-duplicate instructions. The transcript then demonstrates creating a custom customer-service dataset by writing a small set of support-related seed tasks (refunds, password resets, lost packages) and generating many more instructions from them, keeping only the filtered results. The resulting instruction/input/output pairs can then be used to fine-tune an Alpaca-style model for a specific niche or domain.
How does the reference dataset turn 175 human-written tasks into 52,000 training examples?
What does the instruction dataset structure look like in practice?
Why are blacklist filters and similarity scoring important when generating new instructions?
How does the transcript’s customer-service dataset differ from the general seed set?
What happens when generation is scaled up, and why does filtering reduce the final count?
Review Questions
- What are the three main components of the dataset generation workflow (formatting, GPT-3 generation, and filtering), and what does each contribute?
- How would you use similarity scoring to prevent your custom dataset from becoming repetitive or drifting off-domain?
- When building a chat dataset, what difference does it make to include multi-step back-and-forth versus single-turn question/answer pairs?
Key Points
1. Start with a small set of human-written seed instructions, then expand them using GPT-3 to reach tens of thousands of examples.
2. Use an instruction/input/output schema so the model learns the expected response pattern.
3. Apply blacklist-based filtering to exclude unwanted words or categories of content.
4. Compute similarity scores to remove near-duplicate instructions and improve dataset diversity.
5. Review generated outputs for domain drift; similarity metrics can help flag out-of-scope examples.
6. When scaling up, plan for filtering losses and aim for large counts (e.g., 50,000–100,000) to improve training quality.
7. Decide whether the dataset should be single-turn Q&A or multi-step conversation, since that shapes learned conversational behavior.