
How to make a custom dataset like Alpaca7B

Sam Witteveen · 4 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Start with a small set of human-written seed instructions, then expand them using GPT-3 to reach tens of thousands of examples.

Briefing

A practical path to building an “Alpaca-style” instruction dataset is to start with a small set of human-written seed tasks, then use GPT-3 to expand them into tens of thousands of instruction–response examples—while filtering out unwanted language and near-duplicates. The key takeaway is that dataset quality and usefulness depend less on writing thousands of prompts by hand and more on controlling what GPT generates, how it’s filtered, and how repetitive the results become.

The approach begins with a reference design: Stanford’s released dataset starts from 175 human-written instruction tasks spanning common writing and transformation behaviors (e.g., generate a list, generate a sentence, generate a story, rewrite a sentence, explain a concept). Those 175 seeds are then fed into GPT-3 to produce 52,000 examples, turning a manageable human effort (hours to a day, or faster with help) into a large training set.
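
For concreteness, each seed task in the released repo is a small JSON record pairing an instruction with one or more example instances. Here is a minimal sketch of one record; the field names follow the published seed_tasks.jsonl, but the values are invented for illustration:

```python
# One seed task, modeled on the records in Stanford Alpaca's seed_tasks.jsonl.
# The values are invented for illustration.
seed_task = {
    "id": "seed_task_0",
    "name": "sentence_rewrite",
    "instruction": "Rewrite the following sentence in a more formal tone.",
    "instances": [
        {"input": "Hey, can you fix this bug soon?",
         "output": "Could you please address this defect at your earliest convenience?"}
    ],
    "is_classification": False,  # classification tasks are prompted differently during expansion
}
```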

To replicate the method, the workflow uses code that (1) formats prompts into the instruction/input/output structure, (2) calls GPT-3 to generate responses, and (3) applies filters. The transcript highlights a blacklist mechanism for excluding certain words or language patterns—useful when building a dataset for a specific topic or compliance needs. It also notes scoring similarity between generated instructions, which can help remove examples that are almost the same, reducing redundancy.
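
As a rough sketch of those first two stages, the generation step might look like the following. This uses the legacy openai 0.x Completion API that was current when the video was made; the prompt text, function names, and parameters are simplified stand-ins, not the repo’s actual code:

```python
import openai  # legacy 0.x API (openai.Completion), current at the time of the video

def encode_prompt(seed_instructions):
    """Format seed instructions as a numbered few-shot list for GPT-3 to continue."""
    prompt = "Come up with a series of new tasks:\n"
    for i, inst in enumerate(seed_instructions, start=1):
        prompt += f"{i}. {inst}\n"
    # End mid-list so the model completes item N+1 onward.
    prompt += f"{len(seed_instructions) + 1}."
    return prompt

def generate_instructions(seed_instructions, n=1):
    """Call GPT-3 and parse the numbered completions back into instruction strings."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=encode_prompt(seed_instructions),
        max_tokens=512,
        temperature=0.7,
        n=n,  # several completions per call
    )
    new_instructions = []
    for choice in response["choices"]:
        for line in choice["text"].split("\n"):
            line = line.strip().lstrip("0123456789. ").strip()
            if line:
                new_instructions.append(line)
    return new_instructions
```

The third stage, filtering, then runs over new_instructions before anything is kept; sketches of the similarity and blacklist checks appear later in this summary.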

The dataset format itself is straightforward: many entries contain an “instruction” and an “output” only, while others include an additional “input” field. Examples shown include tasks like proposing an ethical solution to data privacy (instruction plus output) and identifying an incorrect word (instruction plus input plus output). The purpose isn’t to embed deep domain knowledge; it’s to teach the model how to follow instructions and produce the right kind of response.
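
In code terms, the two shapes mentioned above look like this (the output text is paraphrased for illustration):

```python
# Two records in the instruction/input/output format described above.
examples = [
    {   # instruction + output only (input left empty)
        "instruction": "Propose an ethical solution to the problem of data privacy.",
        "input": "",
        "output": "One approach is to give users explicit, revocable control over ...",
    },
    {   # instruction + input + output
        "instruction": "Identify the incorrect word in the sentence.",
        "input": "The quick brown fox jumped over the lazey dog.",
        "output": "The incorrect word is 'lazey'; it should be 'lazy'.",
    },
]
```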

From there, the transcript shifts to a hands-on customization. A small new seed set is created for customer service use cases—refund policy explanations, troubleshooting password changes, lost package inquiries, and similar support questions. The process runs a “generate instruction” function that expands each seed into multiple new tasks, using batch calls to speed up generation on multi-core hardware. Even with a modest target, filtering kicks in: one run generated 32 candidate instructions but kept only 29, implying some were rejected by the blacklist or other constraints.
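
A sketch of that fan-out step, reusing the hypothetical generate_instructions() from the earlier snippet; the seed wording and worker count are invented:

```python
from multiprocessing import Pool

# Invented customer-service seeds, in the spirit of the ones written in the video.
cs_seeds = [
    "Explain the company's refund policy to a customer who wants to return an item.",
    "Walk a customer through changing the password on their account.",
    "Respond to a customer asking about a package that has not arrived.",
]

def expand_seed(seed):
    # generate_instructions() is the GPT-3 sketch shown earlier; each seed
    # fans out into several new candidate tasks.
    return generate_instructions([seed], n=4)

if __name__ == "__main__":
    # Batch the API calls across processes, as the video does on multi-core hardware.
    with Pool(processes=4) as pool:
        batches = pool.map(expand_seed, cs_seeds)
    candidates = [inst for batch in batches for inst in batch]
    print(f"generated {len(candidates)} candidate instructions")
```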

After generation, the results can be reviewed directly, and similarity metrics can be computed to understand how repetitive the dataset is. The transcript also demonstrates why filtering matters: some generated items drift away from the intended domain (e.g., a cover-letter request appears even though the goal is customer-service question answering). Similarity checks can help flag such out-of-scope examples.
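
One concrete way to run that check is ROUGE-L overlap between instructions, which is the metric the Stanford pipeline uses for deduplication; the threshold and example strings below are illustrative:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_near_duplicate(candidate, kept, threshold=0.7):
    """True if the candidate's ROUGE-L F1 against any kept instruction is too high."""
    return any(
        scorer.score(prev, candidate)["rougeL"].fmeasure > threshold
        for prev in kept
    )

candidates = [
    "Explain how a customer can request a refund.",
    "Explain how a customer may request a refund.",           # near-duplicate, dropped
    "Write a cover letter for a software engineering job.",   # passes dedup; drift needs a different check
]

kept = []
for inst in candidates:
    if not is_near_duplicate(inst, kept):
        kept.append(inst)

print(kept)  # the second, nearly identical instruction is filtered out
```

Note that the cover-letter item survives deduplication; catching domain drift like that means checking similarity in the other direction, flagging candidates that score too low against the seed set.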

Finally, the generated instruction–response pairs become training data for an instruction-tuned model in the “custom Alpaca” style. The transcript emphasizes scaling: small experiments (like 29 examples) are mainly for validation, while higher-quality datasets would ramp toward 50,000 or 100,000 examples. It also raises a design choice for chat datasets: whether to include simple single-turn Q&A or multi-step back-and-forth, because that determines what conversational behavior the model learns.
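
Once the pairs are collected, each record is rendered into a training prompt. The template below is the widely circulated Alpaca one; the video does not show its exact wording on screen here, so treat this as the conventional choice rather than a quote:

```python
# The widely used Alpaca training templates: records with a non-empty "input"
# field get the longer form.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def to_training_text(example):
    """Render one record into the full training string (prompt plus target response)."""
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example) + example["output"]

record = {"instruction": "Explain the refund policy.", "input": "", "output": "Refunds are ..."}
print(to_training_text(record))
```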

Cornell Notes

The dataset-building method starts with a small set of human-written seed instructions (175 in the reference design) and uses GPT-3 to expand them into a much larger instruction–response dataset (52,000 examples). A key part of the workflow is filtering: blacklist-based rejection of unwanted language and similarity scoring to reduce near-duplicate instructions. The transcript then demonstrates creating a custom customer-service dataset by writing a small set of support-related seed tasks (refunds, password resets, lost packages) and generating many more instructions from them, keeping only the filtered results. The resulting instruction/input/output pairs can then be used to fine-tune an Alpaca-style model for a specific niche or domain.

How does the reference dataset turn 175 human-written tasks into 52,000 training examples?

It takes the 175 seed instructions and feeds them into GPT-3 to generate additional instruction–response examples. The transcript frames this as a practical tradeoff: writing 175 seeds by hand is manageable (hours to a day), while GPT-3 expansion produces the scale needed for instruction fine-tuning.

What does the instruction dataset structure look like in practice?

Entries typically use an instruction field and an output field, with many also including an input field. Examples include tasks like proposing an ethical solution to data privacy (instruction → output) and identifying an incorrect word (instruction + input → output). This structure guides the model to follow instructions and produce the expected response format.

Why are blacklist filters and similarity scoring important when generating new instructions?

Blacklist filters remove generated content containing unwanted words or language patterns, which is especially useful for topic- or policy-specific datasets. Similarity scoring helps detect and remove instructions that are almost the same, reducing redundancy and improving the diversity of training examples.
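
A minimal version of such a blacklist filter might look like this; the word list is invented, and a real one would be topic- or policy-specific:

```python
import re

# Invented blacklist; real lists depend on the topic and compliance requirements.
BLACKLIST = ["image", "graph", "draw", "file"]  # e.g., tasks a text-only model can't perform

def passes_blacklist(instruction):
    """Reject instructions containing any blacklisted word (whole-word, case-insensitive)."""
    return not any(
        re.search(rf"\b{re.escape(word)}\b", instruction, re.IGNORECASE)
        for word in BLACKLIST
    )

assert passes_blacklist("Explain the refund policy to a customer.")
assert not passes_blacklist("Draw a graph of monthly sales.")
```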

How does the transcript’s customer-service dataset differ from the general seed set?

The new seeds focus on support workflows such as refund policy explanation, troubleshooting password changes, and lost package inquiries. The goal is instruction-following for customer-service style question answering, so the generated dataset should mostly contain support-related Q&A rather than unrelated tasks like writing code or general writing prompts.

What happens when generation is scaled up, and why does filtering reduce the final count?

Even when the generator targets a number of outputs per seed, filtering can discard some candidates. In the example run, 32 instructions were generated but only 29 were kept, suggesting some were rejected due to blacklist rules or similarity/quality constraints. Scaling to 50,000–100,000 examples would increase coverage, but only if filtering keeps the dataset on-domain and non-redundant.

Review Questions

  1. What are the three main components of the dataset generation workflow (formatting, GPT-3 generation, and filtering), and what does each contribute?
  2. How would you use similarity scoring to prevent your custom dataset from becoming repetitive or drifting off-domain?
  3. When building a chat dataset, what difference does it make to include multi-step back-and-forth versus single-turn question/answer pairs?

Key Points

  1. Start with a small set of human-written seed instructions, then expand them using GPT-3 to reach tens of thousands of examples.
  2. Use an instruction/input/output schema so the model learns the expected response pattern.
  3. Apply blacklist-based filtering to exclude unwanted words or categories of content.
  4. Compute similarity scores to remove near-duplicate instructions and improve dataset diversity.
  5. Review generated outputs for domain drift; similarity metrics can help flag out-of-scope examples.
  6. When scaling up, plan for filtering losses and aim for large counts (e.g., 50,000–100,000) to improve training quality.
  7. Decide whether the dataset should be single-turn Q&A or multi-step conversation, since that shapes learned conversational behavior.

Highlights

Stanford’s reference pipeline uses 175 seed tasks and GPT-3 to generate 52,000 instruction–response examples, showing how small human effort can produce large training data.
Blacklist filtering and similarity scoring are central to keeping generated datasets on-topic and non-redundant.
A customer-service dataset can be built by seeding with refund, password reset, and lost-package prompts, then generating and filtering many more instruction pairs.
Scaling from a tiny test set (29 kept examples) to 50,000–100,000 examples is the difference between a demo and a useful fine-tuning dataset.
Including multi-step dialogue examples changes what the model learns compared with single-turn question answering.

Topics

  • Instruction Fine-Tuning
  • Dataset Generation
  • GPT-3 Prompt Expansion
  • Filtering and Deduplication
  • Customer Support Datasets
