Anthropic's Latest Winner - Workbench

Sam Witteveen · 4 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Anthropic’s Workbench update turns prompt creation into a full workflow that includes prompt testing, batch test generation, and benchmarking.

Briefing

Anthropic has overhauled its developer “Workbench” inside the Anthropic console, turning prompt building into a full testing and benchmarking workflow. Instead of treating prompts as static text, developers can now generate prompts from scratch, run them against targeted test cases, and assemble a versioned evaluation suite to measure how well different prompt designs handle context and produce reliable outputs—then export code for use in production.

The core shift is the addition of prompt generation plus prompt testing in one place. A developer can describe a task in plain language—such as drafting a response to a YouTube comment that classifies whether the comment is toxic and whether it deserves a reply—and then click “generate prompt.” Claude produces a detailed, ready-to-run prompt that includes a system-style instruction block, the input comment, and step-by-step decision criteria. In this example, the generated instructions explicitly guide toxicity detection (offensive language, personal attacks, hate speech, threats, harassment, excessive negativity, trolling) and then determine whether a reply is warranted.
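
The video does not reproduce the generated prompt verbatim, but based on the structure described above, a hypothetical sketch of what Workbench produces might look like the following (the wording and the {{COMMENT}} placeholder are illustrative, not copied from the console):

```
You are an assistant that analyzes YouTube comments for a channel owner.

For the comment below, work through these steps:
1. Check for toxicity signals: offensive language, personal attacks, hate speech,
   threats, harassment, excessive negativity, or trolling.
2. Classify the comment (e.g., not toxic, moderate toxicity, highly toxic).
3. Decide whether the comment deserves a reply and briefly explain why.

Comment:
{{COMMENT}}

Give your reasoning first, then a final label and reply decision.
```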

From there, Workbench moves into evaluation mode. The console prompts the user to supply the specific YouTube comment, runs the prompt, and returns a classification result along with reasoning and a final label (e.g., not toxic, moderate toxicity). Developers can generate multiple test cases automatically, run them, and then refine the prompt based on observed weaknesses. The workflow supports editing the prompt directly and re-running evaluations to see how changes affect outcomes.
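
Outside the console, the same single-case check can be reproduced with Anthropic's Python SDK. The sketch below is illustrative rather than the code Workbench exports: the model alias, system prompt, and classify_comment helper are assumptions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You analyze YouTube comments, classify their toxicity, "
    "and decide whether they deserve a reply."
)

def classify_comment(comment: str) -> str:
    """Run one test case through the prompt and return the model's reasoning and label."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias; any current Claude model works
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Comment:\n{comment}\n\nClassify toxicity and decide if a reply is warranted.",
        }],
    )
    return message.content[0].text

print(classify_comment("This explanation finally made the topic click for me, thanks!"))
```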

A key feature is versioning. Workbench tracks prompt versions and response versions, enabling side-by-side comparisons—such as a second attempt where the developer slightly rearranges instructions (for example, moving content between user and system sections). The console can then score results across versions, letting users label performance (e.g., “excellent” or “fair”) and quickly identify which prompt variant performs better.

Workbench also streamlines what used to be a spreadsheet-and-manual-scoring problem. Previously, teams often exported test cases and relied on human scoring methods like MOS (mean opinion score) to compare prompt quality. Now, evaluation and scoring live in the same dashboard, with the ability to import test cases from CSV and to generate additional cases on demand.
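
The import itself happens in the console, but as a rough sketch of the kind of test-case file involved (the column names below are assumptions, not a documented Workbench schema):

```python
import csv

# test_cases.csv might look like:
# comment,expected_label
# "Great walkthrough, subscribed!",not_toxic
# "Nobody should listen to this idiot.",toxic

with open("test_cases.csv", newline="") as f:
    test_cases = list(csv.DictReader(f))

for case in test_cases:
    print(case["comment"], "->", case["expected_label"])
```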

Once the evaluation suite is complete, developers can click “code” to export runnable code that reproduces the prompt set and evaluation configuration. They can also adjust runtime parameters such as maximum tokens and rerun the suite against different Anthropic models (the transcript mentions comparing Haiku and Sonnet). The result is a repeatable loop: generate prompts, test and edit them, benchmark versions, and export the workflow for trial or production deployments—directly inside Anthropic’s console.
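
The exported code itself is not shown in the video, but the model-comparison loop it enables can be sketched as follows; the model aliases and the run_suite helper are illustrative assumptions, not the console's output.

```python
import anthropic

client = anthropic.Anthropic()
MODELS = ["claude-3-5-haiku-latest", "claude-3-5-sonnet-latest"]  # assumed aliases for Haiku and Sonnet

def run_suite(system_prompt: str, cases: list[dict]) -> dict[str, list[str]]:
    """Run every test case against each model so outputs can be compared case by case."""
    results: dict[str, list[str]] = {m: [] for m in MODELS}
    for model in MODELS:
        for case in cases:
            msg = client.messages.create(
                model=model,
                max_tokens=512,  # the runtime parameter the video mentions adjusting
                system=system_prompt,
                messages=[{"role": "user", "content": case["comment"]}],
            )
            results[model].append(msg.content[0].text)
    return results
```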

Cornell Notes

Anthropic’s updated Workbench in the Anthropic console adds an end-to-end workflow for prompt engineering: generate prompts, run them on test cases, and benchmark multiple prompt versions. Developers can create a prompt from a task description (e.g., classifying YouTube comments for toxicity and reply-worthiness), then fill in inputs and see labeled outputs with reasoning. Workbench supports generating batches of test cases, importing cases from CSV, and versioning prompts and responses so teams can compare variants. Scoring and evaluation happen inside the dashboard, and the final setup can be exported as code for running in trial or production. The same evaluation suite can be rerun across different Anthropic models to compare performance.

What new capability does Anthropic’s Workbench add for developers beyond prompt generation?

Workbench adds a built-in evaluation loop: prompts can be generated from scratch, then tested against specific inputs and multiple generated test cases. The console returns classification outputs (e.g., toxic vs. not toxic, and whether a reply is deserved) along with reasoning, and it supports iterative editing and re-running so prompt weaknesses can be identified and corrected.

How does the transcript’s YouTube-comment example illustrate prompt structure in Workbench?

After describing the task, Claude generates a long, ready-to-run prompt that includes a system-style instruction block (an assistant role for analyzing YouTube comments), the actual comment inserted as input, and step-by-step criteria. The criteria explicitly cover toxicity signals such as offensive language, personal attacks, hate speech, threats, harassment, excessive negativity, and trolling, followed by a decision about whether the comment deserves a reply.

Why does versioning matter in Workbench evaluations?

Versioning lets developers compare prompt variants and track changes over time. The transcript describes creating a second attempt by slightly modifying the prompt (for example, moving instruction content between user and system sections), then running both versions against the same inputs and scoring the results side by side. Because prompt and response versions are tracked, teams can determine which design changes improve outcomes rather than relying on one-off tests.

How does Workbench replace or reduce spreadsheet-based prompt scoring workflows?

Instead of exporting test cases and using manual scoring like MOS (mean opinion score) in spreadsheets, Workbench keeps test cases, runs, and scoring in one dashboard. Developers can generate or import test cases (including via CSV), score prompt versions directly in the interface, and then export code that reproduces the evaluation setup.
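
For context, the spreadsheet-era MOS step this replaces amounts to averaging human ratings per prompt version; a minimal sketch with made-up scores:

```python
# Hypothetical 1-5 reviewer ratings for two prompt versions.
ratings = {
    "prompt_v1": [4, 3, 5, 4],
    "prompt_v2": [5, 4, 5, 5],
}

for version, scores in ratings.items():
    mos = sum(scores) / len(scores)  # mean opinion score
    print(f"{version}: MOS = {mos:.2f}")
```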

What does exporting code enable after completing prompt evaluations?

Once evaluations are scored, developers can click “code” to obtain runnable code that executes the prompt set and evaluation configuration. The transcript notes that parameters like maximum tokens can be adjusted, and the suite can be rerun against different models (e.g., Haiku vs. Sonnet) to compare performance under the same test conditions.

Review Questions

  1. How would you design a Workbench evaluation suite to compare two prompt variants for a customer-support classification task?
  2. What types of test cases would you include to ensure toxicity detection doesn’t miss edge cases like threats or harassment?
  3. After scoring prompt versions, what steps would you take to export and run the evaluation code in a trial environment?

Key Points

  1. Anthropic’s Workbench update turns prompt creation into a full workflow that includes prompt testing, batch test generation, and benchmarking.
  2. Developers can generate detailed prompts from task descriptions and then run them against specific inputs directly in the console.
  3. Workbench supports iterative refinement by letting users edit prompts and re-run evaluations to see how results change.
  4. Prompt and response versioning enables controlled comparisons between prompt variants and more reliable scoring.
  5. Evaluation suites can be built by generating test cases in the UI or importing them from CSV files.
  6. Scoring and evaluation happen inside the dashboard, reducing reliance on spreadsheets and manual MOS-style processes.
  7. Completed evaluations can be exported as code and rerun across different Anthropic models (such as Haiku and Sonnet) for consistent comparisons.

Highlights

Workbench adds a developer-grade evaluation suite: generate prompts, test them on inputs, and benchmark prompt versions in one place.
The YouTube toxicity example shows Workbench-generated prompts that include explicit step-by-step criteria for toxicity and reply-worthiness.
Versioning and scoring make prompt iteration measurable, not guesswork, and the results can be exported as runnable code.
The same evaluation setup can be rerun across models like Haiku and Sonnet to compare performance under identical test cases.
