Anthropic's Latest Winner - Workbench
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Anthropic’s Workbench update turns prompt creation into a full workflow that includes prompt testing, batch test generation, and benchmarking.
Briefing
Anthropic has overhauled its developer “Workbench” inside the Anthropic console, turning prompt building into a full testing and benchmarking workflow. Instead of treating prompts as static text, developers can now generate prompts from scratch, run them against targeted test cases, and assemble a versioned evaluation suite to measure how well different prompt designs handle context and produce reliable outputs—then export code for use in production.
The core shift is the addition of prompt generation plus prompt testing in one place. A developer can describe a task in plain language—such as drafting a response to a YouTube comment that classifies whether the comment is toxic and whether it deserves a reply—and then click “generate prompt.” Claude produces a detailed, ready-to-run prompt that includes a system-style instruction block, the input comment, and step-by-step decision criteria. In this example, the generated instructions explicitly guide toxicity detection (offensive language, personal attacks, hate speech, threats, harassment, excessive negativity, trolling) and then determine whether a reply is warranted.
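The generated prompt's structure can be pictured as a system-style instruction block paired with the input comment. The sketch below is illustrative only: the wording and function names are assumptions, not Workbench's actual generated output.

```python
# A hypothetical sketch of the kind of structured prompt Workbench's
# "generate prompt" step produces for the YouTube-comment example.
# The exact wording is illustrative, not Workbench's real output.

SYSTEM_PROMPT = """\
You are a moderator for a YouTube channel. Given a comment, decide:
1. Is the comment toxic? Look for offensive language, personal attacks,
   hate speech, threats, harassment, excessive negativity, or trolling.
2. Does the comment deserve a reply from the creator?

Respond with a toxicity label (not toxic / moderate toxicity / toxic),
a reply/no-reply decision, and brief reasoning for both.
"""

def build_messages(comment: str) -> list[dict]:
    """Pair the generated instructions with a specific input comment,
    mirroring Workbench's system block + user input layout."""
    return [{"role": "user", "content": f"Comment to classify:\n{comment}"}]
```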
From there, Workbench moves into evaluation mode. The console prompts the user to supply the specific YouTube comment, runs the prompt, and returns a classification result along with reasoning and a final label (e.g., not toxic, moderate toxicity). Developers can generate multiple test cases automatically, run them, and then refine the prompt based on observed weaknesses. The workflow supports editing the prompt directly and re-running evaluations to see how changes affect outcomes.
A key feature is versioning. Workbench tracks prompt versions and response versions, enabling side-by-side comparisons—such as a second attempt where the developer slightly rearranges instructions (for example, moving content between user and system sections). The console can then score results across versions, letting users label performance (e.g., “excellent” or “fair”) and quickly identify which prompt variant performs better.
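The version-comparison bookkeeping can be sketched as follows. The grade labels and scale below are assumptions for illustration, not Workbench's internal schema; only "excellent" and "fair" are labels mentioned in the source.

```python
# Minimal sketch of comparing two prompt versions by human-assigned
# grades, as Workbench's scoring across versions does in the console.
# The grading scale and variable names are illustrative assumptions.

GRADE_VALUES = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}

def score_version(grades: list[str]) -> float:
    """Average the per-test-case grades for one prompt version."""
    return sum(GRADE_VALUES[g] for g in grades) / len(grades)

v1 = score_version(["fair", "good", "fair"])       # original prompt
v2 = score_version(["excellent", "good", "good"])  # instructions rearranged
better = "v2" if v2 > v1 else "v1"
```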
Workbench also streamlines what used to be a spreadsheet-driven manual process. Previously, teams often exported test cases and relied on human scoring methods such as MOS (mean opinion score) to compare prompt quality. Now, evaluation and scoring live in the same dashboard, with the ability to import test cases from CSV and to generate additional cases on demand.
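A CSV of test cases for this workflow could look like the sketch below, parsed with Python's standard `csv` module. The single `comment` column is an assumption for the YouTube example, not a documented Workbench import schema.

```python
# Sketch of importing prompt test cases from CSV: one row per case,
# keyed by column header. The column name "comment" is an assumption.
import csv
import io

CSV_DATA = """\
comment
"Great video, learned a lot!"
"You're an idiot, this is garbage."
First!!!
"""

def load_test_cases(text: str) -> list[dict]:
    """Parse one test case per row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))

cases = load_test_cases(CSV_DATA)
```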
Once the evaluation suite is complete, developers can click “code” to export runnable code that reproduces the prompt set and evaluation configuration. They can also adjust runtime parameters such as maximum tokens and rerun the suite against different Anthropic models (the transcript mentions comparing Haiku and Sonnet). The result is a repeatable loop: generate prompts, test and edit them, benchmark versions, and export the workflow for trial or production deployments—directly inside Anthropic’s console.
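A hedged sketch of what such exported code might resemble, using the Anthropic Python SDK's `messages.create` call: the model ID and `max_tokens` are the knobs you would change to rerun the same suite against, say, Haiku versus Sonnet. The prompt text, model aliases, and helper name here are assumptions, not Workbench's actual export.

```python
# Sketch (not Workbench's actual export): the same prompt run through the
# Anthropic Python SDK, parameterized by model and max_tokens so the
# evaluation can be rerun against different models.
import os

SYSTEM = ("Classify whether the YouTube comment below is toxic "
          "and whether it deserves a reply.")  # illustrative prompt

def build_request(comment: str, model: str, max_tokens: int = 512) -> dict:
    """Assemble messages.create() arguments; swapping the model string
    reruns the same evaluation against a different Anthropic model."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": SYSTEM,
        "messages": [{"role": "user", "content": comment}],
    }

# The live call is guarded so the sketch runs without credentials.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic()
    for model in ("claude-3-5-haiku-latest", "claude-3-5-sonnet-latest"):
        resp = client.messages.create(**build_request("First!!!", model))
        print(model, resp.content[0].text)
```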
Cornell Notes
Anthropic’s updated Workbench in the Anthropic console adds an end-to-end workflow for prompt engineering: generate prompts, run them on test cases, and benchmark multiple prompt versions. Developers can create a prompt from a task description (e.g., classifying YouTube comments for toxicity and reply-worthiness), then fill in inputs and see labeled outputs with reasoning. Workbench supports generating batches of test cases, importing cases from CSV, and versioning prompts and responses so teams can compare variants. Scoring and evaluation happen inside the dashboard, and the final setup can be exported as code for running in trial or production. The same evaluation suite can be rerun across different Anthropic models to compare performance.
- What new capability does Anthropic’s Workbench add for developers beyond prompt generation?
- How does the transcript’s YouTube-comment example illustrate prompt structure in Workbench?
- Why does versioning matter in Workbench evaluations?
- How does Workbench replace or reduce spreadsheet-based prompt scoring workflows?
- What does exporting code enable after completing prompt evaluations?
Review Questions
- How would you design a Workbench evaluation suite to compare two prompt variants for a customer-support classification task?
- What types of test cases would you include to ensure toxicity detection doesn’t miss edge cases like threats or harassment?
- After scoring prompt versions, what steps would you take to export and run the evaluation code in a trial environment?
Key Points
1. Anthropic’s Workbench update turns prompt creation into a full workflow that includes prompt testing, batch test generation, and benchmarking.
2. Developers can generate detailed prompts from task descriptions and then run them against specific inputs directly in the console.
3. Workbench supports iterative refinement by letting users edit prompts and re-run evaluations to see how results change.
4. Prompt and response versioning enables controlled comparisons between prompt variants and more reliable scoring.
5. Evaluation suites can be built by generating test cases in the UI or importing them from CSV files.
6. Scoring and evaluation happen inside the dashboard, reducing reliance on spreadsheets and manual MOS-style processes.
7. Completed evaluations can be exported as code and rerun across different Anthropic models (such as Haiku and Sonnet) for consistent comparisons.