Build Hour: AgentKit
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
AgentKit replaces brittle, code-heavy agent orchestration with versioned visual workflows, secure connector management, and built-in evaluation tooling.
Briefing
AgentKit is positioned as a faster, safer way to build multi-step AI agents: it shifts agent development from hand-coded orchestration and fragile prompt tinkering to a versioned, visual workflow system with built-in evaluation, automated prompt optimization, and customizable UI deployment.
For months, building agents required heavy engineering: orchestration logic lived in code, updates could introduce breaking changes, secure tool connections demanded custom work, and evaluation often meant manually exporting data into separate tooling and stitching results together. AgentKit targets those pain points directly. Workflows can be assembled visually in a workflow builder, and they're versioned to avoid breaking changes during iteration. A connector registry ("admin center") is used to connect data and tools securely. The platform also includes evaluation tooling with third-party model support, plus an automated prompt optimization feature designed to replace slow trial-and-error prompt rewriting.
The stack ties these pieces together end to end. Agent Builder lets teams choose which models to deploy, connect tools, write and optimize prompts, add guardrails for unexpected inputs, and then deploy the resulting workflow to ChatKit. From there, agents can be optimized at scale using real-world traces and evaluation against production-like data.
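Agent Builder configures guardrails visually, but the same idea can be sketched in code. Below is a minimal sketch using the input-guardrail hook from the OpenAI Agents SDK (`pip install openai-agents`); the off-topic check and agent names are illustrative assumptions, not details from the session.

```python
# A naive input guardrail: trip when the request clearly isn't a
# go-to-market task. Agent Builder expresses this step visually.
from agents import (
    Agent,
    GuardrailFunctionOutput,
    RunContextWrapper,
    input_guardrail,
)


@input_guardrail
async def block_off_topic(
    ctx: RunContextWrapper, agent: Agent, user_input: str
) -> GuardrailFunctionOutput:
    # Placeholder heuristic; a real guardrail would use a classifier model.
    off_topic = "password" in user_input.lower()
    return GuardrailFunctionOutput(
        output_info={"off_topic": off_topic},
        tripwire_triggered=off_topic,  # halts the run when triggered
    )


assistant = Agent(
    name="gtm_assistant",  # hypothetical name
    instructions="Help with go-to-market research and outreach.",
    input_guardrails=[block_off_topic],
)
```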
A live go-to-market assistant demo made the workflow concrete. The build starts with a "question classifier" agent that forces structured output, routing each incoming request into one of several specialized paths. A state variable captures the classification result, and conditional branching sends the request to the appropriate downstream agent. One branch uses an MCP server to query data in Databricks, including authentication via a personal access token and tool selection to constrain what the model can do. Another branch performs information gathering via web search, extracting fields like company legal name, employee count, description, annual revenue, and geography into a structured schema.
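The demo assembles this pattern visually in Agent Builder; for readers who prefer code, here is a rough equivalent of the classifier-and-branch pattern written against the OpenAI Agents SDK. The category names, instructions, and branch agents are hypothetical stand-ins for the demo's actual nodes.

```python
# Classifier-and-branch sketch: structured output makes the routing
# decision machine-readable, like the workflow's state variable.
from enum import Enum
from pydantic import BaseModel
from agents import Agent, Runner


class Category(str, Enum):
    DATA_QUERY = "data_query"  # route to the Databricks/MCP branch
    RESEARCH = "research"      # route to the web-search branch
    EMAIL = "email"            # route to the email-generation branch


class Classification(BaseModel):
    category: Category
    rationale: str


classifier = Agent(
    name="question_classifier",
    instructions="Classify the incoming go-to-market request into one category.",
    output_type=Classification,  # forces structured output
)

branches = {
    Category.DATA_QUERY: Agent(name="databricks_agent", instructions="Query account data."),
    Category.RESEARCH: Agent(name="research_agent", instructions="Research the company via web search."),
    Category.EMAIL: Agent(name="email_agent", instructions="Draft an outreach email."),
}


def route(request: str) -> str:
    # The structured result plays the role of the workflow's state variable;
    # the dict lookup stands in for conditional branching.
    decision = Runner.run_sync(classifier, request).final_output
    return Runner.run_sync(branches[decision.category], request).final_output
```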
Email generation and lead enhancement are handled by separate agents, with retrieval from vector stores (e.g., PDFs containing email-writing SOPs and campaign details) to ground outputs in campaign-specific material. The demo also highlights richer outputs: Agent Builder supports widgets rather than only text or JSON, enabling UI components that can be rendered in ChatKit. Widgets can be generated or customized through natural language, and the system can even propagate JavaScript-driven UI behavior into a website context.
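The vector-store grounding can be sketched with the Agents SDK's file-search tool, assuming the SOP and campaign PDFs have already been uploaded to an OpenAI vector store. The vector store ID and instructions below are placeholders; the demo configures this retrieval step in Agent Builder rather than in code.

```python
# Ground the email agent in campaign documents via file search.
from agents import Agent, FileSearchTool

email_agent = Agent(
    name="email_agent",  # hypothetical name
    instructions=(
        "Draft outreach emails. Follow the email-writing SOP and reuse "
        "campaign details retrieved from the attached documents."
    ),
    tools=[
        FileSearchTool(
            vector_store_ids=["vs_campaign_docs_placeholder"],  # placeholder ID
            max_num_results=3,  # keep retrieved context focused
        )
    ],
)
```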
The hardest part, trust, gets its own focus in the evaluation walkthrough. Henry's eval demo shows how to test individual nodes first, since the weakest component can break the whole system. An "evaluate" button opens a dataset-driven eval UI where teams run generation, add human annotations (thumbs up/down and free-text feedback), and attach graders that enforce rubric-like requirements (for example, requiring explicit buy/sell/hold recommendations and competitor comparisons). Automated prompt optimization then rewrites prompts based on those annotations and grader outcomes.
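To make the rubric-grader idea concrete, here is a deliberately simple string-matching grader in plain Python. It is a conceptual stand-in for the demo's grader (which required explicit buy/sell/hold recommendations and competitor comparisons), not the platform's actual grader configuration.

```python
# A rubric-style grader: check that an analyst-style answer satisfies
# two explicit requirements, returning per-criterion results.
import re


def grade_recommendation(output: str) -> dict:
    has_recommendation = bool(re.search(r"\b(buy|sell|hold)\b", output, re.IGNORECASE))
    mentions_competitors = "competitor" in output.lower()
    return {
        "pass": has_recommendation and mentions_competitors,
        "has_recommendation": has_recommendation,
        "mentions_competitors": mentions_competitors,
    }
```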
Finally, end-to-end confidence comes from trace grading at scale. Instead of manually inspecting traces, teams define grader rubrics over traces and run “grade all” to surface problematic spans across many runs. Best practices emphasized starting evals early with small but high-quality datasets (roughly 10–20 examples to begin), using real human data rather than purely synthetic inputs, and investing time in annotation and aligning LLM graders with what “good” means.
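A rough sketch of what "grade all" automates: loop an LLM grader with a rubric over stored traces and collect verdicts, instead of reading each trace by hand. The rubric text, trace format, and model choice below are assumptions for illustration, using the OpenAI Python SDK's Responses API.

```python
# Trace grading at scale: apply one rubric to many traces.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading an agent trace. Fail the trace if any tool call errored, "
    "if the final answer lacks an explicit recommendation, or if the branch "
    "taken does not match the classifier's category. Reply PASS or FAIL, "
    "followed by a one-sentence reason."
)


def grade_traces(traces: list[str], model: str = "gpt-4.1") -> list[dict]:
    results = []
    for trace in traces:
        response = client.responses.create(
            model=model,
            input=f"{RUBRIC}\n\nTrace:\n{trace}",
        )
        verdict = response.output_text
        results.append({"pass": verdict.startswith("PASS"), "verdict": verdict})
    return results
```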
The session closes with examples of adoption and impact, including claims of faster prototyping and measurable efficiency gains, plus Q&A clarifying how AgentKit relates to the Agents SDK, how built-in MCP servers differ from out-of-the-box ones, and when to use classifier-and-branching architectures to avoid tool overload.
Cornell Notes
AgentKit aims to make agent development faster and more reliable by combining visual workflow building, secure tool/data connections, built-in evaluation, automated prompt optimization, and deployable UI via ChatKit. Workflows are versioned to reduce breaking changes, and connector-registry tooling helps manage safe integrations. A demo built a go-to-market assistant using a classifier agent with structured outputs to route requests into specialized sub-agents for Databricks querying, web research, email drafting, and lead enhancement, grounded with vector-store documents. Evaluation is treated as a first-class workflow: teams test individual nodes with datasets, human annotations, and rubric-style graders, then grade traces at scale to find failure patterns. The approach matters because it turns "agent quality" into measurable, iterative improvements rather than manual inspection and guesswork.
- Why does AgentKit emphasize versioned, visual workflow building instead of code-only orchestration?
- How does the go-to-market assistant demo route work to different sub-agents?
- What role do MCP servers play in connecting tools like Databricks?
- How does the evaluation workflow turn agent behavior into something measurable?
- Why grade traces at scale instead of manually inspecting every run?
- What guidance was given on eval dataset size and composition?
Review Questions
- What structured-output mechanism enables conditional branching in the demo workflow, and how is the result reused later in the workflow?
- Describe the sequence of steps used to evaluate a single node: dataset setup, generation, annotations, graders, and automated prompt optimization.
- What is the difference between node-level evaluation and trace grading, and why does each step matter for building trustworthy multi-agent systems?
Key Points
1. AgentKit replaces brittle, code-heavy agent orchestration with versioned visual workflows, secure connector management, and built-in evaluation tooling.
2. Workflows can be assembled from specialized sub-agents, using structured outputs and stateful variables to route requests reliably.
3. MCP servers standardize tool access; authentication and allowed functions can be configured to keep model actions constrained and safer.
4. Grounding outputs with vector-store documents (e.g., email SOPs and campaign PDFs) helps generate emails that match campaign context rather than generic text.
5. Evaluation is integrated into the build process: teams test nodes with datasets, human annotations, and rubric graders before scaling to trace grading.
6. Automated prompt optimization uses eval signals (annotations and grader outputs) to rewrite prompts, reducing manual prompt-engineering cycles.
7. For evals, start early with small but high-quality datasets (roughly 10–20 examples) and prioritize real user data over large synthetic sets.