Build Hour: AgentKit
Based on OpenAI's Build Hour video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
AgentKit replaces brittle, code-heavy agent orchestration with versioned visual workflows, secure connector management, and built-in evaluation tooling.
Briefing
AgentKit is positioned as a faster, safer way to build multi-step AI agents: it shifts agent development from hand-coded orchestration and fragile prompt tinkering to a versioned, visual workflow system with built-in evaluation, automated prompt optimization, and customizable UI deployment.
For months, building agents required heavy engineering: orchestration logic lived in code, updates could introduce breaking changes, secure tool connections demanded custom work, and evaluation often meant manually exporting data into separate tooling and stitching results together. AgentKit targets those pain points directly. Workflows can be assembled visually in a workflow builder, and they're versioned to avoid breaking changes during iteration. A connector registry ("admin center") is used to connect data and tools securely. The platform also includes evaluation tooling with third-party model support, plus an automated prompt optimization feature designed to replace slow trial-and-error prompt rewriting.
The stack ties these pieces together end to end. Agent Builder lets teams choose which models to deploy, connect tools, write and optimize prompts, add guardrails for unexpected inputs, and then deploy the resulting workflow to ChatKit. From there, agents can be optimized at scale using real-world traces and evaluation against production-like data.
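Agent Builder configures guardrails visually, but the same idea can be sketched in code. Below is a minimal sketch using the input-guardrail hook from the OpenAI Agents SDK (`pip install openai-agents`); the off-topic check and agent names are illustrative assumptions, not details from the session.

```python
# A naive input guardrail: trip when the request clearly isn't a
# go-to-market task. Agent Builder expresses this step visually.
from agents import (
    Agent,
    GuardrailFunctionOutput,
    RunContextWrapper,
    input_guardrail,
)


@input_guardrail
async def block_off_topic(
    ctx: RunContextWrapper, agent: Agent, user_input: str
) -> GuardrailFunctionOutput:
    # Placeholder heuristic; a real guardrail would use a classifier model.
    off_topic = "password" in user_input.lower()
    return GuardrailFunctionOutput(
        output_info={"off_topic": off_topic},
        tripwire_triggered=off_topic,  # halts the run when triggered
    )


assistant = Agent(
    name="gtm_assistant",  # hypothetical name
    instructions="Help with go-to-market research and outreach.",
    input_guardrails=[block_off_topic],
)
```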
A live go-to-market assistant demo made the workflow concrete. The build starts with a "question classifier" agent that forces structured output, routing each incoming request into one of several specialized paths. A state variable captures the classification result, and conditional branching sends the request to the appropriate downstream agent. One branch uses an MCP server to query data in Databricks, including authentication via a personal access token and tool selection to constrain what the model can do. Another branch performs information gathering via web search, extracting fields like company legal name, employee count, description, annual revenue, and geography into a structured schema.
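The demo assembles this pattern visually in Agent Builder; for readers who prefer code, here is a rough equivalent of the classifier-and-branch pattern written against the OpenAI Agents SDK. The category names, instructions, and branch agents are hypothetical stand-ins for the demo's actual nodes.

```python
# Classifier-and-branch sketch: structured output makes the routing
# decision machine-readable, like the workflow's state variable.
from enum import Enum
from pydantic import BaseModel
from agents import Agent, Runner


class Category(str, Enum):
    DATA_QUERY = "data_query"  # route to the Databricks/MCP branch
    RESEARCH = "research"      # route to the web-search branch
    EMAIL = "email"            # route to the email-generation branch


class Classification(BaseModel):
    category: Category
    rationale: str


classifier = Agent(
    name="question_classifier",
    instructions="Classify the incoming go-to-market request into one category.",
    output_type=Classification,  # forces structured output
)

branches = {
    Category.DATA_QUERY: Agent(name="databricks_agent", instructions="Query account data."),
    Category.RESEARCH: Agent(name="research_agent", instructions="Research the company via web search."),
    Category.EMAIL: Agent(name="email_agent", instructions="Draft an outreach email."),
}


def route(request: str) -> str:
    # The structured result plays the role of the workflow's state variable;
    # the dict lookup stands in for conditional branching.
    decision = Runner.run_sync(classifier, request).final_output
    return Runner.run_sync(branches[decision.category], request).final_output
```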
Email generation and lead enhancement are handled by separate agents, with retrieval from vector stores (e.g., PDFs containing email-writing SOPs and campaign details) to ground outputs in campaign-specific material. The demo also highlights richer outputs: Agent Builder supports widgets rather than only text or JSON, enabling UI components that can be rendered in ChatKit. Widgets can be generated or customized through natural language, and the system can even propagate JavaScript-driven UI behavior into a website context.
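The vector-store grounding can be sketched with the Agents SDK's file-search tool, assuming the SOP and campaign PDFs have already been uploaded to an OpenAI vector store. The vector store ID and instructions below are placeholders; the demo configures this retrieval step in Agent Builder rather than in code.

```python
# Ground the email agent in campaign documents via file search.
from agents import Agent, FileSearchTool

email_agent = Agent(
    name="email_agent",  # hypothetical name
    instructions=(
        "Draft outreach emails. Follow the email-writing SOP and reuse "
        "campaign details retrieved from the attached documents."
    ),
    tools=[
        FileSearchTool(
            vector_store_ids=["vs_campaign_docs_placeholder"],  # placeholder ID
            max_num_results=3,  # keep retrieved context focused
        )
    ],
)
```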
The hardest part, trust, gets its own focus in the evaluation walkthrough. Henry's eval demo shows how to test individual nodes first, since the weakest component can break the whole system. An "evaluate" button opens a dataset-driven eval UI where teams run generation, add human annotations (thumbs up/down and free-text feedback), and attach graders that enforce rubric-like requirements (for example, requiring explicit buy/sell/hold recommendations and competitor comparisons). Automated prompt optimization then rewrites prompts based on those annotations and grader outcomes.
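To make the rubric-grader idea concrete, here is a deliberately simple string-matching grader in plain Python. It is a conceptual stand-in for the demo's grader (which required explicit buy/sell/hold recommendations and competitor comparisons), not the platform's actual grader configuration.

```python
# A rubric-style grader: check that an analyst-style answer satisfies
# two explicit requirements, returning per-criterion results.
import re


def grade_recommendation(output: str) -> dict:
    has_recommendation = bool(re.search(r"\b(buy|sell|hold)\b", output, re.IGNORECASE))
    mentions_competitors = "competitor" in output.lower()
    return {
        "pass": has_recommendation and mentions_competitors,
        "has_recommendation": has_recommendation,
        "mentions_competitors": mentions_competitors,
    }
```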
Finally, end-to-end confidence comes from trace grading at scale. Instead of manually inspecting traces, teams define grader rubrics over traces and run “grade all” to surface problematic spans across many runs. Best practices emphasized starting evals early with small but high-quality datasets (roughly 10–20 examples to begin), using real human data rather than purely synthetic inputs, and investing time in annotation and aligning LLM graders with what “good” means.
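A rough sketch of what "grade all" automates: loop an LLM grader with a rubric over stored traces and collect verdicts, instead of reading each trace by hand. The rubric text, trace format, and model choice below are assumptions for illustration, using the OpenAI Python SDK's Responses API.

```python
# Trace grading at scale: apply one rubric to many traces.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading an agent trace. Fail the trace if any tool call errored, "
    "if the final answer lacks an explicit recommendation, or if the branch "
    "taken does not match the classifier's category. Reply PASS or FAIL, "
    "followed by a one-sentence reason."
)


def grade_traces(traces: list[str], model: str = "gpt-4.1") -> list[dict]:
    results = []
    for trace in traces:
        response = client.responses.create(
            model=model,
            input=f"{RUBRIC}\n\nTrace:\n{trace}",
        )
        verdict = response.output_text
        results.append({"pass": verdict.startswith("PASS"), "verdict": verdict})
    return results
```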
The session closes with examples of adoption and impact, including claims of faster prototyping and measurable efficiency gains, plus Q&A clarifying how AgentKit relates to the Agents SDK, how built-in MCP servers differ from out-of-the-box ones, and when to use classifier-and-branching architectures to avoid tool overload.
Cornell Notes
AgentKit aims to make agent development faster and more reliable by combining visual workflow building, secure tool/data connections, built-in evaluation, automated prompt optimization, and deployable UI via ChatKit. Workflows are versioned to reduce breaking changes, and connector-registry tooling helps manage safe integrations. A demo built a go-to-market assistant using a classifier agent with structured outputs to route requests into specialized sub-agents for Databricks querying, web research, email drafting, and lead enhancement, grounded with vector-store documents. Evaluation is treated as a first-class workflow: teams test individual nodes with datasets, human annotations, and rubric-style graders, then grade traces at scale to find failure patterns. The approach matters because it turns "agent quality" into measurable, iterative improvements rather than manual inspection and guesswork.
- Why does AgentKit emphasize versioned, visual workflow building instead of code-only orchestration?
- How does the go-to-market assistant demo route work to different sub-agents?
- What role do MCP servers play in connecting tools like Databricks?
- How does the evaluation workflow turn agent behavior into something measurable?
- Why grade traces at scale instead of manually inspecting every run?
- What guidance was given on eval dataset size and composition?
Review Questions
- What structured-output mechanism enables conditional branching in the demo workflow, and how is the result reused later in the workflow?
- Describe the sequence of steps used to evaluate a single node: dataset setup, generation, annotations, graders, and automated prompt optimization.
- What is the difference between node-level evaluation and trace grading, and why does each step matter for building trustworthy multi-agent systems?
Key Points
1. AgentKit replaces brittle, code-heavy agent orchestration with versioned visual workflows, secure connector management, and built-in evaluation tooling.
2. Workflows can be assembled from specialized sub-agents, using structured outputs and stateful variables to route requests reliably.
3. MCP servers standardize tool access; authentication and allowed functions can be configured to keep model actions constrained and safer.
4. Grounding outputs with vector-store documents (e.g., email SOPs and campaign PDFs) helps generate emails that match campaign context rather than generic text.
5. Evaluation is integrated into the build process: teams test nodes with datasets, human annotations, and rubric graders before scaling to trace grading.
6. Automated prompt optimization uses eval signals (annotations and grader outputs) to rewrite prompts, reducing manual prompt-engineering cycles.
7. For evals, start early with small but high-quality datasets (roughly 10–20 examples) and prioritize real user data over large synthetic sets.