
Build Hour: Agentic Tool Calling

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

Agentic tool calling pairs result-focused reasoning with tool execution so models can pursue goals across long horizons and recover from tool failures.

Briefing

Agentic tool calling is being positioned as the bridge between “reasoning” and “doing,” enabling goal-driven models that can plan across long horizons, call tools repeatedly, recover from failures, and stay consistent through tens or hundreds of function calls. The practical payoff is a shift from chat-only interactions toward long-horizon “tasks” that have an end state, run with supporting infrastructure, and can be evaluated by outcomes rather than turn-by-turn correctness.

The session ties that concept to a broader 2025 push toward agents, highlighting new capabilities such as deep research, o3, and Codex, plus the responses API’s growing ecosystem. A key thread is how reasoning training and tool execution combine: models are trained to optimize for correct results (not step-by-step scripts), then tool calling supplies the means to fetch information and take actions. When those two pieces work together, the system becomes “resourceful” and “robust to recovery,” with the ability to course-correct after tool failures and continue toward the goal across long sequences.

From there, the talk reframes “tasks” as a new abstraction layer above chat. A task isn’t just a prompt; it’s an agent definition (what it should do), infrastructure (how it runs, manages state, retries, and parallelism), a product layer (how users interact, how progress is surfaced, and what context is provided), and evaluation (how success is graded). Goal specification is emphasized as different from step-by-step prompting: instead of dictating the path, the system is given the desired end state, along with the tools needed to reach it.

A hands-on coding segment demonstrates an agentic task system for a customer-service backlog. The implementation starts with the agents SDK to wrap the tool-calling loop, then defines mock functions such as retrieving user data, fetching order details, and initiating refunds. The agent is instructed to “get all the context you need up front” and then complete without further back-and-forth, illustrating how end-state instructions can reduce unnecessary interaction.
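The demo's mock tools can be sketched as plain Python functions. Everything below is illustrative (function names and return fields are assumptions, not the session's exact code); in the actual build each function would be registered with the agents SDK (e.g. via its function-tool decorator) and passed to the agent:

```python
# Illustrative mock tools for a customer-service refund agent. In the demo,
# each function would be wrapped as a tool and handed to the agent along with
# end-state instructions ("get all the context you need up front, then finish").

def get_user_data(user_id: str) -> dict:
    """Return mock account details for a user."""
    return {"user_id": user_id, "name": "Ada", "tier": "premium"}

def get_order_details(order_id: str) -> dict:
    """Return mock order details, including refund eligibility."""
    return {"order_id": order_id, "total": 42.00, "refundable": True}

def initiate_refund(order_id: str, amount: float) -> dict:
    """Pretend to start a refund and report its status."""
    return {"order_id": order_id, "refunded": amount, "status": "pending"}
```

Because the tools return everything the agent needs in one pass, the end-state instruction ("complete without further back-and-forth") becomes feasible: the model fetches user and order context first, then acts.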

The product/infrastructure integration is where the architecture becomes concrete. A simple “foreground” approach streams events over an SSE connection while the client stays connected. For real-world long-running work, the session then builds a background-task pattern: a backend task endpoint creates a task and returns a task ID, while a separate SSE events endpoint streams task updates to the frontend. An async worker runs the agent with streaming, publishes response events to the event queue, and updates task status until completion.
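The background-task pattern can be sketched with only the standard library, standing in for a real web framework and agent runtime (endpoint names, event shapes, and the stubbed "agent" are all assumptions):

```python
# Minimal sketch of the background-task pattern using stdlib asyncio.
# In a real build, create_task would sit behind a POST endpoint and
# stream_events behind an SSE endpoint; the worker would run the actual agent.
import asyncio
import uuid

TASKS: dict[str, dict] = {}  # task_id -> {"status": ..., "events": queue}

async def create_task(goal: str) -> str:
    """'Task endpoint': register a task, start a worker, return the task ID."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "running", "events": asyncio.Queue()}
    asyncio.create_task(run_agent(task_id, goal))  # fire-and-forget worker
    return task_id

async def run_agent(task_id: str, goal: str) -> None:
    """Async worker: run the (stubbed) agent, publish events, update status."""
    events = TASKS[task_id]["events"]
    for step in ("looked up user", "fetched order", "issued refund"):
        await events.put({"type": "progress", "detail": step})
    TASKS[task_id]["status"] = "completed"
    await events.put({"type": "done", "goal": goal})

async def stream_events(task_id: str):
    """'SSE events endpoint': yield events until the task reports completion."""
    events = TASKS[task_id]["events"]
    while True:
        event = await events.get()
        yield event
        if event["type"] == "done":
            break

async def main():
    task_id = await create_task("refund order 123")
    return [e async for e in stream_events(task_id)]
```

The key property is the decoupling: because the worker writes to a queue rather than to the client connection, the frontend can drop the SSE stream and reattach later while the task keeps running.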

To make progress visible without exposing raw internal tool calls, the demo adds a to-do mechanism. The model receives functions that can update a task’s to-dos; the frontend renders those to-dos as progress indicators, while the underlying function-call details are filtered out. This creates a user experience that feels transparent and “magical” without building a separate monitoring UI.
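The to-do mechanism reduces to two model-callable tools that mutate per-task state, plus a frontend filter. A minimal sketch, with illustrative names and event shapes:

```python
# Sketch of the to-do progress mechanism. The model is given tools that
# update a per-task to-do list; the frontend renders the to-dos and filters
# out raw function-call events. Names and shapes here are assumptions.

task_state = {"todos": []}  # per-task state the tool functions close over

def add_todo(title: str) -> dict:
    """Tool the model calls to announce a step it plans to take."""
    todo = {"title": title, "done": False}
    task_state["todos"].append(todo)
    return todo

def complete_todo(title: str) -> dict:
    """Tool the model calls to check off a finished step."""
    for todo in task_state["todos"]:
        if todo["title"] == title:
            todo["done"] = True
            return todo
    return {"error": f"no such todo: {title}"}

def visible_events(events: list[dict]) -> list[dict]:
    """Frontend filter: surface to-do updates, hide raw function-call events."""
    return [e for e in events if e["type"] != "function_call"]
```

The design choice is that progress is model-authored (the agent narrates its own plan as to-dos) rather than inferred from tool traffic, which is what makes the UI feel transparent without a separate monitoring surface.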

The Q&A extends the theme with practical guidance: use Python for strict sequential/conditional tool chains when needed; manage memory via the model’s context plus external stores like vector search; keep tool counts to a reasonable range (roughly under 20) and use handoffs when tool sets grow; combine hosted tools (including MCP) with custom functions; and note that the responses API supports MCP, enabling long background runs when everything can be handled remotely. The session closes by pointing to shared repos, a practical guide for building agents, and an upcoming build hour focused on image gen.

Cornell Notes

Agentic tool calling combines result-focused reasoning with tool execution so models can pursue a goal, call tools repeatedly, and recover from failures over long horizons. The session frames “tasks” as the key abstraction for moving beyond chat: a task includes an agent (goal + tools), infrastructure (state, retries, parallelism, runtime), a product layer (user interaction and progress visibility), and evaluation (outcome grading). A live build shows how to implement tasks using the agents SDK and the responses API, then integrate them with a backend that streams events via SSE. For user trust, progress can be surfaced through model-updated to-dos while internal tool-call details are filtered from the UI. The Q&A adds practical rules of thumb for orchestration, memory, tool counts, and MCP usage.

What makes “agentic tool calling” different from basic tool calling?

It’s the combination of reasoning and tools. Reasoning is trained toward correct end results (not scripted steps), and tools provide the ability to fetch information and take actions. Together, the model becomes goal-oriented and can handle long-horizon work with many tool calls, course-correct after tool failures, and remain consistent across extended sequences.
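The "robust to recovery" behavior can be illustrated with a toy loop, assuming a stub tool that fails on its first call (none of this is the session's code; it only shows the shape of the pattern):

```python
# Toy sketch of goal-directed recovery: a tool failure is fed back into the
# loop as an observation instead of ending the run, and the next step retries.

def flaky_lookup(attempts: dict) -> str:
    """Stub tool that times out once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("upstream timeout")
    return "order found"

def run_to_goal(max_steps: int = 5) -> list[str]:
    attempts = {"n": 0}
    transcript: list[str] = []
    for _ in range(max_steps):
        try:
            result = flaky_lookup(attempts)
        except Exception as err:
            transcript.append(f"tool error: {err}")  # error goes back into context
            continue                                 # model re-plans and retries
        transcript.append(result)
        return transcript                            # goal reached
    return transcript
```

A result-trained model treats the error message as just more context: the run continues toward the end state rather than stopping at the first failed call.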

Why does the session treat “tasks” as a new primitive rather than just longer prompts?

Tasks add structure around an end state and the full lifecycle needed to reach it. The framework includes: (1) an agent definition (goal specification + tools), (2) infrastructure (how the task runs, manages state, retries, parallelism, and runtime environment), (3) a product layer (how users interact, what context is provided, and how progress is shown), and (4) evaluation (grading end results rather than turn-by-turn chat behavior).

How does the demo architecture support long-running work without keeping the user connected?

It uses a background-task pattern. A task endpoint creates a task and returns a task ID, while a separate SSE events endpoint stays open to stream updates. An async worker runs the agent in streaming mode, publishes response events to an event queue, and updates the task status until completion. The frontend can disconnect and reconnect while still receiving updates.

How does the demo surface progress to users without exposing internal tool calls?

It adds a to-dos field to the task and provides the model functions to add and check off to-dos using the task context. The frontend renders these to-dos as progress indicators, while the UI filters out the underlying function-call events. This gives users insight into progress while keeping the interaction natural.

When should developers avoid relying on the model for strict sequential/conditional tool chains?

If strict sequencing or conditional logic is required, the guidance is to implement it in code (Python). For example, if a workflow must call three functions in a specific order, wrapping those calls inside a single Python function is more reliable and can reduce latency and cost compared with hoping the model orchestrates the sequence correctly.
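The "move strict sequencing into code" guidance can be sketched as follows: instead of exposing three tools and hoping the model orders them correctly, expose one tool that runs the fixed pipeline (function names are illustrative):

```python
# Three steps that must happen in a fixed order, fused into one tool.

def validate_order(order_id: str) -> dict:
    return {"order_id": order_id, "valid": True}

def compute_refund(order: dict) -> float:
    return 42.00 if order["valid"] else 0.0

def issue_refund(order_id: str, amount: float) -> dict:
    return {"order_id": order_id, "refunded": amount}

def process_refund(order_id: str) -> dict:
    """The single tool the model sees; ordering is guaranteed by Python."""
    order = validate_order(order_id)        # step 1: always first
    amount = compute_refund(order)          # step 2: depends on step 1
    return issue_refund(order_id, amount)   # step 3: depends on step 2
```

This also replaces three model round-trips with one, which is where the latency and cost savings mentioned in the answer come from.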

What role does MCP play in the responses API and background execution?

MCP support lets the responses API call remote MCP servers as tools. That enables “background mode” style long runs where the agent doesn’t need to wait for local function execution—remote tools (including things like file search and image generation) can be handled through MCP, and the system can check back later for completion.
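Wiring a remote MCP server into the responses API is mostly configuration. A sketch of the tool definition (the server URL and label are placeholders, and the commented-out call is shown for shape only, not as the session's exact code):

```python
# MCP tool configuration for the responses API. The "mcp" tool type points the
# API at a remote MCP server; the label and URL below are placeholders.
mcp_tool = {
    "type": "mcp",
    "server_label": "acme_support",           # placeholder label
    "server_url": "https://example.com/mcp",  # placeholder MCP server
    "require_approval": "never",              # skip per-call approval prompts
}

# from openai import OpenAI
# client = OpenAI()
# response = client.responses.create(
#     model="gpt-4.1",       # placeholder model choice
#     tools=[mcp_tool],
#     input="Resolve ticket #123",
#     background=True,       # run asynchronously; poll for completion later
# )
```

Because the tool executes on the remote server, the API never blocks waiting for a local function result, which is what makes the long background runs described above possible.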

Review Questions

  1. How does goal specification for tasks differ from step-by-step prompting, and why does that matter for long-horizon tool use?
  2. Describe the difference between the demo’s foreground streaming approach and its background task + SSE events architecture.
  3. What are two strategies mentioned for managing memory in agents handling long-running tasks?

Key Points

  1. Agentic tool calling pairs result-focused reasoning with tool execution so models can pursue goals across long horizons and recover from tool failures.

  2. “Tasks” are treated as an end-to-end abstraction: agent goal + tools, infrastructure for running and state management, product UX for progress/context, and evaluation focused on outcomes.

  3. A practical task system can be built with the agents SDK and the responses API, using streaming events to keep the UI responsive.

  4. For long-running work, a background architecture (task creation + SSE event streaming) lets clients disconnect and reconnect while tasks continue.

  5. User trust can be improved by rendering model-updated to-dos as progress, while filtering out raw internal tool-call details.

  6. Strict sequential/conditional tool workflows are better expressed in code (Python) rather than relying entirely on the model’s orchestration.

  7. The responses API supports MCP, enabling remote tool calls and background execution when local functions aren’t needed.

Highlights

  • Agentic tool calling is framed as reasoning trained for correct outcomes plus tool access, producing goal-driven behavior that stays consistent over many tool calls.
  • The demo’s background task architecture separates task creation from event streaming: one endpoint returns a task ID, another SSE endpoint continuously pushes updates.
  • Progress can be made “magical” by letting the model update a task’s to-dos through tool functions, with the UI rendering those to-dos as user-facing status.
  • MCP support in the responses API enables long background runs where remote tools handle most work without blocking for local function execution.

Topics

Mentioned

  • Sarah Urbonus
  • Alain
  • MCP
  • SSE
  • RL