
Harrison Chase - Agents Masterclass from LangChain Founder (LLM Bootcamp)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Agents treat the language model as a reasoning engine that iteratively selects tools, executes them, observes results, and adapts the next action based on those observations.

Briefing

Agent systems are built around a simple but consequential shift: use a language model as a reasoning engine that decides which tool to call next, then adapts its next step based on what the tools return. That flexibility matters because real tasks rarely follow a clean script—especially when answering questions requires multiple hops, when database queries can fail, or when the system must recover from wrong intermediate actions. Instead of hard-coding “do A then do B,” an agent chooses actions dynamically, guided by user input and the results of previous steps.

The practical backbone of this approach is tool usage plus iterative control. A typical agent loop takes a user query, asks the language model to select a tool and provide the tool input, executes the tool, records the tool’s observation, and feeds that observation back into the model. The loop continues until a stopping condition triggers—often when the model decides it has enough information, though hard-coded rules can also force an early return (for example, after a fixed number of steps without reaching a final answer). This design is meant to overcome core language-model limits: models may not know private data, may struggle with exact computation, and can hallucinate. Tools—search APIs, databases, and other external computation—act as the corrective layer.
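The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the "model" is a stub that always picks a calculator tool, and the tool names and step budget are made up for the example.

```python
# Minimal sketch of the agent loop: choose a tool, run it, record the
# observation, feed it back, and stop either when the model decides it is
# done or when a hard-coded step limit triggers. All names here are
# illustrative; a real system would call an LLM instead of stub_model.

def calculator(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def stub_model(query: str, observations: list[str]):
    """Stand-in for the LLM: pick the next action or finish."""
    if not observations:                 # no tool results yet, so act
        return ("calculator", query)
    return ("FINAL", observations[-1])   # enough information to answer

def run_agent(query: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):           # rule-driven stop: step budget
        action, arg = stub_model(query, observations)
        if action == "FINAL":            # model-driven stop
            return arg
        observations.append(TOOLS[action](arg))  # execute, record observation

    return "Stopped after max steps without a final answer."

print(run_agent("2 + 3 * 4"))
```

The key design point is that the sequence of tool calls is not fixed in advance: each observation flows back into the model's next decision.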

The most influential prompting strategy for this tool-using loop is ReAct (“reasoning” plus “acting”). ReAct combines chain-of-thought-style reasoning with explicit tool calls, aiming to improve both decision-making and grounding in real information. The transcript contrasts three approaches using a multi-hop question over Wikipedia: a direct answer attempt fails; “let’s think step by step” improves reasoning but still lacks grounded tool access; and “action-only” can retrieve information but may lose the reasoning structure needed to integrate results. ReAct’s blend is presented as a way to get the best of both: stronger reasoning about what to do, paired with tool calls that fetch the missing facts.

Despite the promise, production reliability remains a major challenge. Agents often invoke tools when they shouldn’t, or fail to pick the right tool when they should. Tool descriptions and instructions can help the model choose appropriately, but scaling to large tool catalogs creates context-length pressure—pushing teams toward tool retrieval (embedding-based selection of the most promising tools) and retrieval of relevant few-shot examples. Another reliability hurdle is turning the model’s tool-call text into executable code; structured outputs (often JSON) and modular output parsers—sometimes with retry-and-fix behavior—are used to reduce parsing failures.

Long-running agents introduce additional failure modes: they lose earlier objectives as prompts grow, they struggle with remembering long tool outputs, and they can drift off track. Common mitigations include re-stating the objective near each action, retrieving only the most relevant prior steps, and summarizing or truncating large API responses. For longer horizons, separating planning from execution is highlighted as a promising reliability tactic.
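Two of the mitigations above are simple enough to sketch directly: truncating oversized tool output and re-stating the objective near each action. The character budget, prompt wording, and history window are illustrative assumptions, not values from the talk.

```python
# Sketches of two long-horizon mitigations: cap the size of tool
# observations, and rebuild each step's prompt with the objective restated
# so the agent does not lose it as the run grows.

def truncate_observation(text: str, max_chars: int = 500) -> str:
    """Keep large API responses from swamping the context window."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"

def build_step_prompt(objective: str, recent_steps: list[str]) -> str:
    """Re-state the objective near the action so long runs don't drift."""
    history = "\n".join(recent_steps[-3:])   # keep only the most recent steps
    return f"Objective: {objective}\nRecent steps:\n{history}\nNext action:"
```

A real system might summarize rather than truncate, but the shape is the same: bound what each step has to remember.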

Finally, the transcript links agent reliability to evaluation and memory. Evaluation must measure not only the final answer but also the agent trajectory—whether actions were correct, inputs were valid, and the number of steps was efficient. Memory is treated as central to modern agentic systems: beyond keeping recent steps, newer work emphasizes personalization, long-term memory via vector stores, and reflection loops that update an internal “state of the world.” Recent projects—AutoGPT, Baby AGI, CAMEL, and Generative Agents—are described as pushing these ideas forward through longer objectives, simulation environments, and time/importance/relevance-weighted memory retrieval with periodic reflection.

Cornell Notes

Agents use a language model as a reasoning engine that selects tools, executes them, observes results, and iterates until a stopping condition. This dynamic loop is meant to fix language-model weaknesses like missing private data, poor exact computation, and hallucinated details by grounding decisions in external search, databases, and APIs. ReAct (“reasoning” + “acting”) is a key prompting strategy that combines structured reasoning with explicit tool calls, improving multi-hop question answering compared with direct answering or reasoning-only approaches. Reliability challenges persist in production: agents must learn when to use tools, how to format tool calls for execution, and how to remember objectives and prior steps during long runs. Memory, reflection, and trajectory-focused evaluation are emerging as central tools for making agent behavior dependable.

Why does using a language model as a reasoning engine change what an agent can do compared with fixed step pipelines?

Fixed pipelines hard-code sequences like “do A then do B.” In contrast, an agent chooses actions based on user input and the outcomes of previous actions. That means the action sequence becomes non-deterministic and adaptive: if a tool returns an error or unexpected result, the next step can change accordingly. This is especially important for multi-hop tasks and for tool-dependent workflows like database querying, where edge cases (wrong table/field names, malformed SQL, or incomplete intermediate results) are common.

What is the core loop of a typical tool-using agent, and how does it decide when to stop?

A typical loop: (1) take the user query, (2) have the language model choose a tool and produce tool input, (3) execute the tool, (4) capture the tool’s observation, and (5) feed the observation back into the model to choose the next action. Stopping conditions can be model-driven (the model decides it has enough to answer) or rule-driven (for example, return after five steps without reaching a final answer).

How does ReAct improve over “direct answering,” “chain-of-thought,” and “action-only” approaches?

In the multi-hop Wikipedia-style example: direct answering fails because the model lacks the needed grounded facts. Chain-of-thought (“let’s think step by step”) can improve reasoning but still stays limited to what the model already knows from training. Action-only can retrieve information via tools, but it may lose the reasoning structure needed to integrate results. ReAct combines reasoning and tool calls, aiming to both decide what to do and ground the answer in retrieved observations.
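ReAct interleaves reasoning and actions as labeled lines in the prompt, which a controller then parses to execute tools. The trace below is a made-up example in the Thought/Action/Observation style (the question, tool name, and regex are illustrative, not taken from the talk).

```python
import re

# Hypothetical ReAct-style trace: "Thought" lines carry reasoning,
# "Action" lines name a tool and its input, "Observation" lines hold
# tool results fed back to the model.
trace = """Thought: I need to find who founded the company first.
Action: search[founder of LangChain]
Observation: Harrison Chase founded LangChain.
Thought: I now know the answer.
Action: finish[Harrison Chase]"""

# A simple regex pulls out each action so a controller can execute it.
ACTION_RE = re.compile(r"Action: (\w+)\[(.*?)\]")
actions = ACTION_RE.findall(trace)
print(actions)  # list of (tool name, tool input) pairs
```

Interleaving matters because each Observation grounds the next Thought; action-only prompting drops the Thought lines, and chain-of-thought-only prompting drops the Action/Observation lines.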

What makes tool selection and tool misuse hard in real deployments?

Agents may not reliably choose tools only when they help. Tool descriptions and prompt instructions can steer selection, but large tool sets create context-length issues. Teams address this with tool retrieval: use embedding search to select a small subset of the most promising tools and pass only those to the model. Another misuse pattern is tool-happy conversational agents; mitigations include explicitly telling the agent it can respond without tools and adding a “return-to-user” tool that encourages appropriate behavior.
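Tool retrieval can be sketched as follows. A real system would use learned embeddings from an embedding model; bag-of-words cosine similarity stands in here, and the tool catalog is hypothetical.

```python
from collections import Counter
import math

# Sketch of embedding-based tool retrieval: score each tool description
# against the query and pass only the top-k tools to the model, keeping
# the prompt small even when the full catalog is large.

TOOL_DESCRIPTIONS = {  # hypothetical catalog
    "sql_query": "run a SQL query against the sales database",
    "web_search": "search the web for current information",
    "calculator": "evaluate arithmetic expressions",
}

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_tools(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOL_DESCRIPTIONS,
                    key=lambda t: cosine(q, embed(TOOL_DESCRIPTIONS[t])),
                    reverse=True)
    return ranked[:k]

print(retrieve_tools("search the web for today's news"))
```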

Why do output parsers and structured outputs matter for agent reliability?

Tool calls often start as text from the model. Turning that text into executable code requires parsing. Structured outputs like JSON make parsing easier, but chat models may still add extra language. Output parsers encapsulate parsing logic and can retry by fixing formatting errors—sometimes by providing the model the original output plus the parsing error. The transcript also notes subtler cases, like missing required fields, which require more than simple formatting fixes.
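A retry-and-fix parser can be sketched like this. The "fixer" below is a deliberately naive stand-in for the real pattern the transcript describes, which is showing the model its original output plus the parsing error and asking it to repair the output.

```python
import json

# Sketch of an output parser with retry-and-fix behavior. naive_fixer is
# a hypothetical stand-in for an LLM repair call; here it just strips
# conversational chatter around the JSON payload.

def naive_fixer(bad_output: str, error: str) -> str:
    """Stand-in for asking the model to fix its own output."""
    start, end = bad_output.find("{"), bad_output.rfind("}")
    return bad_output[start:end + 1]

def parse_tool_call(text: str, retries: int = 1) -> dict:
    for _ in range(retries + 1):
        try:
            call = json.loads(text)
            if "tool" not in call:            # subtler failure: missing field
                raise ValueError("missing required field 'tool'")
            return call
        except (json.JSONDecodeError, ValueError) as e:
            text = naive_fixer(text, str(e))  # retry with repaired output
    raise ValueError("could not parse tool call")

print(parse_tool_call('Sure! Here it is: {"tool": "search", "input": "LangChain"}'))
```

Note the missing-field check: as the transcript says, some failures need more than formatting fixes, since the text may parse as valid JSON yet still be an invalid tool call.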

How do modern memory approaches differ from simply keeping a list of prior steps?

Keeping a list of steps (as in early ReAct-style setups) runs into context-window limits and becomes brittle for long tasks. Newer approaches retrieve relevant past events and observations using combinations of time-weighting, importance-weighting, and relevancy-weighting. Some systems also add reflection: after many steps, the agent reviews what happened and updates internal state. The transcript also mentions entity memory (extract entities into a graph) and summary memory (running summaries to reduce context load).
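The weighted retrieval idea (popularized by the Generative Agents work the transcript mentions) can be sketched as a scoring function over stored memories. The decay rate, weight mix, relevance measure, and memory contents below are all illustrative assumptions.

```python
import math

# Sketch of time/importance/relevance-weighted memory retrieval: score
# each stored memory and return the top-k instead of replaying the full
# step log.

MEMORIES = [  # (text, timestamp, importance on a 0-10 scale)
    ("user asked about database schema", 1, 3),
    ("sql query failed: table 'orders' not found", 9, 8),
    ("user said hello", 2, 1),
]

def relevance(query: str, text: str) -> float:
    """Word-overlap stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def score(memory, query: str, now: int = 10) -> float:
    text, ts, importance = memory
    recency = math.exp(-0.5 * (now - ts))   # time-weighting via exponential decay
    return recency + importance / 10 + relevance(query, text)

def retrieve(query: str, k: int = 1):
    return sorted(MEMORIES, key=lambda m: score(m, query), reverse=True)[:k]

print(retrieve("why did the sql query fail"))
```

Reflection would sit on top of this: periodically summarize retrieved memories into higher-level facts and store those back as new memories.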

Review Questions

  1. What failure modes arise when agents must execute many tool calls over long horizons, and what mitigation strategies were mentioned?
  2. How do tool retrieval and few-shot retrieval differ from providing full tool descriptions in the prompt?
  3. Why is evaluating the agent trajectory (actions and inputs) often as important as evaluating the final natural-language answer?

Key Points

  1. Agents treat the language model as a reasoning engine that iteratively selects tools, executes them, observes results, and adapts the next action based on those observations.
  2. Tool usage is central because it grounds answers in external data sources (search, databases, APIs) and helps avoid hallucinations and missing knowledge.
  3. ReAct (“reasoning” + “acting”) is presented as a key prompting strategy that combines structured reasoning with explicit tool calls to improve multi-hop question answering.
  4. Production reliability hinges on correct tool selection, avoiding tool misuse in conversational settings, and robustly converting model outputs into executable tool invocations.
  5. Structured outputs (often JSON) plus modular output parsers with retry-and-fix behavior reduce parsing failures and improve end-to-end reliability.
  6. Long-running agents need memory strategies beyond raw step logs, including retrieval of relevant past events and handling of large tool outputs.
  7. Evaluation should measure both the final result and the agent trajectory—correctness of actions/inputs, step efficiency, and whether the system reaches the goal via valid intermediate steps.

Highlights

Agents replace hard-coded action sequences with adaptive tool-driven loops, where each tool observation reshapes the next decision.
ReAct’s blend of reasoning and acting is positioned as a practical fix for the weaknesses of direct answering, chain-of-thought-only, and action-only prompting.
Reliability problems often come from tool choice (using tools when unnecessary or skipping them when needed) and from brittle parsing of tool-call strings into runnable code.
Memory is evolving from “keep recent steps” into time/importance/relevance-weighted retrieval plus reflection that updates internal state.
Trajectory-focused evaluation can be more informative than only scoring the final natural-language answer.
