CrewAI - Building a Custom Crew
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A custom CrewAI workflow can reliably turn a user-chosen topic into a researched, saved markdown article—but the “process shape” matters. In a sequential setup, the system asks for a topic once, runs web search, drafts an article with a required structure, and saves the result cleanly. Switching to a hierarchical process improves flexibility for multi-step reasoning and comparisons, yet it introduces new failure modes: repeated clarification prompts, more tool calls, and occasional trouble saving the final output.
The build starts with instrumentation. Instead of relying on LangSmith integration, the workflow uses a callback step after each agent action to log outputs to a crew-callback-logs text file with step numbers and the agent name. That logging becomes especially valuable in hierarchical runs, where agents can loop, stall, or drift into the wrong subtask. The callback payload may arrive as an AgentFinish object or a dictionary, so the logging function is designed to capture the full contents for debugging.
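A minimal sketch of such a logging callback, assuming the payload may be either an AgentFinish-style object or a plain dict; the file name, counter, and function name are illustrative, not the video's exact code:

```python
from datetime import datetime

step_counter = {"n": 0}  # simple mutable counter shared across calls

def log_step(agent_output, agent_name="unknown", log_path="crew_callback_logs.txt"):
    """Append one agent step to a text log. Hypothetical helper, not CrewAI API."""
    step_counter["n"] += 1
    # The payload may be an object (capture its attributes) or a plain dict.
    if isinstance(agent_output, dict):
        payload = agent_output
    else:
        payload = getattr(agent_output, "__dict__", {"repr": repr(agent_output)})
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"Step {step_counter['n']} | {agent_name} | {datetime.now().isoformat()}\n")
        f.write(f"{payload}\n\n")
    return step_counter["n"]
```

In CrewAI such a function would typically be wired in via the crew's step-callback hook, with a partial application per agent if per-agent names are needed.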
For external inputs, the system uses the free DuckDuckGo search tool (no API key required). For language generation, it configures GPT-4 Turbo as the model for each agent, an explicit choice meant to control cost compared with default GPT-4 usage. A second tool saves generated content to a markdown file, returning the filename and signaling completion to the user. The workflow also includes “human tools,” enabling the system to ask the user for the research topic at runtime.
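The save tool can be sketched roughly as follows; the filename scheme (date plus a random suffix, matching the sequential run described later) and the function name are assumptions:

```python
import random
import string
from datetime import date

def save_markdown(content: str) -> str:
    """Write article content to a dated markdown file and return its name.
    Hypothetical helper mirroring the video's save tool, not its exact code."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    filename = f"article_{date.today().isoformat()}_{suffix}.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    # Returning the filename lets the agent confirm completion to the user.
    return filename
```

Returning a string (rather than nothing) matters here: the agent uses the tool's return value to report success, which is exactly where the hierarchical run later stumbles.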
Three core agents form the sequential pipeline: a research specialist that performs searches and compiles a report, a writer that converts that report into an article with clear formatting requirements (at least three paragraphs plus bullet-point key facts at the end), and a file archiver that writes the final markdown. Tasks enforce guardrails: the first task requests the topic from the human and demands a comprehensive report on “latest advancements” for that exact subject, discouraging invented topics. The article task specifies structure and output expectations, while the saving task expects a string to write.
When run sequentially, the crew quickly prompts for the topic and successfully handles a concrete example: Jamba, AI21's hybrid state space model plus transformer architecture. It generates search queries including “latest news 2024,” drafts the article, saves it to a markdown file named with the date and a random suffix, and produces a final confirmation.
The hierarchical version keeps the same overall goal but changes the process to hierarchical planning. It fails to reliably ask the human for the topic using the original prompt, so the design adds a dedicated “topic getter” agent that consults the human via human tools, then hands the topic to the research/search agent. This hierarchical flow often asks for clarification multiple times and can branch into extra steps, such as comparisons involving GPT-3, before converging on an article.
However, hierarchical planning also increases operational friction. The system attempts the save tool multiple times, errors out twice, then corrects the tool input and finally saves. The final article still misses some formatting requirements (notably the bullet-point section), and acronyms like SSM may not be expanded early enough—suggesting the need for tighter prompt constraints (e.g., spelling out acronyms on first use) and possibly additional tools to fetch URLs or full page content for stronger citations.
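One way to reduce those save-tool retries is to coerce whatever the planner passes into a plain string before writing; a sketch, where the accepted key names are guesses at common LLM tool-call shapes rather than anything from the video:

```python
def coerce_tool_input(raw) -> str:
    """Normalize an LLM tool call's argument into the string the save tool expects.
    Hypothetical guard; key names like 'content'/'text' are assumptions."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, dict):
        # Planners often wrap the payload in a single-field dict.
        for key in ("content", "text", "article", "input"):
            if key in raw and isinstance(raw[key], str):
                return raw[key]
        return str(raw)
    return str(raw)
```

Calling this at the top of the save tool turns the two hard failures seen in the hierarchical run into silent normalizations.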
Overall, the transcript shows a practical blueprint for building custom multi-agent research-to-article pipelines, while highlighting that hierarchical control can improve depth but demands stronger guardrails, clearer input typing, and more robust tool-handling logic.
Cornell Notes
CrewAI can generate a researched markdown article from a user-provided topic by chaining specialized agents: a topic/research step, an article-writing step, and a file-saving step. In sequential mode, the system asks once for the topic, performs DuckDuckGo searches, drafts an article with required structure, and saves it successfully. Hierarchical mode adds flexibility and can perform extra reasoning steps (including comparisons), but it may ask for clarification repeatedly and can struggle with tool inputs, especially when saving the final output. Logging via callback steps is crucial for diagnosing loops and tool failures in hierarchical runs.
Why add a callback-based logging step instead of relying on LangSmith?
What tools and model choices shape the pipeline’s behavior and cost?
How do tasks prevent the system from researching the wrong thing?
What changed when moving from sequential to hierarchical planning?
What failure mode appears in hierarchical runs around saving output?
What quality gaps remain even when the hierarchical run succeeds?
Review Questions
- In sequential mode, which three agents are responsible for research, writing, and saving, and how do the tasks enforce the article’s required structure?
- What specific design change was needed to make hierarchical planning ask the human for the topic, and why did the original approach fail?
- How does callback logging help diagnose hierarchical issues like loops, clarification repeats, and tool-input errors during saving?
Key Points
1. Use a callback step to log each agent action (step number, agent name, and full payload) so hierarchical runs can be debugged when they loop or drift.
2. Constrain research tasks to the human-provided topic to prevent agents, especially under hierarchical planning, from inventing new subjects.
3. Prefer GPT-4 Turbo for per-agent generation to manage cost compared with default GPT-4 usage.
4. Separate responsibilities into agents: research/report creation, article drafting with explicit formatting rules, and a dedicated markdown-saving tool.
5. In hierarchical mode, add a dedicated topic-getter agent using human tools when the main prompt fails to elicit user input reliably.
6. Expect hierarchical planning to increase tool calls and retries; add stronger typing/format constraints for tool inputs to reduce save-tool errors.
7. After generation, validate output requirements (bullet points, acronym expansions) and tighten prompts to enforce them.
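A post-generation check for the formatting rules in the key points above might look like this; the thresholds (three paragraphs, a bullet-point section) come from the task description in the video, while the function itself is an assumption:

```python
import re

def validate_article(markdown_text: str) -> list[str]:
    """Return a list of unmet requirements (empty list means the article passes).
    Hypothetical validator sketching the checks, not code from the video."""
    problems = []
    # At least three paragraphs, i.e. blocks separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", markdown_text.strip()) if p.strip()]
    if len(paragraphs) < 3:
        problems.append("fewer than three paragraphs")
    # Bullet-point key facts expected somewhere in the article.
    if not any(line.lstrip().startswith(("-", "*")) for line in markdown_text.splitlines()):
        problems.append("missing bullet-point key facts section")
    return problems
```

Running such a check after the writer agent, and feeding any failures back as a revision task, would catch the missing bullet section seen in the hierarchical run.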