CrewAI - Building a Custom Crew
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A custom CrewAI workflow can reliably turn a user-chosen topic into a researched, saved markdown article—but the “process shape” matters. In a sequential setup, the system asks for a topic once, runs web search, drafts an article with a required structure, and saves the result cleanly. Switching to a hierarchical process improves flexibility for multi-step reasoning and comparisons, yet it introduces new failure modes: repeated clarification prompts, more tool calls, and occasional trouble saving the final output.
The build starts with instrumentation. Instead of relying on LangSmith integration, the workflow uses a callback step after each agent action to log outputs to a crew-callback-logs text file with step numbers and the agent name. That logging becomes especially valuable in hierarchical runs, where agents can loop, stall, or drift into the wrong subtask. The callback payload may arrive as an AgentFinish object or a dictionary, so the logging function is designed to capture the full contents for debugging.
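A minimal sketch of such a logging callback, assuming the payload may be either an AgentFinish-style object or a plain dict; the file name, counter, and function name are illustrative, not the video's exact code:

```python
from datetime import datetime

step_counter = {"n": 0}  # simple mutable counter shared across calls

def log_step(agent_output, agent_name="unknown", log_path="crew_callback_logs.txt"):
    """Append one agent step to a text log. Hypothetical helper, not CrewAI API."""
    step_counter["n"] += 1
    # The payload may be an object (capture its attributes) or a plain dict.
    if isinstance(agent_output, dict):
        payload = agent_output
    else:
        payload = getattr(agent_output, "__dict__", {"repr": repr(agent_output)})
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(f"Step {step_counter['n']} | {agent_name} | {datetime.now().isoformat()}\n")
        f.write(f"{payload}\n\n")
    return step_counter["n"]
```

In CrewAI such a function would typically be wired in via the crew's step-callback hook, with a partial application per agent if per-agent names are needed.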
For external inputs, the system uses the free DuckDuckGo search tool (no API key required). For language generation, it configures GPT-4 Turbo as the model for each agent, an explicit choice meant to control cost compared with default GPT-4 usage. A second tool saves generated content to a markdown file, returning the filename and signaling completion to the user. The workflow also includes “human tools,” enabling the system to ask the user for the research topic at runtime.
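The save tool can be sketched roughly as follows; the filename scheme (date plus a random suffix, matching the sequential run described later) and the function name are assumptions:

```python
import random
import string
from datetime import date

def save_markdown(content: str) -> str:
    """Write article content to a dated markdown file and return its name.
    Hypothetical helper mirroring the video's save tool, not its exact code."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    filename = f"article_{date.today().isoformat()}_{suffix}.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)
    # Returning the filename lets the agent confirm completion to the user.
    return filename
```

Returning a string (rather than nothing) matters here: the agent uses the tool's return value to report success, which is exactly where the hierarchical run later stumbles.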
Three core agents form the sequential pipeline: a research specialist that performs searches and compiles a report, a writer that converts that report into an article with clear formatting requirements (at least three paragraphs plus bullet-point key facts at the end), and a file archiver that writes the final markdown. Tasks enforce guardrails: the first task requests the topic from the human and demands a comprehensive report on “latest advancements” for that exact subject, discouraging invented topics. The article task specifies structure and output expectations, while the saving task expects a string to write.
When run sequentially, the crew quickly prompts for the topic and successfully handles a concrete example: Jamba, AI21's hybrid state space model plus transformer architecture. It generates search queries including “latest news 2024,” drafts the article, saves it to a markdown file named with the date and a random suffix, and produces a final confirmation.
The hierarchical version keeps the same overall goal but changes the process to hierarchical planning. It fails to reliably ask the human for the topic using the original prompt, so the design adds a dedicated “topic getter” agent that consults the human via human tools, then hands the topic to the research/search agent. This hierarchical flow often asks for clarification multiple times and can branch into extra steps, such as comparisons involving GPT-3, before converging on an article.
However, hierarchical planning also increases operational friction. The system attempts the save tool multiple times, errors out twice, then corrects the tool input and finally saves. The final article still misses some formatting requirements (notably the bullet-point section), and acronyms like SSM may not be expanded early enough—suggesting the need for tighter prompt constraints (e.g., spelling out acronyms on first use) and possibly additional tools to fetch URLs or full page content for stronger citations.
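One way to reduce those save-tool retries is to coerce whatever the planner passes into a plain string before writing; a sketch, where the accepted key names are guesses at common LLM tool-call shapes rather than anything from the video:

```python
def coerce_tool_input(raw) -> str:
    """Normalize an LLM tool call's argument into the string the save tool expects.
    Hypothetical guard; key names like 'content'/'text' are assumptions."""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, dict):
        # Planners often wrap the payload in a single-field dict.
        for key in ("content", "text", "article", "input"):
            if key in raw and isinstance(raw[key], str):
                return raw[key]
        return str(raw)
    return str(raw)
```

Calling this at the top of the save tool turns the two hard failures seen in the hierarchical run into silent normalizations.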
Overall, the transcript shows a practical blueprint for building custom multi-agent research-to-article pipelines, while highlighting that hierarchical control can improve depth but demands stronger guardrails, clearer input typing, and more robust tool-handling logic.
Cornell Notes
CrewAI can generate a researched markdown article from a user-provided topic by chaining specialized agents: a topic/research step, an article-writing step, and a file-saving step. In sequential mode, the system asks once for the topic, performs DuckDuckGo searches, drafts an article with required structure, and saves it successfully. Hierarchical mode adds flexibility and can perform extra reasoning steps (including comparisons), but it may ask for clarification repeatedly and can struggle with tool inputs, especially when saving the final output. Logging via callback steps is crucial for diagnosing loops and tool failures in hierarchical runs.
Why add a callback-based logging step instead of relying on LangSmith?
What tools and model choices shape the pipeline’s behavior and cost?
How do tasks prevent the system from researching the wrong thing?
What changed when moving from sequential to hierarchical planning?
What failure mode appears in hierarchical runs around saving output?
What quality gaps remain even when the hierarchical run succeeds?
Review Questions
- In sequential mode, which three agents are responsible for research, writing, and saving, and how do the tasks enforce the article’s required structure?
- What specific design change was needed to make hierarchical planning ask the human for the topic, and why did the original approach fail?
- How does callback logging help diagnose hierarchical issues like loops, clarification repeats, and tool-input errors during saving?
Key Points
1. Use a callback step to log each agent action (step number, agent name, and full payload) so hierarchical runs can be debugged when they loop or drift.
2. Constrain research tasks to the human-provided topic to prevent agents, especially under hierarchical planning, from inventing new subjects.
3. Prefer GPT-4 Turbo for per-agent generation to manage cost compared with default GPT-4 usage.
4. Separate responsibilities into agents: research/report creation, article drafting with explicit formatting rules, and a dedicated markdown-saving tool.
5. In hierarchical mode, add a dedicated topic-getter agent using human tools when the main prompt fails to elicit user input reliably.
6. Expect hierarchical planning to increase tool calls and retries; add stronger typing/format constraints for tool inputs to reduce save-tool errors.
7. After generation, validate output requirements (bullet points, acronym expansions) and tighten prompts to enforce them.
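A post-generation check for the formatting rules in the key points above might look like this; the thresholds (three paragraphs, a bullet-point section) come from the task description in the video, while the function itself is an assumption:

```python
import re

def validate_article(markdown_text: str) -> list[str]:
    """Return a list of unmet requirements (empty list means the article passes).
    Hypothetical validator sketching the checks, not code from the video."""
    problems = []
    # At least three paragraphs, i.e. blocks separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", markdown_text.strip()) if p.strip()]
    if len(paragraphs) < 3:
        problems.append("fewer than three paragraphs")
    # Bullet-point key facts expected somewhere in the article.
    if not any(line.lstrip().startswith(("-", "*")) for line in markdown_text.splitlines()):
        problems.append("missing bullet-point key facts section")
    return problems
```

Running such a check after the writer agent, and feeding any failures back as a revision task, would catch the missing bullet section seen in the hierarchical run.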