
Livecoding: ICE and the Factored Cognition Primer by Ought

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Iterated decomposition improves LLM reliability by evaluating smaller subtasks separately, then using failing-subtask inspection to decide whether fixes belong in prompts, retrieval, code flow, or decomposition.

Briefing

High-reliability language-model apps increasingly come from breaking tasks into smaller, testable steps—then using execution traces to pinpoint exactly which sub-step fails. In this live coding walkthrough of Ought’s “Factored Cognition Primer,” the central idea is iterated decomposition: split a complex goal (like question answering) into meaningful subtasks, evaluate each piece with “gold standards” (human answers, stronger models, or datasets), and then inspect failing subtasks to decide whether the fix belongs in prompting, code flow, retrieval, or even the decomposition itself. That approach boosts reliability compared with end-to-end prompting, and it mirrors how mature software teams build systems: modular components, targeted evaluation, and a feedback loop that improves over time.
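The evaluate-each-subtask idea can be sketched in a few lines. This is a toy illustration under stated assumptions: the subtask functions (`select_context`, `answer_from_context`) and the gold-standard examples are hypothetical stand-ins for real model calls and datasets, chosen only to show how scoring each component separately localizes failures.

```python
# Toy sketch of iterated decomposition: score each subtask against its own
# gold standard so a failure points at one component, not the whole pipeline.
# All functions and data below are hypothetical stand-ins.

def select_context(question, documents):
    # Subtask 1: pick the document with the most word overlap (stub heuristic).
    q_words = set(question.lower().strip("?").split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def answer_from_context(question, context):
    # Subtask 2: extract an answer from the chosen context (stubbed).
    return context.split(":")[-1].strip()

def evaluate_subtasks(examples):
    # Each subtask is evaluated independently against its own gold standard.
    scores = {"selection": 0, "answering": 0}
    for ex in examples:
        if select_context(ex["question"], ex["documents"]) == ex["gold_context"]:
            scores["selection"] += 1
        if answer_from_context(ex["question"], ex["gold_context"]) == ex["gold_answer"]:
            scores["answering"] += 1
    return scores

examples = [{
    "question": "When was the launch?",
    "documents": ["launch date: 2021", "weather report: sunny"],
    "gold_context": "launch date: 2021",
    "gold_answer": "2021",
}]
print(evaluate_subtasks(examples))
```

If selection scores poorly but answering scores well, the fix belongs in retrieval; the reverse points at prompting, which is exactly the attribution end-to-end evaluation can't give you.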

The walkthrough then demonstrates the tooling behind that workflow: ICE (Interactive Composition Explorer). Starting from a “hello world” async Python recipe, ICE traces only async functions—so normal synchronous helper code doesn’t clutter the trace. The UI renders a tree/DAG of calls, showing inputs and outputs per async node, and lets developers drill into the exact prompt text and parameters sent to model calls. That trace-first design becomes the debugging backbone once language models and external services enter the picture.

The primer’s first real example is question answering without external context. A prompt is built using the fvalues library’s F helper (a traceable f-string wrapper), and the trace shows the full prompt sent in the OpenAI agent completion call. The walkthrough highlights practical prompt hygiene: stripping trailing whitespace matters because tokenizers treat leading spaces differently, and the example uses a stop sequence trick (a closing quote) to prevent completion models from drifting into generating extra, unintended text.
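Both hygiene points are easy to show with plain strings; the model call itself is omitted. This is a sketch of the general technique, not the primer's exact prompt text.

```python
# Two prompt-hygiene tricks from the walkthrough, shown with plain strings.

def clean_prompt(prompt: str) -> str:
    # Trailing whitespace matters: tokenizers fold a leading space into the
    # next token, so "A: " + "Paris" tokenizes differently from "A:" + " Paris".
    return prompt.rstrip()

def answer_with_stop(completion: str, stop: str = '"') -> str:
    # The prompt opens a quote for the answer, so the model's closing quote
    # becomes a reliable stop sequence that prevents drift into extra text.
    return completion.split(stop, 1)[0]

prompt = clean_prompt('Q: What is the capital of France?\nA: ')
print(repr(prompt))                                    # no trailing space
print(answer_with_stop('Paris."\nQ: Next question'))   # -> Paris.
```

In a real completion API call, the stop string would be passed as the `stop` parameter rather than applied after the fact; the truncation above just makes the mechanism visible.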

Next comes question answering with context. The system feeds a background snippet (e.g., an event date) and asks a question about a different date to test whether the model correctly reasons about relevance. When the model misbehaves—answering based on the background even when it shouldn’t—the trace confirms the intended context was actually passed, shifting suspicion from “coding mistake” to “model reasoning and prompt design.” The walkthrough then suggests that better relevance handling may require prompt engineering (e.g., instructing the model to verify context relevance) or few-shot examples.

The session also showcases iterative improvement via a “fixer prompt” loop: generate a long, step-by-step answer, then repeatedly ask a second pass to shorten and tighten it until the output converges. ICE makes the cost and iteration count visible, showing how many times the fixer runs before stabilization.
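The fixer loop's control flow is simple: call the fixer, compare against the previous output, and stop when nothing changes. The fixer below is a deterministic stub standing in for a second model call, so the convergence behavior is visible without an API key.

```python
# Sketch of the fixer loop: repeatedly shorten the answer until it stops
# changing. The fixer is a deterministic stub in place of a model call.

def fixer(text: str) -> str:
    # Stand-in for a "make this shorter" prompt: drop the last sentence
    # until only one remains.
    sentences = [s for s in text.split(". ") if s]
    return ". ".join(sentences[:-1]) if len(sentences) > 1 else text

def fix_until_converged(text: str, max_iters: int = 10):
    iterations = 0
    for _ in range(max_iters):
        shorter = fixer(text)
        if shorter == text:     # convergence: output no longer changes
            break
        text = shorter
        iterations += 1
    return text, iterations     # the count ICE makes visible per trace

final, n = fix_until_converged("Step one. Step two. Step three. Answer: 42.")
print(final, n)                 # -> Step one 3
```

Exact string equality is the crudest stopping criterion; as the walkthrough notes, a similarity threshold (e.g. embedding distance) could stop earlier and save model calls.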

To demonstrate more advanced control, the walkthrough builds a debate workflow where two agents take opposing positions (Alice vs. Bob) for a fixed number of turns, and a judge agent decides the winner. Tracing reveals token/cost growth as the debate context expands, and it even shows how the judge’s final answer can change after seeing the debate.
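The debate's structure, alternating turns over a shared transcript followed by a judge call, can be sketched with stub agents. Everything below is a hypothetical stand-in: real agents would prompt a model with the transcript so far, which is exactly why context (and cost) grows each turn.

```python
# Sketch of the debate workflow: two agents alternate over a shared
# transcript, then a judge reads the whole debate. Agents are stubs.

def make_agent(name: str, position: str):
    def agent(transcript):
        # A real agent would send the growing transcript to a model,
        # which is why token cost increases every turn.
        return f"{name} argues {position} (turn {len(transcript) + 1})"
    return agent

def debate(alice, bob, turns: int = 4):
    transcript = []
    for i in range(turns):
        speaker = alice if i % 2 == 0 else bob
        transcript.append(speaker(transcript))
    return transcript

def judge(transcript):
    # Stub verdict; a real judge agent reads the full debate and may give
    # a different answer than it would have before seeing it.
    return "Alice" if len(transcript) % 2 == 1 else "Bob"

alice = make_agent("Alice", "for")
bob = make_agent("Bob", "against")
transcript = debate(alice, bob, turns=4)
print(judge(transcript))
```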

Finally, the walkthrough moves into long-text and tool use. For long documents, it motivates decomposition into “find relevant sections” plus “answer using those sections” to fit within context windows. For tool use, it implements web search via SerpAPI: the model chooses a query, retrieves results, renders them into a prompt, and then answers using those snippets. Tracing exposes failures like invalid API keys (previously hidden by error-swallowing) and incorrect answers caused by snippet quality. The session then attempts an exercise to fetch and include the first web page’s content, running into practical web constraints (403/forbidden) and HTML parsing challenges—again underscoring why traceable, modular workflows matter.
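The error-swallowing failure mode is worth seeing concretely. In the sketch below, `search_api` is a hypothetical stand-in for a SerpAPI call: the swallowing variant turns an invalid key into an innocent-looking empty result, while the traced variant records why the call failed.

```python
# Sketch of the tool-use failure mode: swallowing the search API's error
# makes an invalid key indistinguishable from "no results found".
# `search_api` is a hypothetical stand-in for a real SerpAPI call.

def search_api(query: str, api_key: str):
    if api_key != "valid-key":
        raise PermissionError("invalid API key")
    return [{"snippet": f"result for {query}"}]

def search_swallowing(query, api_key):
    try:
        return search_api(query, api_key)
    except Exception:
        return []                   # failure is invisible to the caller

def search_traced(query, api_key, trace):
    try:
        return search_api(query, api_key)
    except Exception as exc:
        trace.append(repr(exc))     # what ICE-style tracing makes visible
        return []

trace = []
print(search_swallowing("ice ought", "bad-key"))     # [] -- silent failure
print(search_traced("ice ought", "bad-key", trace))  # [] but cause recorded
print(trace)
```

Both variants return the same value, so only the recorded trace distinguishes "the web had nothing" from "the API key is wrong", which is the distinction the walkthrough needed.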

Overall, the walkthrough frames ICE not as a novelty, but as a development instrument: it turns LLM behavior into inspectable artifacts—prompts, tool calls, intermediate outputs, and iteration counts—so teams can systematically improve reliability rather than guessing where things went wrong.

Cornell Notes

Ought’s “Factored Cognition Primer” argues that higher-reliability LLM applications come from iterated decomposition: split a hard task into smaller subtasks, evaluate each subtask with clear benchmarks, and use trace-based inspection of failures to decide whether fixes belong in prompting, retrieval, or further decomposition. The walkthrough demonstrates ICE (Interactive Composition Explorer), which traces only async functions and renders a call tree with per-node inputs/outputs, making it easier to debug what prompts and tool results actually reached the model. Examples progress from simple QA to context-based QA (where relevance mistakes can be diagnosed via traces), to iterative “fixer” loops that shorten answers until convergence, to debate workflows with a judge agent. Tool use with SerpAPI shows how tracing reveals hidden external-service errors and how snippet quality affects final answers.

Why does iterated decomposition improve reliability compared with end-to-end prompting?

The workflow splits a complex goal (like answering a question) into smaller, meaningful subtasks that can be evaluated independently. Instead of judging the whole system at once—where failures are hard to localize—each subtask gets its own “gold standard” evaluation (human outputs, stronger-model outputs, or datasets). When a subtask fails, developers can inspect that specific failure mode and apply targeted fixes: prompt changes, code/control-flow changes, retrieval adjustments in RAG, or even another round of decomposition if the chosen subtasks are still too hard.

What does ICE trace, and why does that matter for debugging?

ICE traces async functions inside a recipe entry point (recipe.main). Synchronous helper code doesn’t appear as trace nodes, which keeps the trace readable. When language-model calls and external API calls are async, ICE records their inputs and outputs and shows them in a tree/DAG structure. That makes it possible to verify whether the intended prompt/context actually reached the model, and to see intermediate tool results (e.g., SerpAPI responses) that would otherwise be hidden.
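The async-only boundary can be mimicked with a few lines of stdlib code. This is a toy illustration of the shape of that behavior, not ICE's actual implementation: a decorator records call/return events for async functions, while untouched synchronous helpers never show up in the trace.

```python
# Toy async-only tracer: records inputs/outputs of wrapped async functions;
# plain synchronous helpers are never wrapped, so they stay out of the trace.
# This mimics the shape of ICE's behavior, not its implementation.
import asyncio

TRACE = []

def traced(fn):
    async def wrapper(*args):
        TRACE.append(("call", fn.__name__, args))
        result = await fn(*args)
        TRACE.append(("return", fn.__name__, result))
        return result
    return wrapper

def normalize(text):          # sync helper: invisible to the trace
    return text.strip().lower()

@traced
async def answer(question):
    return f"answer to {normalize(question)}"

@traced
async def main():
    return await answer("  What is ICE? ")

asyncio.run(main())
for event in TRACE:
    print(event)              # nested call/return pairs: main -> answer
```

The interleaved call/return events are what a UI can render as a tree: `main` opens, `answer` opens and closes inside it, and `normalize` never appears, which is exactly why the trace stays readable.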

How do prompt details like trailing whitespace and stop sequences affect model behavior?

Trailing whitespace can change tokenization behavior, and many tokenizers treat leading spaces as part of the token identity. The walkthrough emphasizes stripping trailing whitespace from prompts to avoid generating awkward first tokens. It also uses a stop sequence trick in completion-style prompting: the prompt wraps the question in quotes and the generation is encouraged to start the answer with a quote, so a closing quote becomes a reliable stopping point that prevents the model from continuing into extra, unintended content.

What went wrong in context-based QA, and how did traces help?

A background snippet contained information about one date, while the question asked about a different date. The model answered as if the background were relevant, producing an incorrect response. The trace helped rule out a common developer error (forgetting to pass the intended context) because the recorded model inputs showed the exact background text that was sent. That shifted the fix toward prompt design (e.g., instructing the model to check relevance) or better decomposition rather than debugging the code path.

How does the “fixer prompt” loop work, and what does convergence look like in practice?

The system first generates a longer, step-by-step answer, then repeatedly calls a fixer prompt that asks for a shorter, improved version. A loop continues until the model’s output stops changing enough (the walkthrough compares strings/edits and notes that smarter stopping criteria like embedding distance could reduce calls). ICE reveals how many iterations were needed before convergence and how the output length and wording change across passes.

Why does tool use require careful handling of external errors and result formatting?

When SerpAPI returned an error (e.g., invalid API key), the code path returned “no results found” instead of crashing. Without tracing, that failure could be invisible because errors were effectively swallowed. ICE exposed the intermediate API error in the trace. Even with successful retrieval, the model’s answer quality depends on snippet quality and how results are rendered into the prompt; incorrect or incomplete snippets can lead to wrong answers.

Review Questions

  1. How would you design subtask evaluation for a QA system so that failures are attributable to retrieval vs. reasoning?
  2. What specific trace evidence would convince you that a context-passing bug is not the cause of a model’s wrong answer?
  3. In a tool-augmented workflow, what are two distinct ways tracing can prevent silent failures from reaching end users?

Key Points

  1. Iterated decomposition improves LLM reliability by evaluating smaller subtasks separately, then using failing-subtask inspection to decide whether fixes belong in prompts, retrieval, code flow, or decomposition.

  2. ICE traces only async functions, producing a readable call tree with per-node inputs/outputs—making it easier to verify what prompts and tool results actually reached the model.

  3. Prompt hygiene matters: stripping trailing whitespace avoids tokenization quirks, and stop sequences can prevent completion models from generating extra unintended content.

  4. Context-based QA failures can be diagnosed by traces that confirm the exact background text sent to the model, shifting debugging toward relevance reasoning and prompt design.

  5. Iterative improvement (fixer loops) can shorten and refine outputs over multiple passes; tracing makes iteration count and cost visible so convergence behavior can be measured.

  6. Debate workflows (Alice vs. Bob plus a judge) can change final answers after seeing opposing arguments, but token/cost grows as debate context accumulates.

  7. Tool use with web search requires robust error visibility and careful result rendering; tracing helps catch hidden API failures and explains why snippet quality drives answer accuracy.

Highlights

ICE turns LLM debugging into artifact inspection: prompts, parameters, tool calls, and intermediate outputs are visible per async node.
Breaking QA into “relevant context selection” plus “answer generation” is positioned as the key strategy for long-text problems that exceed context windows.
Stop-sequence prompting with quotes is used to control completion-style models and prevent them from drifting into extra text.
The fixer-prompt loop behaves like iterative denoising: multiple passes reduce verbosity until outputs converge, and ICE shows how many iterations were needed.
Tracing reveals silent external-service failures (like invalid API keys) that would otherwise be swallowed and misdiagnosed as model behavior.

Topics

  • Iterated Decomposition
  • ICE Tracing
  • Prompt Hygiene
  • Tool Use
  • Long-Text QA

Mentioned

  • ICE
  • RAG
  • API