Was I Wrong About AI Agents? | INSANE OpenAI-o1 Planning Capabilities
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
o1’s strength in this workflow is producing a precise, sequential JSON plan that maps each step to the correct specialized agent.
Briefing
OpenAI o1’s planning ability is the turning point: it can take a long, multi-step instruction list and reliably produce a working, end-to-end result by delegating tasks to specialized agents (image generation, code writing/execution, and file management). Instead of derailing mid-sequence—as earlier agent attempts often did—o1 generates a sequential JSON plan, passes the right prompts to the right tools, carries forward outputs as context, and produces artifacts that actually run (saved images and HTML pages).
The clearest proof comes from a “15-instruction” build that ends with a functioning 1950s-style art gallery webpage. The task mixes arithmetic, media generation, and layout requirements: generate an image of three hamburgers styled like Leonardo da Vinci’s most famous work in France’s most famous museum; compute a “hamburger count” by dividing a Titanic-associated number by 450 and rounding down; create a second image from the top of the tallest building in New York; and then generate HTML that displays both images side-by-side with 1950s website styling and strong readability. o1 outputs a step-by-step plan in JSON, then orchestrates calls to an image agent (using Flux via Replicate), a file agent (download/save/read functions), and a code agent (writes and executes code using a code interpreter tool). The resulting HTML (“artpiece HTML”) loads in a live server and shows the intended gallery layout with the generated images.
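The video doesn't show the plan schema verbatim, but the kind of sequential JSON plan described can be sketched as follows. The field names (`step`, `agent`, `prompt`, `depends_on`) and the use of 1912 (the year the Titanic sank) as the "Titanic-associated number" are assumptions for illustration:

```python
import json
import math

# A hand-written sketch of the kind of sequential plan the master agent emits.
# Field names ("step", "agent", "prompt", "depends_on") are assumptions; the
# video does not show the exact schema.
plan_json = """
[
  {"step": 1, "agent": "image_agent",
   "prompt": "Three hamburgers styled like the Mona Lisa, displayed in the Louvre"},
  {"step": 2, "agent": "code_agent",
   "prompt": "Divide the Titanic-associated number by 450 and round down"},
  {"step": 3, "agent": "image_agent",
   "prompt": "View from the top of the tallest building in New York"},
  {"step": 4, "agent": "code_agent",
   "prompt": "Write HTML showing both images side-by-side with 1950s styling",
   "depends_on": [1, 2, 3]},
  {"step": 5, "agent": "file_agent",
   "prompt": "Save the HTML as artpiece.html", "depends_on": [4]}
]
"""

plan = json.loads(plan_json)

# The "hamburger count" step is plain floor division; assuming 1912
# (the year the Titanic sank) as the Titanic-associated number:
hamburger_count = math.floor(1912 / 450)
print(hamburger_count)  # 4
```

Note how step 4 depends on all three earlier outputs: the plan itself encodes which context must be carried forward.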
That same orchestration pattern scales to a new “weather agent” added to the system. Using the OpenWeatherMap API, the weather agent fetches a three-day forecast for Oslo, then the master planner routes the forecast into the image agent to create weather-representative images with one-word text overlays. The file agent downloads the images and writes a Markdown report (“weather report”) summarizing each day’s conditions and temperatures alongside the images. The workflow succeeds on the first attempt, producing a report for September 29th and the following days with clear-sky, overcast, and light-rain visuals.
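A minimal sketch of the weather agent's fetch-and-summarize step. The endpoint matches OpenWeatherMap's real 5-day/3-hour forecast API (`/data/2.5/forecast`), but `summarize_forecast`, the noon-sampling heuristic, and the sample dates are assumptions for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.openweathermap.org/data/2.5/forecast"

def fetch_forecast(city: str, api_key: str) -> dict:
    """Fetch the 3-hourly forecast payload for a city (network call)."""
    url = API_URL + "?" + urlencode({"q": city, "appid": api_key, "units": "metric"})
    with urlopen(url) as resp:
        return json.load(resp)

def summarize_forecast(payload: dict, days: int = 3) -> list[dict]:
    """Reduce the API's 3-hourly entries to one midday summary per day."""
    summaries = []
    for entry in payload["list"]:
        date, time = entry["dt_txt"].split(" ")
        if time == "12:00:00":  # sample each day at noon
            summaries.append({
                "date": date,
                "temp_c": entry["main"]["temp"],
                "conditions": entry["weather"][0]["description"],
            })
        if len(summaries) == days:
            break
    return summaries

# Synthetic payload in the API's shape (dates/temps invented), so the
# summarizer can be demonstrated offline:
sample = {"list": [
    {"dt_txt": "2024-09-29 12:00:00", "main": {"temp": 14.2},
     "weather": [{"description": "clear sky"}]},
    {"dt_txt": "2024-09-30 12:00:00", "main": {"temp": 11.8},
     "weather": [{"description": "overcast clouds"}]},
    {"dt_txt": "2024-10-01 12:00:00", "main": {"temp": 9.5},
     "weather": [{"description": "light rain"}]},
]}
print(summarize_forecast(sample))
```

Each per-day summary is exactly the kind of compact output the planner can hand to the image agent as a prompt and to the file agent as report content.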
A final stress test pushes the framework into more chaotic, knowledge-dependent territory: compute the square root of nine to determine “spider pigs,” identify the first episode air date of The Simpsons (via Wikipedia), generate images of Homer and Bart Simpson in weather tied to a forecast for Springfield USA, and update an HTML site (“simpson.html”) with the images and facts. Some steps wobble—there are small mismatches in image details and a couple of knowledge retrieval hiccups—but the system still completes a coherent HTML page with the episode date and weather-based content.
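Two of these “chaotic” steps are mechanically simple once routed to the right agent; a sketch, with the Wikipedia endpoint shown only as a comment (it is a real REST endpoint, but not called here):

```python
import math

# The "spider pigs" step is just an exact integer square root:
spider_pigs = math.isqrt(9)
print(spider_pigs)  # 3

# The air-date step is a knowledge lookup the planner routes to a retrieval
# tool; Wikipedia's REST summary endpoint is one real way to fetch it:
#   https://en.wikipedia.org/api/rest_v1/page/summary/The_Simpsons
# (The Simpsons' first episode aired December 17, 1989.)
```

The point is that planning separates the trivial arithmetic from the retrieval-dependent step, so a wrong fact from retrieval does not derail the rest of the sequence.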
Beyond results, the transcript emphasizes how the architecture works: a master agent produces a precise sequential plan; each sub-agent has strict responsibilities (image generation vs. file operations vs. code execution); and context from earlier steps is stored and reused so later steps don’t guess. The creator’s takeaway is that o1’s planning quality reduces the classic failure mode of agents losing the thread, making agentic workflows feel practical enough to expand into new API integrations and more ambitious projects.
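The master-agent loop described above can be sketched in a few lines: execute the plan in order, store each step's output in a shared context, and substitute earlier outputs into later prompts. The agent stand-ins, the `{step_N}` placeholder convention, and the demo plan are assumptions for illustration, not the creator's actual code:

```python
def image_agent(prompt):   # stand-in for Flux via Replicate
    return f"https://example.com/{abs(hash(prompt)) % 1000}.png"

def code_agent(prompt):    # stand-in for code-interpreter execution
    return f"<!-- html generated for: {prompt} -->"

def file_agent(prompt):    # stand-in for download/save/read functions
    return f"saved:{prompt}"

AGENTS = {"image_agent": image_agent, "code_agent": code_agent,
          "file_agent": file_agent}

def run_plan(plan):
    context = {}                            # outputs keyed by step number
    for step in plan:
        # Inject earlier outputs so later steps never guess:
        prompt = step["prompt"].format(**{
            f"step_{n}": context[n] for n in step.get("depends_on", [])
        })
        context[step["step"]] = AGENTS[step["agent"]](prompt)
    return context

demo_plan = [
    {"step": 1, "agent": "image_agent", "prompt": "three hamburgers"},
    {"step": 2, "agent": "code_agent",
     "prompt": "embed {step_1} in a 1950s-styled page", "depends_on": [1]},
    {"step": 3, "agent": "file_agent",
     "prompt": "write artpiece.html from {step_2}", "depends_on": [2]},
]
context = run_plan(demo_plan)
print(context[3])
```

The dictionary of stored outputs is doing the work the transcript calls “context passing”: step 3 sees the real HTML from step 2, which itself saw the real image URL from step 1.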
Cornell Notes
o1 can turn a long, instruction-heavy task into a working sequence by generating a precise JSON plan and orchestrating specialized agents. In one example, it builds a 1950s art gallery webpage by combining arithmetic, image generation (Flux via Replicate), file operations (save/download), and code generation/execution (code interpreter). The system succeeds because outputs from each step are stored as context and fed into later steps, preventing many mid-sequence failures. The framework is also extensible: adding a weather agent that calls OpenWeatherMap enables automatic forecast-to-image generation and a Markdown weather report. Even in a more chaotic Simpsons-themed challenge, the approach still produces a coherent HTML result despite a few minor mismatches.
- What makes o1’s agent setup different from earlier “agent” attempts described here?
- How does the system handle a multi-step task like the “artpiece” gallery build?
- What roles do the specialized agents play, and why does that matter?
- How was the weather capability added, and what did it produce?
- What knowledge-dependent tasks were attempted in the Simpsons challenge, and how did the system respond?
- Why is context passing described as essential in this architecture?
Review Questions
- In the art gallery example, which specific outputs must be carried forward as context to make later steps work (e.g., image URLs, computed numbers, or generated HTML)?
- How does the strict division of labor between image, code, and file agents reduce planning errors compared with a single all-purpose agent?
- What kinds of failures still occurred in the Simpsons challenge, and what does that suggest about limits of planning vs. knowledge accuracy?
Key Points
1. o1’s strength in this workflow is producing a precise, sequential JSON plan that maps each step to the correct specialized agent.
2. The system succeeds when each agent has narrow responsibilities: image generation, file operations, or code writing/execution.
3. Context passing is central: outputs like image URLs and computed values are stored and reused so later steps don’t guess.
4. Using a code interpreter tool for code execution avoids building a custom execution pipeline and helps keep the workflow end-to-end.
5. The architecture is extensible: adding a weather agent that calls OpenWeatherMap enables forecast-to-image generation and automatic report writing.
6. Even with strong planning, knowledge-dependent tasks can still produce small mismatches, showing that planning quality doesn’t eliminate all factual or media-detail errors.