
Was I Wrong About AI Agents? | INSANE OpenAI-o1 Planning Capabilities

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

o1’s strength in this workflow is producing a precise, sequential JSON plan that maps each step to the correct specialized agent.

Briefing

OpenAI o1’s planning ability is the turning point: it can take a long, multi-step instruction list and reliably produce a working, end-to-end result by delegating tasks to specialized agents (image generation, code writing/execution, and file management). Instead of derailing mid-sequence—as earlier agent attempts often did—o1 generates a sequential JSON plan, passes the right prompts to the right tools, carries forward outputs as context, and produces artifacts that actually run (saved images and HTML pages).
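
The transcript doesn’t show the exact plan schema, but based on the description (ordered steps, each mapped to one agent, with dependencies on earlier outputs) it plausibly resembles the sketch below. The field names and step wording are hypothetical, written here as the Python equivalent of the JSON the planner emits:

```python
# Hypothetical shape of the planner's sequential JSON plan, expressed as the
# equivalent Python structure. Field names are illustrative, not from the video.
plan = [
    {"step": 1, "agent": "image_agent",
     "task": "Generate the hamburger painting in the museum setting"},
    {"step": 2, "agent": "file_agent",
     "task": "Download the image from step 1 and save it locally",
     "depends_on": [1]},
    {"step": 3, "agent": "code_agent",
     "task": "Write and execute HTML/CSS that shows both images side by side "
             "with 1950s styling",
     "depends_on": [2]},
]
```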

The clearest proof comes from a “15-instruction” build that ends with a functioning 1950s-style art gallery webpage. The task mixes arithmetic, media generation, and layout requirements: generate an image of three hamburgers styled like Leonardo da Vinci’s most famous work in France’s most famous museum; compute a “hamburger count” by dividing a Titanic-associated number by 450 and rounding down; create a second image from the top of the tallest building in New York; and then generate HTML that displays both images side by side with 1950s website styling and strong readability. o1 outputs a step-by-step plan in JSON, then orchestrates calls to an image agent (using Flux via Replicate), a file agent (download/save/read functions), and a code agent (which writes and executes code using a code interpreter tool). The resulting page (“artpiece HTML”) opens in a live server and shows the intended gallery layout with the generated images.
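
The “hamburger count” step is plain floor division. Since the riddle’s Titanic-associated number isn’t named directly in this summary, it is left as an input in this one-liner:

```python
def hamburger_count(titanic_number: int) -> int:
    # "Divide by 450 and round down," per the instruction list.
    return titanic_number // 450
```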

That same orchestration pattern scales to a new “weather agent” added to the system. Using the OpenWeatherMap API, the weather agent fetches a three-day forecast for Oslo; the master planner then routes the forecast into the image agent to create weather-representative images with one-word text overlays. The file agent downloads the images and writes a Markdown report (“weather report”) summarizing each day’s conditions and temperatures alongside the images. The run succeeds on the first attempt, producing a report for September 29th and subsequent days with clear-sky, overcast, and light-rain visuals.

A final stress test pushes the framework into more chaotic, knowledge-dependent territory: compute the square root of nine to determine “spider pigs,” identify the first episode air date of The Simpsons (via Wikipedia), generate images of Homer and Bart Simpson in weather tied to a forecast for Springfield USA, and update an HTML site (“simpson.html”) with the images and facts. Some steps wobble—there are small mismatches in image details and a couple of knowledge retrieval hiccups—but the system still completes a coherent HTML page with the episode date and weather-based content.

Beyond results, the transcript emphasizes how the architecture works: a master agent produces a precise sequential plan; each sub-agent has strict responsibilities (image generation vs. file operations vs. code execution); and context from earlier steps is stored and reused so later steps don’t guess. The creator’s takeaway is that o1’s planning quality reduces the classic failure mode of agents losing the thread, making agentic workflows feel practical enough to expand into new API integrations and more ambitious projects.

Cornell Notes

o1 can turn a long, instruction-heavy task into a working sequence by generating a precise JSON plan and orchestrating specialized agents. In one example, it builds a 1950s art gallery webpage by combining arithmetic, image generation (Flux via Replicate), file operations (save/download), and code generation/execution (code interpreter). The system succeeds because outputs from each step are stored as context and fed into later steps, preventing many mid-sequence failures. The framework is also extensible: adding a weather agent that calls OpenWeatherMap enables automatic forecast-to-image generation and a Markdown weather report. Even in a more chaotic Simpsons-themed challenge, the approach still produces a coherent HTML result despite a few minor mismatches.

What makes o1’s agent setup different from earlier “agent” attempts described here?

The key change is planning quality. o1 produces a sequential JSON plan that maps each subtask to the correct specialized agent (image, code, or file). That plan is then executed step-by-step while preserving context outputs for dependent steps. Earlier attempts reportedly failed somewhere in the planning sequence; here, the plan stays coherent long enough to generate images, save them, write HTML, and run the code successfully.
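
A minimal sketch of that execute-while-preserving-context loop, assuming an agent registry and a plan shaped like the earlier example (none of these names come from the video):

```python
def run_plan(plan: list[dict], agents: dict) -> dict:
    """Run each step in order, storing results so dependent steps
    can reuse them instead of guessing (assumed structure)."""
    context = {}
    for step in plan:
        agent = agents[step["agent"]]  # image, file, or code agent
        # Gather the outputs of the steps this one depends on.
        inputs = {dep: context[dep] for dep in step.get("depends_on", [])}
        context[step["step"]] = agent.run(step["task"], inputs)
    return context
```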

How does the system handle a multi-step task like the “artpiece” gallery build?

The master agent breaks the 15 instructions into ordered steps: first call the image agent to generate the hamburgers image; then use the file agent to download and save it; next generate HTML via the code agent; generate a second image (the New York skyline from the top of its tallest building) and compute the “hamburger count” via the specified Titanic/450 rule; then download assets and modify the HTML so both images appear side by side with 1950s styling. The final HTML is saved as “artpiece HTML” and runs in a live server.

What roles do the specialized agents play, and why does that matter?

Responsibilities are separated: the image agent generates images from prompts but doesn’t save them; the file agent reads/writes files and downloads images but doesn’t generate images; the code agent writes and executes code (using a code interpreter tool). This separation forces the plan to be explicit about what happens when, and it reduces ambiguity that can cause failures.
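
One way to encode those boundaries is to give each agent a deliberately narrow interface. This outline is an assumption about the shape of the system, not the video’s actual code:

```python
class ImageAgent:
    def run(self, prompt: str, inputs: dict) -> str:
        """Generate an image from the prompt and return its URL.
        Never writes to disk; saving is the file agent's job."""
        ...

class FileAgent:
    def run(self, task: str, inputs: dict) -> str:
        """Download, save, or read files (e.g. an image URL produced
        by a previous step). Never generates content itself."""
        ...

class CodeAgent:
    def run(self, spec: str, inputs: dict) -> str:
        """Write code for the spec and execute it through a code
        interpreter tool, returning the output."""
        ...
```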

How was the weather capability added, and what did it produce?

A new weather agent was added that calls the OpenWeatherMap API to fetch a three-day forecast for a specified location (Oslo in the example). The forecast text is then fed into the image agent to create three images representing the weather, with one-word text overlays. The file agent downloads the images and writes a Markdown report (“weather report”) that lists dates, conditions, temperatures, and embeds the images.
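
The video doesn’t show which OpenWeatherMap endpoint was called; the free 5-day/3-hour forecast endpoint is one plausible choice, condensed to three days as in this sketch (the API key and the midday-sampling rule are assumptions):

```python
import requests

def three_day_forecast(city: str, api_key: str) -> dict:
    """Fetch OpenWeatherMap's 5-day/3-hour forecast and keep the
    midday entry for each of the next three days."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/forecast",
        params={"q": city, "units": "metric", "appid": api_key},
        timeout=10,
    )
    resp.raise_for_status()
    days = {}
    for entry in resp.json()["list"]:
        date, time = entry["dt_txt"].split()
        if time == "12:00:00" and len(days) < 3:
            days[date] = {
                "condition": entry["weather"][0]["description"],
                "temp_c": entry["main"]["temp"],
            }
    return days

# e.g. three_day_forecast("Oslo", api_key)  # api_key is your own OWM key
```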

What knowledge-dependent tasks were attempted in the Simpsons challenge, and how did the system respond?

The challenge combined computation (square root of nine to determine “spider pigs”), knowledge retrieval (Simpsons’ first episode air date using Wikipedia), and media generation (Homer and Bart images in weather for Springfield USA). Some details were imperfect—there were mismatches like weather/image inconsistencies and a couple of small failures—but the system still generated a usable HTML page (“simpson.html”) with the episode date and weather-based content.

Why is context passing described as essential in this architecture?

Some steps depend on earlier outputs—for example, image URLs returned by the image agent must be passed to the file agent for downloading, and computed values must be inserted into later prompts or code. The system stores each agent’s results in a context list/dictionary and reuses them in subsequent steps, so later agents don’t have to guess.
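
In its simplest form, that store can be a dictionary keyed by step, with later instructions filled in from it. The URL and key names below are invented for illustration:

```python
context = {}

# After the image agent returns, its output URL is recorded...
context["step_1"] = {"image_url": "https://example.com/hamburgers.png"}  # placeholder

# ...and the file agent's instruction is built from it, not guessed.
download_task = "Download {url} and save it as hamburgers.png".format(
    url=context["step_1"]["image_url"]
)
```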

Review Questions

  1. In the art gallery example, which specific outputs must be carried forward as context to make later steps work (e.g., image URLs, computed numbers, or generated HTML)?
  2. How does the strict division of labor between image, code, and file agents reduce planning errors compared with a single all-purpose agent?
  3. What kinds of failures still occurred in the Simpsons challenge, and what does that suggest about limits of planning vs. knowledge accuracy?

Key Points

  1. o1’s strength in this workflow is producing a precise, sequential JSON plan that maps each step to the correct specialized agent.

  2. The system succeeds when each agent has narrow responsibilities: image generation, file operations, or code writing/execution.

  3. Context passing is central: outputs like image URLs and computed values are stored and reused so later steps don’t guess.

  4. Using a code interpreter tool for code execution avoids building a custom execution pipeline and helps keep the workflow end-to-end.

  5. The architecture is extensible: adding a weather agent that calls OpenWeatherMap enables forecast-to-image generation and automatic report writing.

  6. Even with strong planning, knowledge-dependent tasks can still produce small mismatches, showing that planning quality doesn’t eliminate all factual or media-detail errors.

Highlights

o1 generated a sequential JSON plan that successfully produced a working “artpiece HTML” page with generated images and correct layout.
Adding a weather agent (OpenWeatherMap → forecast text → image generation → Markdown report) worked on the first attempt and produced a complete “weather report.”
The Simpsons-themed stress test mixed computation, Wikipedia lookup, and weather-based image generation; despite minor mismatches, it still produced a coherent “simpson.html” page.
