Create Your "Small" Action Model with GPT-4o
Based on All About AI's YouTube video. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
A “small action model” framework turns a short burst of user screen activity into a repeatable automation script by using GPT-4o’s vision to infer what happened and then GPT-4o again to generate code that replays it. The core idea is simple: record screenshots while a person performs actions on a computer, analyze those images to reconstruct the action sequence, convert that sequence into a step-by-step plan, generate Python code to execute the plan, and then run the code to recreate the same workflow.
The workflow is split into two phases. In the recording phase, the system captures screenshots at a fixed rate—set to two frames per second in the example—over a short window (the transcript mentions a 15-second duration). Each screenshot is stored in a folder. In the execute phase, every saved image is converted to base64 and sent to GPT-4o for vision-based analysis. The goal is to determine the sequential order of the user’s actions across the screenshot timeline, effectively turning visual snapshots into an ordered “what happened” description.
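To make the two phases concrete, here is a minimal sketch assuming pyautogui for screen capture and the official openai Python client for the vision call; the folder name, frame naming, and prompt wording are illustrative assumptions rather than the video's exact code.

```python
import base64
import os
import time

import pyautogui          # OS-level screenshot capture (needs Pillow)
from openai import OpenAI

FPS = 2                   # two frames per second, as in the example
DURATION = 15             # seconds; the transcript mentions a 15-second window
FOLDER = "screenshots"    # hypothetical folder name

def record_screenshots() -> None:
    """Recording phase: save one screenshot every 1/FPS seconds."""
    os.makedirs(FOLDER, exist_ok=True)
    for i in range(FPS * DURATION):
        pyautogui.screenshot().save(os.path.join(FOLDER, f"frame_{i:03d}.png"))
        time.sleep(1 / FPS)  # ignores capture time; fine for a sketch

def analyze_screenshots() -> str:
    """Execute phase: base64-encode every frame and ask GPT-4o vision
    for the chronological sequence of user actions."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text",
                "text": "These screenshots were captured in order at 2 fps. "
                        "Describe the sequential user actions they show."}]
    for name in sorted(os.listdir(FOLDER)):  # zero-padded names sort in order
        with open(os.path.join(FOLDER, name), "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```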
Once the action sequence is extracted, GPT-4o is prompted to produce a step-by-step plan for recreating the actions. That plan then becomes input to another GPT-4o call that generates Python code designed to reproduce the same behavior. The generated code uses operating-system controls for mouse and keyboard input rather than browser automation tools like Selenium. The prompt also instructs the model to output only the code, and the project includes a “clean generated code” function to strip away extraneous formatting so the result can be saved as a runnable script (for example, action.py).
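The plan and code-generation steps, including the cleaning pass, might look like the following sketch. The helper names, prompt text, and the fence-stripping logic in clean_generated_code are assumptions based on the description above, not the project's exact source.

````python
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(prompt: str) -> str:
    """Single-turn GPT-4o text completion helper (hypothetical name)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def clean_generated_code(raw: str) -> str:
    """Strip markdown code fences so the output saves as plain Python."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", raw, re.DOTALL)
    return match.group(1).strip() if match else raw.strip()

def generate_action_script(action_sequence: str) -> None:
    # 1) Turn the inferred action sequence into a step-by-step plan.
    plan = ask_gpt4o(
        "Write a numbered, step-by-step plan to recreate these user "
        f"actions on a computer:\n{action_sequence}"
    )
    # 2) Turn the plan into Python that drives the OS mouse/keyboard
    #    (not Selenium), and ask for code only.
    raw = ask_gpt4o(
        "Output ONLY Python code (no explanations) that executes this "
        "plan using OS-level mouse and keyboard control such as "
        f"pyautogui, not Selenium:\n{plan}"
    )
    # 3) Clean and save as a runnable script.
    with open("action.py", "w") as f:
        f.write(clean_generated_code(raw))
````

The cleaning step matters in practice because chat models frequently wrap code in markdown fences even when instructed to output only code.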
A practical test demonstrates the pipeline end-to-end. The system records a brief interaction: opening the Start menu, launching Chrome, navigating to google.com, searching for “Taylor Swift down bad,” and clicking through to the resulting YouTube video. After analysis, the system prints the inferred action sequence and the generated plan, then produces Python code. When executed, the script successfully repeats the same steps (opening Chrome, visiting google.com, entering the query, and selecting the video), matching the recorded behavior closely enough to be considered a “good pass.”
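For concreteness, the generated action.py for this test might resemble the sketch below. Every key press, delay, and click coordinate here is an illustrative assumption; the real script is produced by GPT-4o at run time from the screenshots, so its exact contents will vary.

```python
import time

import pyautogui

pyautogui.press("win")                        # open the Start menu
time.sleep(1)
pyautogui.write("chrome", interval=0.05)      # search for Chrome
pyautogui.press("enter")                      # launch it
time.sleep(3)                                 # wait for the window to appear
pyautogui.write("google.com", interval=0.05)  # type into the address bar
pyautogui.press("enter")
time.sleep(2)
pyautogui.write("Taylor Swift down bad", interval=0.05)  # search query
pyautogui.press("enter")
time.sleep(2)
pyautogui.click(x=400, y=350)  # illustrative position of the result to click
```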
The project is positioned as a starting framework rather than a finished product. The author notes that improvements are likely and invites others to experiment and report ideas. Future directions hinted at in the transcript include expanding capabilities with additional inputs (such as video or voice) and potentially using local vision models to reduce reliance on hosted services. Overall, the approach reframes “small” action automation as a vision-to-plan-to-code pipeline, where short visual evidence becomes executable instructions.
Cornell Notes
The framework builds a “small action model” that recreates a user’s computer actions from screenshots. It records screen captures at a chosen rate (two frames per second in the example) into a folder, converts each image to base64, and uses GPT-4o vision to infer the sequential order of actions. That inferred sequence is turned into a step-by-step plan, which then feeds a code-generation step that outputs Python automation code. The generated code replays actions using OS-level mouse/keyboard controls (avoiding Selenium); running the script reproduces the workflow. A test successfully repeated a short Chrome/Google/YouTube interaction, suggesting the pipeline can translate visual snapshots into runnable automation.
- How does the system turn a burst of screen activity into an ordered action sequence?
- What role does the “step-by-step plan” play between vision analysis and code execution?
- Why does the generated code avoid Selenium, and what does it use instead?
- How is the generated code made runnable, given that LLM outputs can include extra formatting?
- What was the concrete end-to-end test that validated the pipeline?
Review Questions
- What are the inputs and outputs of each phase in the pipeline (recording vs. execute)?
- Why might analyzing a timed series of base64-encoded screenshots with GPT-4o vision work better for action reconstruction than relying on a single screenshot?
- How does using OS mouse/keyboard controls change the kinds of actions the generated code can reliably reproduce compared with browser automation tools?
Key Points
1. The pipeline records short screen sessions as screenshots, then uses GPT-4o vision to infer the chronological sequence of user actions.
2. Every screenshot is base64-encoded and analyzed so the model can reconstruct what happened across time, not just in one frame.
3. An intermediate step-by-step plan converts the inferred action sequence into structured instructions for code generation.
4. Generated Python code replays actions via OS-level mouse and keyboard control rather than Selenium or similar browser automation.
5. A code-cleaning step removes extra formatting so the model’s output can be saved and executed as a runnable script.
6. A test run successfully recreated a Chrome/Google search and YouTube click sequence, indicating the approach can translate visual evidence into automation.
7. The project is designed as a framework for iteration, with room for improvement and potential future expansion to additional inputs and local vision models.