Create Your "Small" Action Model with GPT-4o
Based on All About AI's YouTube video. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
A “small action model” framework turns a short burst of user screen activity into a repeatable automation script by using GPT-4o’s vision to infer what happened and then GPT-4o again to generate code that replays it. The core idea is simple: record screenshots while a person performs actions on a computer, analyze those images to reconstruct the action sequence, convert that sequence into a step-by-step plan, generate Python code to execute the plan, and then run the code to recreate the same workflow.
The workflow is split into two phases. In the recording phase, the system captures screenshots at a fixed rate—set to two frames per second in the example—over a short window (the transcript mentions a 15-second duration). Each screenshot is stored in a folder. In the execute phase, every saved image is converted to base64 and sent to GPT-4o for vision-based analysis. The goal is to determine the sequential order of the user’s actions across the screenshot timeline, effectively turning visual snapshots into an ordered “what happened” description.
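To make the two phases concrete, here is a minimal sketch assuming pyautogui for screen capture and the official openai Python client for the vision call; the folder name, frame naming, and prompt wording are illustrative assumptions rather than the video's exact code.

```python
import base64
import os
import time

import pyautogui          # OS-level screenshot capture (needs Pillow)
from openai import OpenAI

FPS = 2                   # two frames per second, as in the example
DURATION = 15             # seconds; the transcript mentions a 15-second window
FOLDER = "screenshots"    # hypothetical folder name

def record_screenshots() -> None:
    """Recording phase: save one screenshot every 1/FPS seconds."""
    os.makedirs(FOLDER, exist_ok=True)
    for i in range(FPS * DURATION):
        pyautogui.screenshot().save(os.path.join(FOLDER, f"frame_{i:03d}.png"))
        time.sleep(1 / FPS)  # ignores capture time; fine for a sketch

def analyze_screenshots() -> str:
    """Execute phase: base64-encode every frame and ask GPT-4o vision
    for the chronological sequence of user actions."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text",
                "text": "These screenshots were captured in order at 2 fps. "
                        "Describe the sequential user actions they show."}]
    for name in sorted(os.listdir(FOLDER)):  # zero-padded names sort in order
        with open(os.path.join(FOLDER, name), "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```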
Once the action sequence is extracted, GPT-4o is prompted to produce a step-by-step plan for recreating the actions. That plan then becomes input to another GPT-4o call that generates Python code designed to reproduce the same behavior. The generated code uses operating-system controls for mouse and keyboard input rather than browser automation tools like Selenium. The prompt also instructs the model to output only the code, and the project includes a “clean generated code” function to strip away extraneous formatting so the result can be saved as a runnable script (for example, action.py).
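The plan and code-generation steps, including the cleaning pass, might look like the following sketch. The helper names, prompt text, and the fence-stripping logic in clean_generated_code are assumptions based on the description above, not the project's exact source.

````python
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(prompt: str) -> str:
    """Single-turn GPT-4o text completion helper (hypothetical name)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def clean_generated_code(raw: str) -> str:
    """Strip markdown code fences so the output saves as plain Python."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", raw, re.DOTALL)
    return match.group(1).strip() if match else raw.strip()

def generate_action_script(action_sequence: str) -> None:
    # 1) Turn the inferred action sequence into a step-by-step plan.
    plan = ask_gpt4o(
        "Write a numbered, step-by-step plan to recreate these user "
        f"actions on a computer:\n{action_sequence}"
    )
    # 2) Turn the plan into Python that drives the OS mouse/keyboard
    #    (not Selenium), and ask for code only.
    raw = ask_gpt4o(
        "Output ONLY Python code (no explanations) that executes this "
        "plan using OS-level mouse and keyboard control such as "
        f"pyautogui, not Selenium:\n{plan}"
    )
    # 3) Clean and save as a runnable script.
    with open("action.py", "w") as f:
        f.write(clean_generated_code(raw))
````

The cleaning step matters in practice because chat models frequently wrap code in markdown fences even when instructed to output only code.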
A practical test demonstrates the pipeline end-to-end. The system records a brief interaction: opening the Start menu, launching Chrome, navigating to google.com, searching for “Taylor Swift down bad,” and clicking through to the resulting YouTube video. After analysis, the system prints the inferred action sequence and the generated plan, then produces Python code. When executed, the script successfully repeats the same steps (opening Chrome, visiting google.com, entering the query, and selecting the video), matching the recorded behavior closely enough to be considered a “good pass.”
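For concreteness, the generated action.py for this test might resemble the sketch below. Every key press, delay, and click coordinate here is an illustrative assumption; the real script is produced by GPT-4o at run time from the screenshots, so its exact contents will vary.

```python
import time

import pyautogui

pyautogui.press("win")                        # open the Start menu
time.sleep(1)
pyautogui.write("chrome", interval=0.05)      # search for Chrome
pyautogui.press("enter")                      # launch it
time.sleep(3)                                 # wait for the window to appear
pyautogui.write("google.com", interval=0.05)  # type into the address bar
pyautogui.press("enter")
time.sleep(2)
pyautogui.write("Taylor Swift down bad", interval=0.05)  # search query
pyautogui.press("enter")
time.sleep(2)
pyautogui.click(x=400, y=350)  # illustrative position of the result to click
```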
The project is positioned as a starting framework rather than a finished product. The author notes that improvements are likely and invites others to experiment and report ideas. Future directions hinted at in the transcript include expanding capabilities with additional inputs (such as video or voice) and potentially using local vision models to reduce reliance on hosted services. Overall, the approach reframes “small” action automation as a vision-to-plan-to-code pipeline, where short visual evidence becomes executable instructions.
Cornell Notes
The framework builds a “small action model” that recreates a user’s computer actions from screenshots. It records screen captures at a chosen rate (two frames per second in the example) into a folder, converts each image to base64, and uses GPT-4o vision to infer the sequential order of actions. That inferred sequence is turned into a step-by-step plan, which then feeds a code-generation step that outputs Python automation code. The generated code replays actions using OS-level mouse/keyboard controls (avoiding Selenium); running the script reproduces the workflow. A test successfully repeated a short Chrome/Google/YouTube interaction, suggesting the pipeline can translate visual snapshots into runnable automation.
- How does the system turn a burst of screen activity into an ordered action sequence?
- What role does the “step-by-step plan” play between vision analysis and code execution?
- Why does the generated code avoid Selenium, and what does it use instead?
- How is the generated code made runnable, given that LLM outputs can include extra formatting?
- What was the concrete end-to-end test that validated the pipeline?
Review Questions
- What are the inputs and outputs of each phase in the pipeline (recording vs. execute)?
- Why might analyzing a timed series of base64-encoded screenshots with GPT-4o vision work better for action reconstruction than relying on a single screenshot?
- How does using OS mouse/keyboard controls change the kinds of actions the generated code can reliably reproduce compared with browser automation tools?
Key Points
1. The pipeline records short screen sessions as screenshots, then uses GPT-4o vision to infer the chronological sequence of user actions.
2. Every screenshot is base64-encoded and analyzed so the model can reconstruct what happened across time, not just in one frame.
3. An intermediate step-by-step plan converts the inferred action sequence into structured instructions for code generation.
4. Generated Python code replays actions via OS-level mouse and keyboard control rather than Selenium or similar browser automation.
5. A code-cleaning step removes extra formatting so the model’s output can be saved and executed as a runnable script.
6. A test run successfully recreated a Chrome/Google search and YouTube click sequence, indicating the approach can translate visual evidence into automation.
7. The project is designed as a framework for iteration, with room for improvement and potential future expansion to additional inputs and local vision models.