Claude Sonnet 4.5 | On The Edge #1

All About AI · 4 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude Sonnet 4.5 produced a working macOS app from documentation even when the documentation was Python-only and the generated implementation was in Go.

Briefing

Claude Sonnet 4.5 is being positioned as a top-tier coding model—especially for building “complex agents” and working with tools—yet its real-world impact in this test comes down to something simpler: it can turn documentation into a working macOS app and then generate videos on demand with minimal friction.

The tester started with Anthropic’s claims from the release blog—strong performance on reasoning and math benchmarks, and a claim of being the “most aligned” model so far—then focused on a hands-on coding workflow. Using browser-based prompting, they uploaded documentation for a video-generation app (built around an API key, an image upload, and a prompt) and instructed the model to produce an executable app for macOS. Even though the documentation was written in Python and there was no Go or C++ reference material, the model produced a complete implementation in Go, along with build artifacts and generated code. The workflow effectively handled the cross-language translation from Python documentation to Go code without the user needing to manually rewrite major components.

Next, the generated Go code was moved into Claude Code for refinement and execution. The result was a functioning macOS application: users can enter an API key, upload an image, add a prompt, and request a generated video. In a live run, the app accepted a prompt like “a beautiful cinematic video with smooth camera movement,” sent the image plus prompt to the API, and returned a video. The tester then opened the output via a URL and watched a short generated clip—described as smooth and working as expected—before running additional prompts.

To sanity-check coding quality beyond “it compiles,” the tester ran Simon Willison’s well-known “Pelican test” (prompting the model to draw a pelican riding a bicycle as an SVG), comparing Sonnet 4.5’s output to Willison’s earlier results. The outputs looked very similar, suggesting Sonnet 4.5 performs at least competitively on a standardized coding challenge.

The overall takeaway is not that Sonnet 4.5 is a total revolution, but that it delivers incremental gains that matter in practice: fast, tool-capable agent behavior, strong one-shot code generation from documentation, and reliable cross-language implementation (Python docs → Go app) that works out of the box. The tester plans more follow-up work—especially deeper testing in Claude Code and additional evaluation of the API—while also building a longer-running “on the edge” series with tier lists across categories like video, image, text, and coding agents.

Cornell Notes

Claude Sonnet 4.5 is presented as a leading coding model, and the test here focuses on practical outcomes: turning documentation into a working macOS app and using it to generate videos. Starting from Python-based documentation, the model produced Go code and build artifacts without needing Go/C++ docs, then Claude Code was used to run the resulting app. The app accepted an API key, an image, and a prompt, and successfully returned a generated video via a URL. A separate Pelican test comparison (linked to Simon Willison’s long-running benchmark) produced results that looked very similar, reinforcing that the coding quality is competitive. The tester’s verdict: strong first impressions and incremental progress, with more evaluation still needed.

What was the most concrete “coding agent” capability demonstrated with Claude Sonnet 4.5?

It generated a complete macOS app from documentation and then produced working video output. The workflow took an app spec requiring an API key, an image upload, and a prompt, and resulted in a functioning interface where the user could generate a video and open the returned URL in a browser.

How did the test handle the fact that the documentation was written in Python while the generated app was in Go?

The documentation was Python-only, with no Go or C++ references. Despite that, the model produced Go code and build artifacts, effectively translating the Python-described behavior into a Go implementation that ran on macOS.

What did the tester use to validate coding quality beyond “it runs”?

A Pelican test associated with Simon Willison. The tester ran the same style of benchmark on Sonnet 4.5 and compared the output to Willison’s earlier Pelican results; the outputs were described as very similar.

What evidence suggested the video-generation pipeline worked end-to-end?

After entering an API key, uploading an image, and submitting a prompt (“a beautiful cinematic video with smooth camera movement”), the app returned a generated video. The tester then copied a URL and watched the resulting 5-second clip in the browser.

What is the tester’s overall conclusion about Sonnet 4.5’s impact?

Not a “revolution,” but meaningful incremental improvements—especially speed, tool calling/agentic behavior, and reliable one-shot generation from documentation across languages. More testing is planned in Claude Code and with additional API checks.

Review Questions

  1. What steps were required to go from uploaded documentation to a running macOS app, and what language mismatch was resolved?
  2. How did the Pelican test function as a coding-quality check in this workflow?
  3. What specific user inputs did the generated app require to produce a video, and what was the observed output behavior?

Key Points

  1. Claude Sonnet 4.5 produced a working macOS app from documentation even when the documentation was Python-only and the generated implementation was in Go.
  2. The generated app supported an end-to-end video workflow: API key entry, image upload, prompt submission, and video output via a URL.
  3. In a live run, the app generated a short (5-second) cinematic-style video clip and returned it successfully for browser playback.
  4. A Pelican test comparison against Simon Willison’s benchmark output looked very similar, suggesting competitive coding performance beyond basic compilation.
  5. The tester’s first impression emphasizes speed and agentic tool calling, with cross-language translation handled smoothly in one shot.
  6. The overall verdict is incremental progress rather than a total step-change, with more evaluation planned in Claude Code and deeper API testing.

Highlights

Sonnet 4.5 turned Python-only documentation into a Go-based macOS app that ran and generated videos on demand.
A single workflow—API key + image + prompt—produced a playable video returned through a URL.
Pelican test outputs were described as very similar to Simon Willison’s earlier results, reinforcing coding competitiveness.
The tester saw strong speed and tool-calling behavior, but framed the change as incremental rather than revolutionary.
