Claude 4 - First Tests and Impressions (Claude Code, MCP API, Code Execution Tool++)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Claude 4’s new tooling is getting a hands-on stress test: Anthropic’s Claude Code can now run Python inside a secure sandbox via a “code execution tool,” and the Claude API can connect to external capabilities through MCP servers. The practical takeaway from these first experiments is that the workflow is fast to wire up, produces usable execution reports, and can be chained into larger apps—while cost remains relatively modest for small runs.
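For orientation, here is a minimal sketch of what wiring up the code execution tool looks like through the Anthropic Python SDK. The model name, beta flag, and tool type follow Anthropic's code-execution beta as published around the Claude 4 launch, but treat the exact identifiers as assumptions to verify against current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask Claude Opus 4 to write Python and run it in Anthropic's sandbox.
# The beta flag and tool type follow the code-execution beta documented
# around the Claude 4 launch; verify them against current Anthropic docs.
response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    betas=["code-execution-2025-05-22"],
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{
        "role": "user",
        "content": "Generate 20 random numbers, compute the mean, median, "
                   "and standard deviation, flag outliers, and report results.",
    }],
)

# The response interleaves text blocks with code-execution tool results.
for block in response.content:
    print(block.type)
```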
The testing begins with Claude Code identifying which Claude 4 model is active, then moves into a sandbox execution trial. A Claude Sonnet agent brainstorms a solution to a user problem, and the proposed Python code is sent to Claude Opus for execution in the sandbox. The example generates 20 random numbers, computes mean, median, and standard deviation, builds a histogram, and flags outliers. The execution tool returns a structured “final report” covering workflow success, analysis, edge cases and limitations (such as empty lists and single-element inputs), and a code-quality assessment. The experimenter then runs the generated code locally multiple times to verify behavior: the mean varies run to run but trends toward the expected value when scaling up to 1,000 random samples, hovering near 50. A quick cost check via “/cost” lands at almost $1, suggesting sandbox execution is not prohibitively expensive for iterative development.
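The video doesn't show the generated script verbatim; the following is a plausible reconstruction of the kind of code Sonnet proposed and Opus executed, with the outlier rule and histogram bucketing chosen here as assumptions:

```python
import random
import statistics

def analyze(n: int, low: float = 0.0, high: float = 100.0):
    """Generate n random numbers and summarize them."""
    if n == 0:
        raise ValueError("empty sample")  # edge case flagged in the report
    values = [random.uniform(low, high) for _ in range(n)]

    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values) if n > 1 else 0.0  # single-element edge case

    # Flag outliers more than 2 standard deviations from the mean
    # (a common convention; the video doesn't specify the exact rule).
    outliers = [v for v in values if stdev and abs(v - mean) > 2 * stdev]

    # Crude text histogram with 10 equal-width buckets.
    width = (high - low) / 10
    buckets = [0] * 10
    for v in values:
        buckets[min(int((v - low) / width), 9)] += 1
    for i, count in enumerate(buckets):
        print(f"{low + i * width:6.1f}-{low + (i + 1) * width:6.1f} {'#' * count}")

    return mean, median, stdev, outliers

# Small samples bounce around; 1,000 uniform draws on [0, 100]
# should settle near the expected mean of 50.
for n in (20, 1000):
    mean, *_ = analyze(n)
    print(f"n={n}: mean={mean:.2f}")
```

The final loop mirrors the local verification step: the n=20 means are noisy, while the n=1000 means hover near 50.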
Next comes a more ambitious app built with Claude Code and Claude Sonnet: a webcam-based object-to-poem-and-voice pipeline. The system captures objects from a live camera feed, identifies them, and uses Claude Opus 4 to write a short poem once three objects are detected. That poem is then sent to the ElevenLabs API for text-to-speech using a configured voice ID, and the audio is played back in the UI. Early runs show the concept working quickly, but the initial, less capable object-identification approach mislabels items (e.g., confusing a sports ball). The app is then upgraded to use Claude's vision capabilities for object identification, with the workflow adjusted to take a screenshot, extract object names from the vision response, and populate the list. Subsequent tests improve accuracy: a protein bar is recognized, Bose headphones are identified, and a coffee mug is correctly labeled; the resulting spoken poems match the detected objects and arrive rapidly, implying asynchronous handling in the app.
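The app's source isn't shown, but the upgraded identification step maps onto standard APIs: grab a frame with OpenCV, send it to Claude as a base64-encoded image, and hand the resulting poem to ElevenLabs' text-to-speech REST endpoint. Everything below (prompt wording, voice ID, keys, and the placeholder poem step) is illustrative rather than the video's actual code.

```python
import base64

import anthropic
import cv2       # pip install opencv-python
import requests

client = anthropic.Anthropic()

# 1) Capture one frame from the webcam and encode it as base64 JPEG.
ok, frame = cv2.VideoCapture(0).read()
assert ok, "no frame captured from the webcam"
_, jpeg = cv2.imencode(".jpg", frame)
image_b64 = base64.b64encode(jpeg.tobytes()).decode()

# 2) Ask Claude vision to name the objects in the frame.
vision = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "List the distinct objects you see, one per line."},
        ],
    }],
)
objects = [l.strip() for l in vision.content[0].text.splitlines() if l.strip()]

# 3) Once three objects are collected, a second text-only call would write
#    the poem; a placeholder stands in for it here.
poem = f"A short poem about {', '.join(objects[:3])}"  # hypothetical stand-in

# 4) Speak the poem via ElevenLabs' text-to-speech REST endpoint.
VOICE_ID = "YOUR_VOICE_ID"  # hypothetical; use your configured voice ID
audio = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "YOUR_ELEVENLABS_KEY"},
    json={"text": poem},
)
with open("poem.mp3", "wb") as f:
    f.write(audio.content)
```

In a real app the capture loop, vision calls, and playback would run asynchronously, consistent with how quickly the spoken poems arrive in the demo.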
Finally, the MCP connector is demonstrated as a plug-in-style integration. Using an MCP server URL (sourced from mcpservers.org), Claude Opus is prompted to compute a math expression via a “sequential thinking” MCP tool, producing a step-by-step order-of-operations breakdown (PEMDAS) and a verified result. The experiment then swaps the MCP server to a “fetch” tool, showing that changing the MCP server in the API configuration redirects Claude from reasoning tools to content-fetching tools with minimal code changes.
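Anthropic's API exposes this through an mcp_servers parameter on the Messages API, gated behind a beta flag at the time of these tests. The sketch below assumes those identifiers and uses a placeholder server URL and prompt, so check the current MCP connector docs before relying on it.

```python
import anthropic

client = anthropic.Anthropic()

# Point Claude at a remote MCP server. Swapping the URL and name is all it
# takes to trade the sequential-thinking tools for the fetch tools.
mcp_server = {
    "type": "url",
    "url": "https://example.com/sequential-thinking/mcp",  # placeholder URL
    "name": "sequential-thinking",
}

response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,
    betas=["mcp-client-2025-04-04"],  # MCP connector beta flag at launch
    mcp_servers=[mcp_server],
    messages=[{
        "role": "user",
        "content": "Compute 3 + 4 * 2 step by step and verify the result.",
    }],
)
print(response.content)
```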
Overall, these first impressions frame Claude 4’s value less as a single model upgrade and more as an orchestration layer: sandboxed execution, vision-driven app logic, and MCP-based tool connectivity combine into workflows that can iterate quickly, verify outputs, and expand into external services.
Cornell Notes
Claude 4’s new Claude Code workflow adds two capabilities that matter in practice: sandboxed Python execution and MCP-based tool connectivity. In the sandbox test, Claude Sonnet proposes Python, Claude Opus executes it in a secure environment, and the tool returns a structured report including edge cases and code-quality notes. The generated code is then run repeatedly to confirm statistical behavior (means vary for small samples and converge near 50 for 1,000 samples), and a cost check shows the execution run is relatively inexpensive (about $1). A larger app chains webcam object detection → Claude Opus poem writing → ElevenLabs text-to-speech, with improved object accuracy after switching to Claude vision. MCP servers can be swapped to change what tools Claude can call, demonstrated with both sequential math reasoning and URL fetching.
How does the sandbox code execution tool fit into a Claude Code workflow?
What evidence shows the generated Python code is behaving correctly?
What does the cost check reveal about using the execution tool?
How is the webcam app pipeline structured, and what changes improved object recognition?
What does the MCP connector demonstration show about tool integration?
Review Questions
- In the sandbox execution test, what are the three stages from brainstorming to user-facing results, and which model handles each stage?
- Why does increasing the sample size from 20 to 1,000 random numbers matter for validating the generated statistics code?
- What two changes were made to the webcam app to improve object identification accuracy, and how did that affect the final spoken output?
Key Points
1. Claude 4’s Claude Code workflow can generate Python, then run it in a secure sandbox via a code execution tool, returning structured execution reports.
2. Sandbox execution reports include workflow success plus practical details like edge cases/limitations and code-quality assessments.
3. Repeated local runs confirm statistical correctness: small samples produce variable means, while 1,000 samples converge near the expected value around 50.
4. A webcam-to-poem-to-speech app can be built by chaining object detection, Claude Opus poem generation, and the ElevenLabs API for voice playback.
5. Switching object identification to Claude vision (screenshot → vision response → extracted labels) improves recognition accuracy compared with a less precise approach.
6. MCP connector usage is modular: changing the MCP server endpoint (e.g., sequential thinking vs. fetch) changes what tools Claude can call with minimal integration work.