Claude 4 - First Tests and Impressions (Claude Code, MCP API, Code Execution Tool++)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Claude 4’s new tooling is getting a hands-on stress test: Anthropic’s Claude Code can now run Python inside a secure sandbox via a “code execution tool,” and the Claude API can connect to external capabilities through MCP servers. The practical takeaway from these first experiments is that the workflow is fast to wire up, produces usable execution reports, and can be chained into larger apps—while cost remains relatively modest for small runs.
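For orientation, here is a minimal sketch of what wiring up the code execution tool looks like through the Anthropic Python SDK. The model name, beta flag, and tool type follow Anthropic's code-execution beta as published around the Claude 4 launch, but treat the exact identifiers as assumptions to verify against current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask Claude Opus 4 to write Python and run it in Anthropic's sandbox.
# The beta flag and tool type follow the code-execution beta documented
# around the Claude 4 launch; verify them against current Anthropic docs.
response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    betas=["code-execution-2025-05-22"],
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{
        "role": "user",
        "content": "Generate 20 random numbers, compute the mean, median, "
                   "and standard deviation, flag outliers, and report results.",
    }],
)

# The response interleaves text blocks with code-execution tool results.
for block in response.content:
    print(block.type)
```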
The testing begins with Claude Code identifying which Claude 4 model is active, then moves into a sandbox execution trial. A Claude Sonnet agent brainstorms a solution to a user problem, and the proposed Python code is sent to Claude Opus for execution in the sandbox. The example generates 20 random numbers, computes mean, median, and standard deviation, builds a histogram, and flags outliers. The execution tool returns a structured “final report” covering workflow success, analysis, edge cases and limitations (such as empty lists and single-element inputs), and a code-quality assessment. The experimenter then runs the generated code locally multiple times to verify behavior: the mean varies run to run but trends toward the expected value when scaling up to 1,000 random samples, hovering near 50. A quick cost check via “/cost” lands at almost $1, suggesting sandbox execution is not prohibitively expensive for iterative development.
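The video doesn't show the generated script verbatim; the following is a plausible reconstruction of the kind of code Sonnet proposed and Opus executed, with the outlier rule and histogram bucketing chosen here as assumptions:

```python
import random
import statistics

def analyze(n: int, low: float = 0.0, high: float = 100.0):
    """Generate n random numbers and summarize them."""
    if n == 0:
        raise ValueError("empty sample")  # edge case flagged in the report
    values = [random.uniform(low, high) for _ in range(n)]

    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values) if n > 1 else 0.0  # single-element edge case

    # Flag outliers more than 2 standard deviations from the mean
    # (a common convention; the video doesn't specify the exact rule).
    outliers = [v for v in values if stdev and abs(v - mean) > 2 * stdev]

    # Crude text histogram with 10 equal-width buckets.
    width = (high - low) / 10
    buckets = [0] * 10
    for v in values:
        buckets[min(int((v - low) / width), 9)] += 1
    for i, count in enumerate(buckets):
        print(f"{low + i * width:6.1f}-{low + (i + 1) * width:6.1f} {'#' * count}")

    return mean, median, stdev, outliers

# Small samples bounce around; 1,000 uniform draws on [0, 100]
# should settle near the expected mean of 50.
for n in (20, 1000):
    mean, *_ = analyze(n)
    print(f"n={n}: mean={mean:.2f}")
```

The final loop mirrors the local verification step: the n=20 means are noisy, while the n=1000 means hover near 50.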
Next comes a more ambitious app built with Claude Code and Claude Sonnet: a webcam-based object-to-poem-and-voice pipeline. The system captures objects from a live camera feed, identifies them, and uses Claude Opus 4 to write a short poem once three objects are detected. That poem is then sent to the ElevenLabs API for text-to-speech using a configured voice ID, and the audio is played back in the UI. Early runs show the concept working quickly, but the initial, less capable object-identification approach mislabels items (e.g., confusing a sports ball). The app is then upgraded to use Claude's vision capabilities for object identification, with the workflow adjusted to take a screenshot, extract object names from the vision response, and populate the list. Subsequent tests improve accuracy: a protein bar is recognized, Bose headphones are identified, and a coffee mug is correctly labeled; the resulting spoken poems match the detected objects and arrive rapidly, implying asynchronous handling in the app.
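The app's source isn't shown, but the upgraded identification step maps onto standard APIs: grab a frame with OpenCV, send it to Claude as a base64-encoded image, and hand the resulting poem to ElevenLabs' text-to-speech REST endpoint. Everything below (prompt wording, voice ID, keys, and the placeholder poem step) is illustrative rather than the video's actual code.

```python
import base64

import anthropic
import cv2       # pip install opencv-python
import requests

client = anthropic.Anthropic()

# 1) Capture one frame from the webcam and encode it as base64 JPEG.
ok, frame = cv2.VideoCapture(0).read()
assert ok, "no frame captured from the webcam"
_, jpeg = cv2.imencode(".jpg", frame)
image_b64 = base64.b64encode(jpeg.tobytes()).decode()

# 2) Ask Claude vision to name the objects in the frame.
vision = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "List the distinct objects you see, one per line."},
        ],
    }],
)
objects = [l.strip() for l in vision.content[0].text.splitlines() if l.strip()]

# 3) Once three objects are collected, a second text-only call would write
#    the poem; a placeholder stands in for it here.
poem = f"A short poem about {', '.join(objects[:3])}"  # hypothetical stand-in

# 4) Speak the poem via ElevenLabs' text-to-speech REST endpoint.
VOICE_ID = "YOUR_VOICE_ID"  # hypothetical; use your configured voice ID
audio = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "YOUR_ELEVENLABS_KEY"},
    json={"text": poem},
)
with open("poem.mp3", "wb") as f:
    f.write(audio.content)
```

In a real app the capture loop, vision calls, and playback would run asynchronously, consistent with how quickly the spoken poems arrive in the demo.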
Finally, the MCP connector is demonstrated as a plug-in-style integration. Using an MCP server URL (sourced from mcpservers.org), Claude Opus is prompted to compute a math expression via a “sequential thinking” MCP tool, producing a step-by-step order-of-operations breakdown (PEMDAS) and a verified result. The experiment then swaps the MCP server to a “fetch” tool, showing that changing the MCP server in the API configuration redirects Claude from reasoning tools to content-fetching tools with minimal code changes.
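Anthropic's API exposes this through an mcp_servers parameter on the Messages API, gated behind a beta flag at the time of these tests. The sketch below assumes those identifiers and uses a placeholder server URL and prompt, so check the current MCP connector docs before relying on it.

```python
import anthropic

client = anthropic.Anthropic()

# Point Claude at a remote MCP server. Swapping the URL and name is all it
# takes to trade the sequential-thinking tools for the fetch tools.
mcp_server = {
    "type": "url",
    "url": "https://example.com/sequential-thinking/mcp",  # placeholder URL
    "name": "sequential-thinking",
}

response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,
    betas=["mcp-client-2025-04-04"],  # MCP connector beta flag at launch
    mcp_servers=[mcp_server],
    messages=[{
        "role": "user",
        "content": "Compute 3 + 4 * 2 step by step and verify the result.",
    }],
)
print(response.content)
```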
Overall, these first impressions frame Claude 4’s value less as a single model upgrade and more as an orchestration layer: sandboxed execution, vision-driven app logic, and MCP-based tool connectivity combine into workflows that can iterate quickly, verify outputs, and expand into external services.
Cornell Notes
Claude 4’s new Claude Code workflow adds two capabilities that matter in practice: sandboxed Python execution and MCP-based tool connectivity. In the sandbox test, Claude Sonnet proposes Python, Claude Opus executes it in a secure environment, and the tool returns a structured report including edge cases and code-quality notes. The generated code is then run repeatedly to confirm statistical behavior (means vary for small samples and converge near 50 for 1,000 samples), and a cost check shows the execution run is relatively inexpensive (about $1). A larger app chains webcam object detection → Claude Opus poem writing → ElevenLabs text-to-speech, with improved object accuracy after switching to Claude vision. MCP servers can be swapped to change what tools Claude can call, demonstrated with both sequential math reasoning and URL fetching.
How does the sandbox code execution tool fit into a Claude Code workflow?
What evidence shows the generated Python code is behaving correctly?
What does the cost check reveal about using the execution tool?
How is the webcam app pipeline structured, and what changes improved object recognition?
What does the MCP connector demonstration show about tool integration?
Review Questions
- In the sandbox execution test, what are the three stages from brainstorming to user-facing results, and which model handles each stage?
- Why does increasing the sample size from 20 to 1,000 random numbers matter for validating the generated statistics code?
- What two changes were made to the webcam app to improve object identification accuracy, and how did that affect the final spoken output?
Key Points
1. Claude 4’s Claude Code workflow can generate Python, then run it in a secure sandbox via a code execution tool, returning structured execution reports.
2. Sandbox execution reports include workflow success plus practical details like edge cases/limitations and code-quality assessments.
3. Repeated local runs confirm statistical correctness: small samples produce variable means, while 1,000 samples converge near the expected value around 50.
4. A webcam-to-poem-to-speech app can be built by chaining object detection, Claude Opus poem generation, and the ElevenLabs API for voice playback.
5. Switching object identification to Claude vision (screenshot → vision response → extracted labels) improves recognition accuracy compared with a less precise approach.
6. MCP connector usage is modular: changing the MCP server endpoint (e.g., sequential thinking vs. fetch) changes what tools Claude can call with minimal integration work.