Use THIS Today to Make Your Local LLM Smarter + Claude 3 Opus Tips
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical “local LLM smarter” workflow hinges on one decision: only run Python when a question truly needs computation, then feed the code’s output back into the model as retrieval-like context. The result is a system that can answer ordinary questions normally, but can also calculate, fetch live data, and even self-correct when the first attempt fails—without forcing heavy tooling on every prompt.
The core mechanism starts with a classifier step called `should_use_code`. A system message instructs the model to judge whether Python execution would improve the answer; it returns “Yes” for calculation-heavy queries and “No” when natural language is sufficient. When “Yes” triggers, the system generates Python code, executes it in a terminal, stores the result in a `code_output` variable, and then prompts the model again using both the original user question and the `code_output` as context. The author frames this second stage as similar to RAG: the model doesn’t just guess—it grounds its final response in the computed or fetched output.
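The two-stage flow can be sketched as follows. This is a minimal sketch, not the video's exact code: the `chat(system, user)` callable, the prompt wording, and the helper names are all illustrative assumptions.

```python
# Sketch of the decision -> execute -> re-prompt flow described above.
# `chat(system, user)` stands in for any chat-completion call; prompt
# wording and helper names are assumptions, not the video's code.
import subprocess
import sys

def should_use_code(chat, question):
    """Classifier step: ask the model whether Python would improve the answer."""
    verdict = chat(
        system="Answer only 'Yes' or 'No': would running Python code "
               "(for computation or external data) improve this answer?",
        user=question,
    )
    return verdict.strip().lower().startswith("yes")

def run_python(code):
    """Execute generated code in a subprocess and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()

def answer(chat, question):
    if not should_use_code(chat, question):
        return chat(system="Answer the question directly.", user=question)
    code = chat(
        system="Write Python that prints the value needed to answer.",
        user=question,
    )
    code_output = run_python(code)
    # RAG-like grounding: the final prompt sees the question AND the output.
    return chat(
        system="Use the code output as context to answer the question.",
        user=f"Question: {question}\nCode output: {code_output}",
    )
```

The key design point is that the final model call never has to guess: the computed `code_output` is injected as context, exactly as retrieved documents would be in a RAG pipeline.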
The workflow is first tested with GPT-3.5 via the OpenAI API, using a custom arithmetic check (the transcript mentions a target of 36 apples). An initial run produces an incorrect value (32), but the system then generates corrective code and reaches the correct total of 36 after the self-correction loop.
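That 32 → 36 recovery suggests a retry loop around execution. A hedged, self-contained sketch of such a loop, where the helper names, prompts, and the `check` callback are all illustrative assumptions:

```python
# Hedged sketch of the self-correction loop (32 -> 36 in the transcript):
# if the executed result fails a check, the failing code and output are
# fed back so the model can produce corrected code, and the loop repeats.
import subprocess
import sys

def _run(code):
    """Run generated Python and return its stdout (stderr ignored here)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout.strip()

def answer_with_retry(chat, question, check, max_tries=3):
    code = chat(system="Write Python that prints the answer.", user=question)
    output = ""
    for _ in range(max_tries):
        output = _run(code)
        if check(output):
            return output
        # Feed the failing code and its output back for correction.
        code = chat(
            system="The previous code gave a wrong result. Write fixed Python.",
            user=f"Question: {question}\nCode:\n{code}\nOutput: {output}",
        )
    return output
```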
Next, the same logic is swapped from OpenAI calls to a local model served through LM Studio. Using a Mistral 7B OpenHermes 2.5 setup, the system handles live queries that require computation or external data. For “What is the price of Bitcoin today,” it installs needed packages, runs code, and returns a numeric price (the transcript shows 72287) and a natural-language confirmation. For “What is the weather forecast for London this upcoming weekend,” it uses an OpenWeatherMap API key and returns a forecast with specific fields like temperature and humidity (the transcript notes a Fahrenheit-looking temperature and includes wind speed/humidity details).
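Swapping from OpenAI to the local model is mostly an endpoint change, because LM Studio exposes an OpenAI-compatible chat-completions server (port 1234 by default). A minimal stdlib sketch; the model identifier is an assumption to check against what your LM Studio server tab reports:

```python
# Minimal stdlib client for LM Studio's OpenAI-compatible server.
# Port 1234 is LM Studio's default; the model name is an assumption.
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(system, user, model="openhermes-2.5-mistral-7b"):
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.2,
    }

def chat(system, user):
    """POST to the local server and return the assistant message text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(system, user)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The rest of the orchestration (the `should_use_code` decision, code execution, and the grounding prompt) stays unchanged; only the transport differs.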
A more complex finance question—buying one Bitcoin a year ago, then computing today’s value and percentage gain—tests API reliability. A CoinDesk-based attempt fails with “unable to fetch historical price,” prompting a switch back to GPT-3.5 and the CoinGecko API. That change produces a plausible result: one-year-ago cost around $24,178 and a gain of about 195%, which the author cross-checks against a one-year chart.
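The fixed finance step reduces to fetching two prices and computing a percentage gain. A hedged sketch: the CoinGecko endpoint shape, its `dd-mm-yyyy` date format, the example date, and the response field path are assumptions to verify against CoinGecko's current API documentation.

```python
# Percentage-gain arithmetic plus the kind of CoinGecko history URL the
# fix relied on. Endpoint shape and response fields are assumptions.
import json
import urllib.request

def history_url(coin_id, date_ddmmyyyy):
    """CoinGecko historical snapshot URL, e.g. date_ddmmyyyy='14-03-2023'."""
    return (f"https://api.coingecko.com/api/v3/coins/{coin_id}/history"
            f"?date={date_ddmmyyyy}")

def percent_gain(past_price, current_price):
    """Gain of holding from past_price to current_price, in percent."""
    return (current_price - past_price) / past_price * 100.0

def fetch_past_usd_price(coin_id, date_ddmmyyyy):
    """Live network call -- requires internet; field path is an assumption."""
    with urllib.request.urlopen(history_url(coin_id, date_ddmmyyyy)) as resp:
        data = json.load(resp)
    return data["market_data"]["current_price"]["usd"]
```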
Finally, Claude 3 Opus is used to add a new safety/UX feature to the codebase: before executing generated Python, the system asks the user to confirm with “y” or “n.” When “n” is entered, it generates alternative code and returns to the confirmation step. The author reports that implementing this feature took only a few minutes and that the updated flow still supports API calls and produces correct outputs after confirmation.
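The confirmation gate described is a small loop wrapped around the execution step. A sketch with an injectable `ask` callable (defaulting to `input`) so the loop can be exercised without an interactive prompt; `chat` and `run` are illustrative stand-ins for the orchestration's model call and code runner:

```python
# Sketch of the y/n confirmation gate: generated code runs only after the
# user types 'y'; on 'n' the model is asked for alternative code and the
# confirmation prompt repeats. `chat` and `run` are illustrative names.
def confirm_and_run(chat, run, code, ask=input, max_alternatives=3):
    for _ in range(max_alternatives):
        print("Proposed code:\n" + code)
        choice = ask("Execute this code? [y/n] ").strip().lower()
        if choice == "y":
            return run(code)
        # 'n' (or anything else): regenerate and return to confirmation.
        code = chat(
            system="The user rejected this code. Generate an alternative.",
            user=code,
        )
    return None  # user never approved within the allowed attempts
```

Gating execution behind an explicit approval is a meaningful safety improvement, since model-generated code otherwise runs unreviewed on the host machine.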
Cornell Notes
The system improves answers by running Python only when it’s beneficial. It first uses a `should_use_code` step to decide whether a question needs computation; if not, it answers directly. If yes, it generates Python, executes it, saves the result as `code_output`, then prompts the model again using the original question plus that output as context (a RAG-like grounding step). Tests show arithmetic self-correction (32 → 36 apples), live data retrieval for Bitcoin price and London weather using APIs, and a finance calculation that required switching from CoinDesk to CoinGecko after an API fetch failed. Claude 3 Opus is then used to add a user-confirmation gate (“y/n”) before code execution and to regenerate code when permission is denied.
How does the workflow decide whether to generate and run Python at all?
What happens after Python is executed successfully?
What evidence shows the system can correct mistakes?
How does the system work with local models?
Why did the Bitcoin one-year calculation require switching APIs?
What new feature did Claude 3 Opus add, and how does it change execution behavior?
Review Questions
- Describe the two-stage flow of the system (decision step vs. code-execution-and-grounding step). What variables are used to pass information forward?
- Give one example where an API failure occurred and explain what change fixed it.
- How does the “y/n” confirmation feature alter the risk profile of executing generated code?
Key Points
1. Use a `should_use_code` decision step to run Python only when computation or external data would improve the answer.
2. Generate Python, execute it, and store results in `code_output` before asking the model for the final natural-language response.
3. Treat `code_output` as context in a RAG-like second prompt so the model grounds its answer in computed or fetched values.
4. Local inference is feasible by swapping API calls for a local LM Studio model server while keeping the same orchestration logic.
5. API reliability matters: when CoinDesk historical price fetching fails, switching to CoinGecko can restore correct calculations.
6. Claude 3 Opus can be used to extend the orchestration code quickly, such as adding a user confirmation gate before executing generated code.