Use THIS Today to Make Your Local LLM Smarter + Claude 3 Opus Tips
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical “local LLM smarter” workflow hinges on one decision: only run Python when a question truly needs computation, then feed the code’s output back into the model as retrieval-like context. The result is a system that can answer ordinary questions normally, but can also calculate, fetch live data, and even self-correct when the first attempt fails—without forcing heavy tooling on every prompt.
The core mechanism starts with a classifier step called `should_use_code`. A system message instructs the model to judge whether Python execution would improve the answer; it returns “Yes” for calculation-heavy queries and “No” when natural language is sufficient. When “Yes” triggers, the system generates Python code, executes it in a terminal, stores the result in a `code_output` variable, and then prompts the model again using both the original user question and the `code_output` as context. The author frames this second stage as similar to RAG: the model doesn’t just guess—it grounds its final response in the computed or fetched output.
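The two-stage flow can be sketched as follows. This is a minimal sketch, not the video's exact code: the `chat(system, user)` callable, the prompt wording, and the helper names are all illustrative assumptions.

```python
# Sketch of the decision -> execute -> re-prompt flow described above.
# `chat(system, user)` stands in for any chat-completion call; prompt
# wording and helper names are assumptions, not the video's code.
import subprocess
import sys

def should_use_code(chat, question):
    """Classifier step: ask the model whether Python would improve the answer."""
    verdict = chat(
        system="Answer only 'Yes' or 'No': would running Python code "
               "(for computation or external data) improve this answer?",
        user=question,
    )
    return verdict.strip().lower().startswith("yes")

def run_python(code):
    """Execute generated code in a subprocess and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.strip()

def answer(chat, question):
    if not should_use_code(chat, question):
        return chat(system="Answer the question directly.", user=question)
    code = chat(
        system="Write Python that prints the value needed to answer.",
        user=question,
    )
    code_output = run_python(code)
    # RAG-like grounding: the final prompt sees the question AND the output.
    return chat(
        system="Use the code output as context to answer the question.",
        user=f"Question: {question}\nCode output: {code_output}",
    )
```

The key design point is that the final model call never has to guess: the computed `code_output` is injected as context, exactly as retrieved documents would be in a RAG pipeline.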
The workflow is first tested with GPT-3.5 via the OpenAI API, using a custom arithmetic check (the transcript mentions a target of 36 apples). An initial run produces an incorrect value (32), but the system then generates corrective code and reaches the correct total of 36 after the self-correction loop.
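That 32 → 36 recovery suggests a retry loop around execution. A hedged, self-contained sketch of such a loop, where the helper names, prompts, and the `check` callback are all illustrative assumptions:

```python
# Hedged sketch of the self-correction loop (32 -> 36 in the transcript):
# if the executed result fails a check, the failing code and output are
# fed back so the model can produce corrected code, and the loop repeats.
import subprocess
import sys

def _run(code):
    """Run generated Python and return its stdout (stderr ignored here)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout.strip()

def answer_with_retry(chat, question, check, max_tries=3):
    code = chat(system="Write Python that prints the answer.", user=question)
    output = ""
    for _ in range(max_tries):
        output = _run(code)
        if check(output):
            return output
        # Feed the failing code and its output back for correction.
        code = chat(
            system="The previous code gave a wrong result. Write fixed Python.",
            user=f"Question: {question}\nCode:\n{code}\nOutput: {output}",
        )
    return output
```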
Next, the same logic is swapped from OpenAI calls to a local model served through LM Studio. Using a Mistral 7B OpenHermes 2.5 setup, the system handles live queries that require computation or external data. For “What is the price of Bitcoin today,” it installs needed packages, runs code, and returns a numeric price (the transcript shows 72287) and a natural-language confirmation. For “What is the weather forecast for London this upcoming weekend,” it uses an OpenWeatherMap API key and returns a forecast with specific fields like temperature and humidity (the transcript notes a Fahrenheit-looking temperature and includes wind speed/humidity details).
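Swapping from OpenAI to the local model is mostly an endpoint change, because LM Studio exposes an OpenAI-compatible chat-completions server (port 1234 by default). A minimal stdlib sketch; the model identifier is an assumption to check against what your LM Studio server tab reports:

```python
# Minimal stdlib client for LM Studio's OpenAI-compatible server.
# Port 1234 is LM Studio's default; the model name is an assumption.
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(system, user, model="openhermes-2.5-mistral-7b"):
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.2,
    }

def chat(system, user):
    """POST to the local server and return the assistant message text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(system, user)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The rest of the orchestration (the `should_use_code` decision, code execution, and the grounding prompt) stays unchanged; only the transport differs.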
A more complex finance question—buying one Bitcoin a year ago, then computing today’s value and percentage gain—tests API reliability. A CoinDesk-based attempt fails with “unable to fetch historical price,” prompting a switch back to GPT-3.5 and the CoinGecko API. That change produces a plausible result: one-year-ago cost around $24,178 and a gain of about 195%, which the author cross-checks against a one-year chart.
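The fixed finance step reduces to fetching two prices and computing a percentage gain. A hedged sketch: the CoinGecko endpoint shape, its `dd-mm-yyyy` date format, the example date, and the response field path are assumptions to verify against CoinGecko's current API documentation.

```python
# Percentage-gain arithmetic plus the kind of CoinGecko history URL the
# fix relied on. Endpoint shape and response fields are assumptions.
import json
import urllib.request

def history_url(coin_id, date_ddmmyyyy):
    """CoinGecko historical snapshot URL, e.g. date_ddmmyyyy='14-03-2023'."""
    return (f"https://api.coingecko.com/api/v3/coins/{coin_id}/history"
            f"?date={date_ddmmyyyy}")

def percent_gain(past_price, current_price):
    """Gain of holding from past_price to current_price, in percent."""
    return (current_price - past_price) / past_price * 100.0

def fetch_past_usd_price(coin_id, date_ddmmyyyy):
    """Live network call -- requires internet; field path is an assumption."""
    with urllib.request.urlopen(history_url(coin_id, date_ddmmyyyy)) as resp:
        data = json.load(resp)
    return data["market_data"]["current_price"]["usd"]
```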
Finally, Claude 3 Opus is used to add a new safety/UX feature to the codebase: before executing generated Python, the system asks the user to confirm with “y” or “n.” When “n” is entered, it generates alternative code and returns to the confirmation step. The author reports that implementing this feature took only a few minutes and that the updated flow still supports API calls and produces correct outputs after confirmation.
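The confirmation gate described is a small loop wrapped around the execution step. A sketch with an injectable `ask` callable (defaulting to `input`) so the loop can be exercised without an interactive prompt; `chat` and `run` are illustrative stand-ins for the orchestration's model call and code runner:

```python
# Sketch of the y/n confirmation gate: generated code runs only after the
# user types 'y'; on 'n' the model is asked for alternative code and the
# confirmation prompt repeats. `chat` and `run` are illustrative names.
def confirm_and_run(chat, run, code, ask=input, max_alternatives=3):
    for _ in range(max_alternatives):
        print("Proposed code:\n" + code)
        choice = ask("Execute this code? [y/n] ").strip().lower()
        if choice == "y":
            return run(code)
        # 'n' (or anything else): regenerate and return to confirmation.
        code = chat(
            system="The user rejected this code. Generate an alternative.",
            user=code,
        )
    return None  # user never approved within the allowed attempts
```

Gating execution behind an explicit approval is a meaningful safety improvement, since model-generated code otherwise runs unreviewed on the host machine.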
Cornell Notes
The system improves answers by running Python only when it’s beneficial. It first uses a `should_use_code` step to decide whether a question needs computation; if not, it answers directly. If yes, it generates Python, executes it, saves the result as `code_output`, then prompts the model again using the original question plus that output as context (a RAG-like grounding step). Tests show arithmetic self-correction (32 → 36 apples), live data retrieval for Bitcoin price and London weather using APIs, and a finance calculation that required switching from CoinDesk to CoinGecko after an API fetch failed. Claude 3 Opus is then used to add a user-confirmation gate (“y/n”) before code execution and to regenerate code when permission is denied.
How does the workflow decide whether to generate and run Python at all?
What happens after Python is executed successfully?
What evidence shows the system can correct mistakes?
How does the system work with local models?
Why did the Bitcoin one-year calculation require switching APIs?
What new feature did Claude 3 Opus add, and how does it change execution behavior?
Review Questions
- Describe the two-stage flow of the system (decision step vs. code-execution-and-grounding step). What variables are used to pass information forward?
- Give one example where an API failure occurred and explain what change fixed it.
- How does the “y/n” confirmation feature alter the risk profile of executing generated code?
Key Points
1. Use a `should_use_code` decision step to run Python only when computation or external data would improve the answer.
2. Generate Python, execute it, and store results in `code_output` before asking the model for the final natural-language response.
3. Treat `code_output` as context in a RAG-like second prompt so the model grounds its answer in computed or fetched values.
4. Local inference is feasible by swapping API calls for a local LM Studio model server while keeping the same orchestration logic.
5. API reliability matters: when CoinDesk historical price fetching fails, switching to CoinGecko can restore correct calculations.
6. Claude 3 Opus can be used to extend the orchestration code quickly, such as adding a user confirmation gate before executing generated code.