Autonomous Open Source LLM Evaluator (Ollama) - Full Guide
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A lightweight “autonomous evaluator” workflow can automatically compare multiple open-source LLMs on the same task, then use a stronger model (GPT-4 Turbo) to judge which model performed best. The practical payoff is simple: instead of guessing which model to use for a specific problem, the system generates candidate solutions from a list of models, collects their outputs, and runs an evaluation pass that selects a winner based on correctness and reasoning quality.
The setup starts with a user-defined problem stored as text and a configurable list of models (examples mentioned include Mistral, “Cod STW f3e,” Llama 3, Gemma 7B, MagicCoder, and Qwen 14B). For each model in the list, the workflow prompts the model to create a step-by-step plan, then asks it to solve the problem. Each model’s answer is saved, producing a set of candidate responses. Once all models have responded, GPT-4 Turbo is prompted to evaluate each candidate against criteria the user can tune: whether the answer is correct, how clear the plan and reasoning are, and whether the solution matches the expected result. The evaluator then selects the best-performing model and prints both the winning model’s answer and the evaluation summary.
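A minimal sketch of this generation loop, assuming illustrative prompts and model names (the video's exact wording is not shown). The chat function is injected so the loop itself is independent of any backend; with the `ollama` Python client installed, the real call is shown in the trailing comment:

```python
# Sketch of the per-model plan-then-solve loop. Prompts and model names
# are assumptions, not the exact strings used in the video.
from typing import Callable, Dict, List


def collect_candidates(models: List[str], problem: str,
                       chat: Callable[[str, str], str]) -> Dict[str, dict]:
    """For each model: request a step-by-step plan, then an answer that
    follows that plan. Every response is stored for the evaluation pass."""
    results: Dict[str, dict] = {}
    for model in models:
        plan = chat(model, f"Create a step-by-step plan to solve:\n{problem}")
        answer = chat(model, f"Using this plan:\n{plan}\nNow solve:\n{problem}")
        results[model] = {"plan": plan, "answer": answer}
    return results


# With a local Ollama server, `chat` could be implemented as:
#   import ollama
#   def chat(model, prompt):
#       reply = ollama.chat(model=model,
#                           messages=[{"role": "user", "content": prompt}])
#       return reply["message"]["content"]
```

Injecting `chat` also makes the loop easy to dry-run with a stub before pointing it at real models.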
A key detail is that the evaluator is not limited to plain text tasks. A second “code version” extends the same idea to programming problems by allowing the system to execute generated code and capture runtime results for evaluation. In that mode, the workflow can assess not just whether the model’s explanation sounds right, but whether the code actually runs and produces the expected output.
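The execution step of the code version can be sketched as follows; running each candidate's code in a subprocess captures stdout, stderr, and the exit status for the evaluator. The timeout value and the lack of sandboxing here are assumptions, not details from the video:

```python
# Sketch of the "code version" execution step: run generated Python in a
# subprocess and capture its runtime results. Timeout is an assumed default;
# a real deployment would want stronger isolation for untrusted code.
import subprocess
import sys


def run_generated_code(code: str, timeout: int = 10) -> dict:
    """Execute candidate code and return its output, errors, and exit status."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout)
        return {"stdout": proc.stdout,
                "stderr": proc.stderr,
                "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "returncode": -1}
```

A nonzero `returncode` (as from the syntax error one model produced in the demo) or a mismatch between `stdout` and the expected output can then be fed straight into the evaluation prompt.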
The transcript demonstrates the text workflow with a logic question: “Kaye has three brothers; each of her brothers has two sisters. How many sisters does Kaye have?” Several models produce incorrect interpretations (for example, one claims Kaye has two sisters, another concludes three). GPT-4 Turbo’s evaluation identifies Mistral as the best performer because it correctly concludes Kaye has one sister and provides reasoning that matches the underlying relationships.
The code workflow is tested with a Python task: sort a list of book numbers from low to high using bubble sort. Multiple models generate code; at least one produces a syntax error, while others generate working implementations. GPT-4 Turbo’s judgment turns up an unexpected winner: F3, which produced a detailed bubble sort explanation, correct step-by-step algorithm description, and a Python implementation that yields the correct sorted output. Another model, MagicCoder, also solves the task correctly, but GPT-4 Turbo still ranks F3 higher for this specific evaluation.
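The kind of implementation the models were asked to produce looks like the following; the actual list of book numbers from the video is not shown, so the input here is illustrative:

```python
# Plain bubble sort over a list of numbers: repeatedly sweep the list,
# swapping adjacent out-of-order pairs until everything is in order.
def bubble_sort(numbers: list) -> list:
    items = list(numbers)  # sort a copy, leave the input unchanged
    n = len(items)
    for i in range(n):
        for j in range(n - 1 - i):  # the last i items are already in place
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items


print(bubble_sort([978, 101, 42, 530]))  # prints [42, 101, 530, 978]
```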
Overall, the approach functions like an automated model “tryout” harness: generate solutions across models, verify correctness (and optionally execute code), then let GPT-4 Turbo act as the adjudicator to pick the best model for that exact task type.
Cornell Notes
The workflow builds an autonomous evaluator that compares several open-source LLMs on the same problem and then uses GPT-4 Turbo to judge which model performed best. It runs a loop: each candidate model generates a step-by-step plan and an answer, and the system stores every response. After collecting outputs, GPT-4 Turbo evaluates correctness and reasoning quality and selects a winner, optionally printing the best answer and an evaluation report. A code-focused variant can execute generated Python code and evaluate based on runtime results, not just textual plausibility. Demonstrations include a logic question where Mistral is ranked best, and a bubble-sort coding task where F3 is selected despite another model also producing correct code.
- How does the evaluator decide which model “wins” on a task?
- What’s the difference between the text version and the code version?
- Why did Mistral win the logic-question example?
- What made the bubble-sort example surprising?
- What role does “plan generation” play in the workflow?
Review Questions
- In the text workflow, what inputs does GPT-4 Turbo receive to evaluate each model’s performance?
- How does executing code change the evaluation compared with judging only the written answer?
- Give one example from the transcript where a model produced an incorrect result, and explain why the winning model was preferred.
Key Points
1. The system runs a loop over a configurable list of LLMs, prompting each to generate a plan and a solution for the same problem.
2. All candidate answers are stored, then GPT-4 Turbo is used as an adjudicator to evaluate correctness and reasoning quality.
3. The evaluator can be tuned with parameters that affect how GPT-4 Turbo scores responses.
4. A code-focused variant can execute generated Python code and evaluate based on runtime output and errors.
5. In the logic example about Kaye’s sisters, GPT-4 Turbo selected Mistral as the best model because it produced the correct count with accurate reasoning.
6. In the bubble-sort example, F3 was ranked best even though another model also produced correct sorted output, showing that evaluation criteria can favor explanation depth and implementation details.
7. The approach is designed for practical model selection when the best model for a given task is uncertain.
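The adjudication step can be sketched as assembling every stored plan and answer into one prompt for the judge model. The criteria wording below is illustrative, not the exact prompt from the video; the commented OpenAI SDK call shows how the verdict would be requested:

```python
# Sketch of the adjudication step: build a single evaluation prompt from
# all stored candidates. The criteria text is an assumption.
def build_evaluation_prompt(problem: str, candidates: dict) -> str:
    """Combine the problem and every model's plan/answer into one prompt."""
    sections = [
        f"Problem:\n{problem}\n",
        ("Evaluate each model's response for correctness, clarity of the "
         "plan and reasoning, and whether the solution matches the expected "
         "result. Name the single best model and explain why.\n"),
    ]
    for model, result in candidates.items():
        sections.append(
            f"--- Model: {model} ---\n"
            f"Plan:\n{result['plan']}\n"
            f"Answer:\n{result['answer']}\n")
    return "\n".join(sections)


# With the OpenAI SDK, the judgment call would look roughly like:
#   from openai import OpenAI
#   verdict = OpenAI().chat.completions.create(
#       model="gpt-4-turbo",
#       messages=[{"role": "user",
#                  "content": build_evaluation_prompt(problem, candidates)}],
#   ).choices[0].message.content
```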