Autonomous Open Source LLM Evaluator (Ollama) - Full Guide
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A lightweight “autonomous evaluator” workflow can automatically compare multiple open-source LLMs on the same task, then use a stronger model (GPT-4 Turbo) to judge which model performed best. The practical payoff is simple: instead of guessing which model to use for a specific problem, the system generates candidate solutions from a list of models, collects their outputs, and runs an evaluation pass that selects a winner based on correctness and reasoning quality.
The setup starts with a user-defined problem stored as text and a configurable list of models (examples mentioned include Mistral, “Cod STW f3e,” Llama 3, Gemma 7B, MagicCoder, and Qwen 14B). For each model in the list, the workflow prompts the model to create a step-by-step plan, then asks it to solve the problem. Each model’s answer is saved, producing a set of candidate responses. Once all models have responded, GPT-4 Turbo is prompted to evaluate each candidate against criteria the user can tune: whether the answer is correct, how clear the plan and reasoning are, and whether the solution matches the expected result. The evaluator then selects the best-performing model and prints both the winning model’s answer and the evaluation summary.
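A minimal sketch of this generation loop, assuming illustrative prompts and model names (the video's exact wording is not shown). The chat function is injected so the loop itself is independent of any backend; with the `ollama` Python client installed, the real call is shown in the trailing comment:

```python
# Sketch of the per-model plan-then-solve loop. Prompts and model names
# are assumptions, not the exact strings used in the video.
from typing import Callable, Dict, List


def collect_candidates(models: List[str], problem: str,
                       chat: Callable[[str, str], str]) -> Dict[str, dict]:
    """For each model: request a step-by-step plan, then an answer that
    follows that plan. Every response is stored for the evaluation pass."""
    results: Dict[str, dict] = {}
    for model in models:
        plan = chat(model, f"Create a step-by-step plan to solve:\n{problem}")
        answer = chat(model, f"Using this plan:\n{plan}\nNow solve:\n{problem}")
        results[model] = {"plan": plan, "answer": answer}
    return results


# With a local Ollama server, `chat` could be implemented as:
#   import ollama
#   def chat(model, prompt):
#       reply = ollama.chat(model=model,
#                           messages=[{"role": "user", "content": prompt}])
#       return reply["message"]["content"]
```

Injecting `chat` also makes the loop easy to dry-run with a stub before pointing it at real models.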
A key detail is that the evaluator is not limited to plain text tasks. A second “code version” extends the same idea to programming problems by allowing the system to execute generated code and capture runtime results for evaluation. In that mode, the workflow can assess not just whether the model’s explanation sounds right, but whether the code actually runs and produces the expected output.
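The execution step of the code version can be sketched as follows; running each candidate's code in a subprocess captures stdout, stderr, and the exit status for the evaluator. The timeout value and the lack of sandboxing here are assumptions, not details from the video:

```python
# Sketch of the "code version" execution step: run generated Python in a
# subprocess and capture its runtime results. Timeout is an assumed default;
# a real deployment would want stronger isolation for untrusted code.
import subprocess
import sys


def run_generated_code(code: str, timeout: int = 10) -> dict:
    """Execute candidate code and return its output, errors, and exit status."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout)
        return {"stdout": proc.stdout,
                "stderr": proc.stderr,
                "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "returncode": -1}
```

A nonzero `returncode` (as from the syntax error one model produced in the demo) or a mismatch between `stdout` and the expected output can then be fed straight into the evaluation prompt.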
The transcript demonstrates the text workflow with a logic question: “Kaye has three brothers; each of her brothers has two sisters. How many sisters does Kaye have?” Several models produce incorrect interpretations (for example, one claims Kaye has two sisters, another concludes three). GPT-4 Turbo’s evaluation identifies Mistral as the best performer because it correctly concludes Kaye has one sister and provides reasoning that matches the underlying relationships.
The code workflow is tested with a Python task: sort a list of book numbers from low to high using bubble sort. Multiple models generate code; at least one produces a syntax error, while others generate working implementations. GPT-4 Turbo’s judgment turns up an unexpected winner: F3, which produced a detailed bubble sort explanation, correct step-by-step algorithm description, and a Python implementation that yields the correct sorted output. Another model, MagicCoder, also solves the task correctly, but GPT-4 Turbo still ranks F3 higher for this specific evaluation.
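The kind of implementation the models were asked to produce looks like the following; the actual list of book numbers from the video is not shown, so the input here is illustrative:

```python
# Plain bubble sort over a list of numbers: repeatedly sweep the list,
# swapping adjacent out-of-order pairs until everything is in order.
def bubble_sort(numbers: list) -> list:
    items = list(numbers)  # sort a copy, leave the input unchanged
    n = len(items)
    for i in range(n):
        for j in range(n - 1 - i):  # the last i items are already in place
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items


print(bubble_sort([978, 101, 42, 530]))  # prints [42, 101, 530, 978]
```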
Overall, the approach functions like an automated model “tryout” harness: generate solutions across models, verify correctness (and optionally execute code), then let GPT-4 Turbo act as the adjudicator to pick the best model for that exact task type.
Cornell Notes
The workflow builds an autonomous evaluator that compares several open-source LLMs on the same problem and then uses GPT-4 Turbo to judge which model performed best. It runs a loop: each candidate model generates a step-by-step plan and an answer, and the system stores every response. After collecting outputs, GPT-4 Turbo evaluates correctness and reasoning quality and selects a winner, optionally printing the best answer and an evaluation report. A code-focused variant can execute generated Python code and evaluate based on runtime results, not just textual plausibility. Demonstrations include a logic question where Mistral is ranked best, and a bubble-sort coding task where F3 is selected despite another model also producing correct code.
- How does the evaluator decide which model “wins” on a task?
- What’s the difference between the text version and the code version?
- Why did Mistral win the logic-question example?
- What made the bubble-sort example surprising?
- What role does “plan generation” play in the workflow?
Review Questions
- In the text workflow, what inputs does GPT-4 Turbo receive to evaluate each model’s performance?
- How does executing code change the evaluation compared with judging only the written answer?
- Give one example from the transcript where a model produced an incorrect result, and explain why the winning model was preferred.
Key Points
1. The system runs a loop over a configurable list of LLMs, prompting each to generate a plan and a solution for the same problem.
2. All candidate answers are stored, then GPT-4 Turbo is used as an adjudicator to evaluate correctness and reasoning quality.
3. The evaluator can be tuned with parameters that affect how GPT-4 Turbo scores responses.
4. A code-focused variant can execute generated Python code and evaluate based on runtime output and errors.
5. In the logic example about Kaye’s sisters, GPT-4 Turbo selected Mistral as the best model because it produced the correct count with accurate reasoning.
6. In the bubble-sort example, F3 was ranked best even though another model also produced correct sorted output, showing that evaluation criteria can favor explanation depth and implementation details.
7. The approach is designed for practical model selection when the best model for a given task is uncertain.
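The adjudication step can be sketched as assembling every stored plan and answer into one prompt for the judge model. The criteria wording below is illustrative, not the exact prompt from the video; the commented OpenAI SDK call shows how the verdict would be requested:

```python
# Sketch of the adjudication step: build a single evaluation prompt from
# all stored candidates. The criteria text is an assumption.
def build_evaluation_prompt(problem: str, candidates: dict) -> str:
    """Combine the problem and every model's plan/answer into one prompt."""
    sections = [
        f"Problem:\n{problem}\n",
        ("Evaluate each model's response for correctness, clarity of the "
         "plan and reasoning, and whether the solution matches the expected "
         "result. Name the single best model and explain why.\n"),
    ]
    for model, result in candidates.items():
        sections.append(
            f"--- Model: {model} ---\n"
            f"Plan:\n{result['plan']}\n"
            f"Answer:\n{result['answer']}\n")
    return "\n".join(sections)


# With the OpenAI SDK, the judgment call would look roughly like:
#   from openai import OpenAI
#   verdict = OpenAI().chat.completions.create(
#       model="gpt-4-turbo",
#       messages=[{"role": "user",
#                  "content": build_evaluation_prompt(problem, candidates)}],
#   ).choices[0].message.content
```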