
Mixture of Models (MoM) - SHOCKING Results on Hard LLM Problems!

All About AI · 6 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Mixture-of-Models can improve performance on hard reasoning, but the aggregation strategy determines whether errors get corrected or amplified.

Briefing

Mixture-of-Models (MoM) systems can outperform single-model prompting on hard reasoning tasks, but the gains depend heavily on how the models are combined. Using three architectures—“King” (one strong model synthesizes many advisers), “DuoPoly” (two strong models debate), and “Democracy” (equal-weight voting)—the results show that structured synthesis and adversarial discussion beat simple majority voting, especially on tricky logic and coding problems.

In the “King” setup, multiple smaller LLMs (“peasants”) answer the same prompt independently. Their responses are then fed into a single higher-capability model (“king”), identified as GPT-4 turbo, which produces the final solution using the advisers’ context. The transcript reports strong performance on a classic marble-in-a-cup logic puzzle: the system concluded the marble would fall out before the cup is placed into the microwave, aligning with gravity and the lack of containment once the cup is lifted. Several advisers contributed key physical reasoning, and GPT-4 turbo synthesized those inputs into a step-by-step plan and final answer.
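The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the author's actual code: `call_model` is a hypothetical stand-in for a real LLM API call, and the model names are placeholders.

```python
# Sketch of the "King" MoM pattern: several adviser models ("peasants")
# answer independently, then one stronger model ("king") synthesizes
# their answers into a final response.

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"[{model}] answer to: {prompt}"

def king_mom(question: str, peasants: list[str], king: str) -> str:
    # Stage 1: collect independent adviser answers.
    adviser_answers = [call_model(m, question) for m in peasants]

    # Stage 2: feed all adviser answers to the king as context.
    context = "\n\n".join(
        f"Adviser {i + 1}:\n{a}" for i, a in enumerate(adviser_answers)
    )
    synthesis_prompt = (
        f"Question: {question}\n\n"
        f"Adviser answers:\n{context}\n\n"
        "Synthesize the advisers' reasoning into one final answer."
    )
    return call_model(king, synthesis_prompt)

final = king_mom("Where is the marble?", ["model-a", "model-b"], "gpt-4-turbo")
```

The key design point is that the king sees every adviser's full answer, so a single adviser's mistake can be outvoted by the synthesis step rather than propagated.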

The “King” architecture also handled a simple age reasoning question correctly: Lena ends up 27 years old, not half John’s current age, because the age gap remains three years. For a harder LeetCode-style coding task (sourced from “context 3 93,” described as not in training data), the system produced a solution that passed all tested cases after being run in the LeetCode environment. The only notable miss came from a constrained text task—writing 10 sentences ending with “apples”—where the system produced 9 correct sentences, falling short by one.

The “DuoPoly” architecture replaced single synthesis with debate. Two strong models—GPT-4 turbo and Claude 3 Opus—were prompted to challenge each other’s assumptions using the advisers’ outputs as shared context. This approach improved some areas (the age question stayed correct, and the coding task passed all cases after the solution was tested), but it failed the marble puzzle, landing on an incorrect physical conclusion. The constrained “apples” task also deteriorated, with multiple failures (2 wrong out of 10), suggesting that debate can amplify uncertainty when the underlying reasoning is brittle.

Finally, “Democracy” treated every model as equal and selected the most-voted answer. The transcript’s vote counts show why this can struggle: the marble puzzle’s winning vote still produced the wrong outcome, and the coding task’s most-voted solution came from a smaller model and only partially worked (one case passed, two failed). The age question was the clear bright spot—most models voted for 27 years old (6 votes). For the apples task, the most-voted answer was judged best among the three architectures, though it still wasn’t perfect.

Overall, the transcript frames MoM as a practical engineering pattern: more models help, but the aggregation method matters. Structured synthesis (“King”) delivered the most reliable results across the hardest mix of physics-like logic, arithmetic reasoning, and coding, while equal-weight voting (“Democracy”) and debate (“DuoPoly”) showed more failure modes.

The session also includes implementation details: looping over multiple model calls, collecting adviser outputs, and using system prompts to steer synthesis, debate, or voting. A small UI tracks progress and task timing, and the author shares that the code will be posted to a community GitHub for members, alongside a community Discord. The closing section highlights an unrelated open-source cybersecurity incident-response testing tool that can run on local models and uses LLMs to generate tailored scenarios.

Cornell Notes

Mixture-of-Models (MoM) combines multiple LLM outputs to solve hard tasks, but performance depends on the aggregation strategy. In the “King” design, many smaller models (“peasants”) generate independent answers, and GPT-4 turbo synthesizes them into a final response; this achieved the best overall results, including a correct marble physics puzzle, a correct age puzzle (Lena is 27), and a LeetCode solution that passed all tested cases. “DuoPoly” uses GPT-4 turbo and Claude 3 Opus to debate using adviser context; it fixed the age and coding tasks but failed the marble puzzle and performed worse on the constrained “apples” writing task. “Democracy” uses equal-weight voting; it succeeded on the age question but produced wrong or partially correct answers on the marble and coding tasks. The takeaway: synthesis beats simple voting, and debate can help—or hurt—depending on the problem’s failure modes.

How does the “King” MoM architecture work, and why does it matter for hard reasoning?

The “King” setup runs a set of models on the same user query to produce multiple adviser answers (“peasants”). Those adviser responses are then provided as context to GPT-4 turbo, which generates the final response by combining the advisers’ insights with the original problem. This matters because it turns many independent attempts into a single consolidated plan, reducing the chance that the final answer is based on one model’s isolated mistake. In the marble puzzle, advisers contributed gravity/containment reasoning, and GPT-4 turbo synthesized that into the correct conclusion that the marble would fall out before the cup is placed into the microwave.

What went wrong with “DuoPoly” on the marble puzzle?

In “DuoPoly,” GPT-4 turbo and Claude 3 Opus debate using adviser context, pushing back on each other’s assumptions and trying to converge on the best solution. The transcript reports that this debate produced an incorrect marble outcome: the system leaned toward the marble remaining inside the cup, which contradicts the physical setup (the cup is turned upside down, so the marble rests on the surface beneath it, and lifting the cup to move it into the microwave leaves the marble behind). The age puzzle remained correct, but the marble failure suggests that debate can converge on a wrong physical model when both sides are influenced by misleading intermediate reasoning.
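The debate loop can be sketched as an alternating multi-turn exchange followed by a summarization call. Again, `call_model` is a hypothetical placeholder and the round count is an assumption; the video does not specify the exact prompt wording.

```python
# Sketch of the "DuoPoly" debate pattern: two strong models alternate
# turns challenging each other, then one summarizes the debate.

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"[{model}] responds to: {prompt[:40]}"

def duopoly_debate(question: str, adviser_context: str,
                   model_a: str, model_b: str, rounds: int = 2) -> str:
    transcript = [
        f"Context from advisers:\n{adviser_context}",
        f"Question: {question}",
    ]
    speakers = [model_a, model_b]
    for turn in range(rounds * 2):
        speaker = speakers[turn % 2]  # alternate between the two models
        prompt = "\n".join(transcript) + \
            "\nChallenge the previous answer's assumptions."
        transcript.append(call_model(speaker, prompt))
    # Final step: summarize the debate into a single answer.
    return call_model(model_a, "Summarize the debate:\n" + "\n".join(transcript))

result = duopoly_debate("Where is the marble?", "advisers note gravity",
                        "gpt-4-turbo", "claude-3-opus")
```

Note that the entire transcript is re-sent each turn, so both models always argue against the latest state of the debate rather than their own cached position.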

Why is the age puzzle a useful benchmark across architectures?

The age puzzle is deterministic and checks whether the system preserves the constant age gap. John and Lena’s relationship is given as: when John was 6, Lena was half John’s age, implying Lena was 3 then. The transcript notes that when John is now 30, Lena is 27—because the age difference stays 3 years. All three architectures ultimately produced the correct age answer, with “Democracy” showing the strongest agreement (most models voted for 27 years old). That makes it a good sanity check for whether aggregation preserves arithmetic invariants.

How did the LeetCode coding task outcomes differ between architectures?

For the coding task, “King” produced a solution that passed all tested cases in the LeetCode environment (cases 1, 2, and 3 accepted). “DuoPoly” also reached a correct solution after reviewing the chat log and re-running the code, with all cases accepted. “Democracy,” however, selected the most-voted solution, and that choice only partially worked: one case passed while two failed. This highlights a key MoM risk: majority vote can elevate a plausible-looking but incorrect approach, especially when models vary in coding reliability.

What does the “apples” constrained writing task reveal about aggregation?

The task required writing 10 sentences ending with the word “apples.” “King” produced 9 correct sentences (one sentence missing), which was a near-miss. “DuoPoly” performed worse, with 2 wrong out of 10. “Democracy” produced the best among the three in the transcript’s comparison: the winning answer had 9 occurrences of “apples” and 1 non-matching sentence. This suggests that constrained-output tasks can be sensitive to how models coordinate—synthesis can be close, debate may introduce more formatting drift, and voting can help pick the most consistent pattern even if it’s not perfect.
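Because the constraint is mechanical, scoring it does not need an LLM at all. A small checker like the one below (my own sketch, not from the video) could grade each architecture's output automatically:

```python
import re

def count_valid_sentences(text: str, word: str = "apples") -> int:
    # Split on sentence-final punctuation, then count sentences whose
    # last word is exactly `word` (case-insensitive).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    valid = 0
    for s in sentences:
        if s.split()[-1].lower() == word:
            valid += 1
    return valid

sample = "I like apples. Bananas are yellow. She bought apples."
score = count_valid_sentences(sample)  # 2 of 3 sentences end with "apples"
```

Running such a checker over the candidates before voting would turn "Democracy" into weighted selection on this task, since the constraint is objectively verifiable.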

What implementation pattern enables these MoM architectures in practice?

All three designs rely on looping over multiple models to generate candidate outputs, then applying a second-stage aggregation step. “King” loops to collect adviser answers, then makes a final GPT-4 turbo call that uses those adviser outputs as context. “DuoPoly” loops to run a multi-turn conversation between GPT-4 turbo and Claude 3 Opus, using adviser insights as shared background, then summarizes the debate into a final answer. “Democracy” loops twice: first to generate candidate answers from each model, then to run a voting prompt for each model over the set of candidate answers, followed by a final counting step to select the most-voted option.
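The "Democracy" double loop plus vote counting can be sketched as follows. The stubbed `call_model` is a hypothetical placeholder (here it deterministically echoes the first line of its prompt, purely so the sketch runs); a real implementation would parse each model's vote from free-form text.

```python
from collections import Counter

def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return prompt.splitlines()[0]

def democracy(question: str, models: list[str]) -> str:
    # Loop 1: every model proposes a candidate answer.
    candidates = [call_model(m, question) for m in models]

    # Loop 2: every model votes over the full candidate set.
    ballot = "\n".join(candidates)
    votes = [call_model(m, ballot) for m in models]

    # Final step: tally the votes and return the most-voted candidate.
    return Counter(votes).most_common(1)[0][0]

winner = democracy("Which answer is best?", ["model-a", "model-b", "model-c"])
```

The two loops mean Democracy costs roughly twice as many API calls as King for the same model pool, without the error-correction that a strong synthesizer provides.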

Review Questions

  1. Which aggregation method produced the most reliable results across all four test problems, and what evidence from the marble, age, and coding tasks supports that conclusion?
  2. In “DuoPoly,” how do the debate prompts change the behavior compared with “King,” and why might that lead to a marble-puzzle failure even when the age and coding tasks succeed?
  3. Why can “Democracy” fail on coding problems even when many models vote for the same answer? Identify the transcript’s observed failure mode and relate it to majority-vote risks.

Key Points

  1. Mixture-of-Models can improve performance on hard reasoning, but the aggregation strategy determines whether errors get corrected or amplified.
  2. The “King” approach—many independent adviser answers synthesized by GPT-4 turbo—delivered the strongest overall results across physics-like logic, arithmetic reasoning, and coding.
  3. “DuoPoly” debate between GPT-4 turbo and Claude 3 Opus can help on some tasks (age and coding) but may still converge on incorrect physical reasoning for the marble puzzle.
  4. “Democracy” equal-weight voting succeeded on the age puzzle but produced wrong or partially correct outcomes on the marble and LeetCode tasks, showing majority vote is not a guarantee of correctness.
  5. Constrained-output tasks (10 sentences ending with “apples”) are sensitive to aggregation: synthesis (“King”) was closest, debate (“DuoPoly”) drifted more, and voting (“Democracy”) selected the most consistent pattern among candidates.
  6. A practical MoM implementation pattern is two-stage prompting: generate multiple candidates with a loop, then aggregate with synthesis, debate summarization, or vote counting.
  7. The transcript’s test set demonstrates that MoM benefits are task-dependent: deterministic arithmetic can be robust, while physical reasoning and coding can expose aggregation weaknesses.

Highlights

“King” (adviser synthesis by GPT-4 turbo) solved the marble puzzle correctly by combining gravity/containment insights into a single step-by-step plan.
“DuoPoly” debate produced an incorrect marble conclusion even though the age and coding tasks were ultimately correct.
“Democracy” voting nailed the age question (most votes for 27) but failed the marble and partially failed the coding task, illustrating the limits of majority vote.
The constrained “apples” task showed near-miss performance for “King” (9/10) and worse drift for “DuoPoly,” while “Democracy” selected the best candidate among its votes.
