Mistral AI API - Mixtral 8x7B and Mistral Medium | Tests and First Impression
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Mistral AI’s API delivers strong reasoning performance at competitive pricing—especially on tasks where GPT-3.5 often trips up—while also offering a straightforward developer setup with streaming responses and a “safe mode.” In hands-on tests, Mixtral-based models matched GPT-4 on a classic “drying shirts” logic problem, while GPT-3.5 produced a flawed, overly literal time calculation.
The platform setup is simple: a basic web interface with documentation, a Python client example, and chat-completion endpoints that support both streaming and non-streaming output. The API also includes a “safe mode” option and an embeddings model (skipped in these tests). Model selection is presented as three tiers: a tiny model powered by Mistral 7B, a small model powered by Mixtral 8x7B, and a medium model backed by an internal prototype model about which little public detail is provided.
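The chat-completion call described above can be sketched as a plain HTTP request body. This is a hedged illustration, not the video's code: the endpoint URL, the tier model names, and the `safe_prompt` flag follow Mistral's public API documentation and should be treated as assumptions.

```python
import json

# Documented chat-completion endpoint (assumption: current public API).
API_URL = "https://api.mistral.ai/v1/chat/completions"

def build_chat_request(model, user_prompt, stream=False, safe_prompt=False):
    """Build the JSON body for a Mistral chat-completion call."""
    return {
        "model": model,              # e.g. "mistral-tiny", "mistral-small", "mistral-medium"
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": stream,            # True requests a chunked streaming response
        "safe_prompt": safe_prompt,  # the "safe mode" guardrail flag
    }

payload = build_chat_request("mistral-small", "Hello", safe_prompt=True)
print(json.dumps(payload, indent=2))

# Actually sending it needs an API key, e.g.:
#   requests.post(API_URL, headers={"Authorization": f"Bearer {api_key}"}, json=payload)
```

The same body works for all three tiers; only the `model` string changes.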
Pricing was checked before testing. The medium model is listed at €7.5 per 1 million tokens (about $8.25), which works out to roughly $0.00825 per 1,000 tokens. The small model’s pricing appears similar to ChatGPT 3.5 Turbo (roughly the same ballpark after token normalization), making it feel competitive rather than premium.
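The conversion is simple arithmetic; a quick sketch (the ~1.10 USD/EUR rate is an assumption inferred from the ≈$8.25 figure above):

```python
def price_per_1k_tokens(price_per_million: float) -> float:
    """A per-1M-token price divided by 1000 gives the per-1K-token price."""
    return price_per_million / 1000

# €7.5 per 1M tokens at an assumed ~1.10 USD/EUR is about $8.25 per 1M tokens.
usd_per_million = 7.5 * 1.10
print(round(price_per_1k_tokens(usd_per_million), 5))  # 0.00825
```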
Three reasoning-and-coding style tests followed. First came the “drying shirts” problem: five shirts dry in 10 hours, then 10 shirts are hung for the next drying cycle. GPT-3.5 incorrectly treated drying time as scaling linearly with the number of shirts, outputting 20 hours for 10 shirts. In contrast, both the Mistral small (Mixtral 8x7B) and the Mistral medium returned the correct logic: if all shirts dry under the same conditions and don’t block each other, 10 shirts still take 10 hours.
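The two answers correspond to two different modeling assumptions, which are easy to make explicit in code (a toy illustration of the reasoning, not anything generated in the video):

```python
def drying_time(n_shirts, base_hours=10, base_shirts=5, parallel=True):
    """Drying time under two assumptions about how shirts dry."""
    if parallel:
        # Correct model: all shirts dry simultaneously under the same
        # conditions, so the time does not depend on the count.
        return base_hours
    # The flawed model: time scales linearly with the shirt count.
    return base_hours * n_shirts / base_shirts

print(drying_time(10))                  # 10  (Mixtral / Mistral medium answer)
print(drying_time(10, parallel=False))  # 20.0 (GPT-3.5's answer)
```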
Second was a “world model” scenario involving a ball placed into a bag with a hole larger than the ball, then shipped to a friend in London. GPT-3.5 concluded the ball would end up in the sealed box, which contradicts the hole-in-the-bag premise. GPT-4 delivered the expected result: the ball would fall out and be left behind in the sender’s office. The Mistral small model did not reach the same level of certainty or specificity, and the medium model improved the outcome, suggesting the ball likely fell out during transit—though it still wasn’t as clean as GPT-4.
Third came coding: generating a playable Snake game with a windowed UI. The small model produced incomplete code (not a full copyable program), and the resulting game behavior was flawed. The medium model generated a more complete Tkinter-based version with a working UI and score, but it still wasn’t perfect. GPT-4 produced the most reliable result, including correct gameplay behavior.
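None of the generated programs are reproduced in this summary. As an illustration of the core logic such a Snake game has to get right (growth on eating, wall and self collisions), here is a hypothetical UI-free sketch of the per-tick update, which a Tkinter loop would call and draw:

```python
def step(snake, direction, food, grid=20):
    """Advance the snake one cell on a grid x grid board.

    snake: list of (x, y) cells, head first; direction: (dx, dy).
    Returns (new_snake, ate, dead)."""
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    ate = head == food
    body = snake if ate else snake[:-1]   # grow only when food is eaten
    dead = (not (0 <= head[0] < grid and 0 <= head[1] < grid)  # wall hit
            or head in body)                                   # self collision
    return [head] + body, ate, dead

# One move right from (5, 5) onto food at (6, 5): the snake grows.
print(step([(5, 5), (4, 5)], (1, 0), (6, 5)))  # ([(6, 5), (5, 5), (4, 5)], True, False)
```

Keeping the game state separate from the Tkinter canvas like this is also what makes the logic testable, which is exactly where the small model's incomplete output fell short.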
Finally, streaming output was demonstrated across tiny, small, and medium models, with chunked text arriving quickly. Overall takeaway: Mistral’s API feels easy to integrate, and its Mixtral/Mistral models show notable strengths on reasoning tasks where GPT-3.5 can fail, while GPT-4 remains the benchmark for the toughest coding correctness. The tester ends by planning broader API comparisons beyond OpenAI, while also highlighting interest in the undocumented medium model and future benchmarks.
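When `stream` is enabled, chunks arrive as server-sent events in the OpenAI-style delta format Mistral's endpoint uses. A hedged sketch of reassembling them (the sample chunk shape below is an assumption, not output captured from the video):

```python
import json

def parse_sse_chunk(line: str):
    """Return the text delta from one server-sent-events line, or None
    for non-content lines and the terminating [DONE] sentinel."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return None
    event = json.loads(line[len("data: "):])
    return event["choices"][0]["delta"].get("content")

# Reassemble a streamed answer from sample chunks.
chunks = [
    'data: {"choices": [{"delta": {"content": "10 shirts "}}]}',
    'data: {"choices": [{"delta": {"content": "still take 10 hours."}}]}',
    'data: [DONE]',
]
text = "".join(c for c in map(parse_sse_chunk, chunks) if c)
print(text)  # 10 shirts still take 10 hours.
```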
Cornell Notes
Mistral AI’s API proved easy to use and competitively priced, with chat-completion endpoints that support streaming and a “safe mode.” In reasoning tests, Mixtral 8x7B (the “small” tier) and the Mistral “medium” tier both solved the “drying shirts” problem correctly, while GPT-3.5 gave a wrong linear time answer. On a shipping “world model” puzzle, GPT-4 delivered the most precise conclusion (ball left behind in the sender’s office), and Mistral’s medium improved but didn’t fully match GPT-4’s clarity. For coding a Tkinter Snake game, GPT-4 produced the most reliable full program; Mistral small was incomplete and medium was closer but still imperfect. Streaming worked well across tiers, making the API feel responsive for interactive use.
- Why did GPT-3.5 fail the “drying shirts” problem, and what did the Mistral models do differently?
- In the ball-and-bag shipping puzzle, what conclusion did GPT-4 reach that GPT-3.5 missed?
- How did Mistral small and Mistral medium perform on the shipping puzzle compared with GPT-4?
- What went wrong when generating the Snake game with Mistral small, and what improved with Mistral medium?
- What does the streaming test reveal about the API’s developer experience?
Review Questions
- Which assumption about drying conditions makes the correct “10 shirts take 10 hours” answer possible, and how did GPT-3.5 violate it?
- In the shipping puzzle, what role does the hole-in-the-bag play in determining where the ball ends up?
- Compare the failure modes of Mistral small vs Mistral medium on the Snake game: what was incomplete, and what became functional?
Key Points
1. Mistral AI’s API setup is straightforward, with chat-completion examples in Python and support for both streaming and non-streaming responses.
2. Model tiers are organized as tiny (Mistral 7B), small (Mixtral 8x7B), and medium (internal prototype), with the medium tier’s details largely undocumented.
3. On the drying-shirts logic problem, Mixtral 8x7B and Mistral medium returned the correct parallel-drying result (10 hours for 10 shirts), while GPT-3.5 returned an incorrect linear-scaling answer (20 hours).
4. On the ball-and-bag world-model puzzle, GPT-4 produced the most precise conclusion (ball left behind in the sender’s office), while Mistral small was less specific and Mistral medium improved but still didn’t fully match GPT-4’s clarity.
5. For coding a Tkinter Snake game, GPT-4 produced the most reliable full program; Mistral small returned incomplete code, and Mistral medium produced a closer but still imperfect version.
6. Streaming responses across the tiny, small, and medium tiers appear fast enough to support interactive use cases.
7. Pricing for the small tier appears competitive with ChatGPT 3.5 Turbo after token normalization, while the medium tier is higher but still positioned as usable for testing and iteration.