Mistral AI API - Mixtral 8x7B and Mistral Medium | Tests and First Impression
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Mistral AI’s API delivers strong reasoning performance at competitive pricing—especially on tasks where GPT-3.5 often trips up—while also offering a straightforward developer setup with streaming responses and a “safe mode.” In hands-on tests, Mixtral-based models matched GPT-4 on a classic “drying shirts” logic problem, while GPT-3.5 produced a flawed, overly literal time calculation.
The platform setup is simple: a basic web interface with documentation, a Python client example, and chat-completion endpoints that support both streaming and non-streaming output. The API also includes a “safe mode” option and an embeddings model (skipped in these tests). Model selection is presented as three tiers: a tiny model powered by Mistral 7B, a small model powered by Mixtral 8x7B, and a medium model backed by an internal prototype model about which little public detail is provided.
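The chat-completion call described above can be sketched as a plain HTTP request body. This is a hedged illustration, not the video's code: the endpoint URL, the tier model names, and the `safe_prompt` flag follow Mistral's public API documentation and should be treated as assumptions.

```python
import json

# Documented chat-completion endpoint (assumption: current public API).
API_URL = "https://api.mistral.ai/v1/chat/completions"

def build_chat_request(model, user_prompt, stream=False, safe_prompt=False):
    """Build the JSON body for a Mistral chat-completion call."""
    return {
        "model": model,              # e.g. "mistral-tiny", "mistral-small", "mistral-medium"
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": stream,            # True requests a chunked streaming response
        "safe_prompt": safe_prompt,  # the "safe mode" guardrail flag
    }

payload = build_chat_request("mistral-small", "Hello", safe_prompt=True)
print(json.dumps(payload, indent=2))

# Actually sending it needs an API key, e.g.:
#   requests.post(API_URL, headers={"Authorization": f"Bearer {api_key}"}, json=payload)
```

The same body works for all three tiers; only the `model` string changes.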
Pricing was checked before testing. The medium model is listed at €7.5 per 1 million tokens (about $8.25), which works out to roughly $0.00825 per 1,000 tokens. The small model’s pricing appears similar to ChatGPT 3.5 Turbo (roughly the same ballpark after token normalization), making it feel competitive rather than premium.
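The conversion is simple arithmetic; a quick sketch (the ~1.10 USD/EUR rate is an assumption inferred from the ≈$8.25 figure above):

```python
def price_per_1k_tokens(price_per_million: float) -> float:
    """A per-1M-token price divided by 1000 gives the per-1K-token price."""
    return price_per_million / 1000

# €7.5 per 1M tokens at an assumed ~1.10 USD/EUR is about $8.25 per 1M tokens.
usd_per_million = 7.5 * 1.10
print(round(price_per_1k_tokens(usd_per_million), 5))  # 0.00825
```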
Three reasoning-and-coding style tests followed. First came the “drying shirts” problem: five shirts dry in 10 hours, then 10 shirts are hung for the next drying cycle. GPT-3.5 incorrectly treated drying time as scaling linearly with the number of shirts, outputting 20 hours for 10 shirts. In contrast, both the Mistral small (Mixtral 8x7B) and the Mistral medium returned the correct logic: if all shirts dry under the same conditions and don’t block each other, 10 shirts still take 10 hours.
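The two answers correspond to two different modeling assumptions, which are easy to make explicit in code (a toy illustration of the reasoning, not anything generated in the video):

```python
def drying_time(n_shirts, base_hours=10, base_shirts=5, parallel=True):
    """Drying time under two assumptions about how shirts dry."""
    if parallel:
        # Correct model: all shirts dry simultaneously under the same
        # conditions, so the time does not depend on the count.
        return base_hours
    # The flawed model: time scales linearly with the shirt count.
    return base_hours * n_shirts / base_shirts

print(drying_time(10))                  # 10  (Mixtral / Mistral medium answer)
print(drying_time(10, parallel=False))  # 20.0 (GPT-3.5's answer)
```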
Second was a “world model” scenario involving a ball placed into a bag with a hole larger than the ball, then shipped to a friend in London. GPT-3.5 concluded the ball would end up in the sealed box, which contradicts the hole-in-the-bag premise. GPT-4 delivered the expected result: the ball would fall out and be left behind in the sender’s office. The Mistral small model did not reach the same level of certainty or specificity, and the medium model improved the outcome, suggesting the ball likely fell out during transit—though it still wasn’t as clean as GPT-4.
Third came coding: generating a playable Snake game with a windowed UI. The small model produced incomplete code (not a full copyable program), and the resulting game behavior was flawed. The medium model generated a more complete Tkinter-based version with a working UI and score, but it still wasn’t perfect. GPT-4 produced the most reliable result, including correct gameplay behavior.
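None of the generated programs are reproduced in this summary. As an illustration of the core logic such a Snake game has to get right (growth on eating, wall and self collisions), here is a hypothetical UI-free sketch of the per-tick update, which a Tkinter loop would call and draw:

```python
def step(snake, direction, food, grid=20):
    """Advance the snake one cell on a grid x grid board.

    snake: list of (x, y) cells, head first; direction: (dx, dy).
    Returns (new_snake, ate, dead)."""
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    ate = head == food
    body = snake if ate else snake[:-1]   # grow only when food is eaten
    dead = (not (0 <= head[0] < grid and 0 <= head[1] < grid)  # wall hit
            or head in body)                                   # self collision
    return [head] + body, ate, dead

# One move right from (5, 5) onto food at (6, 5): the snake grows.
print(step([(5, 5), (4, 5)], (1, 0), (6, 5)))  # ([(6, 5), (5, 5), (4, 5)], True, False)
```

Keeping the game state separate from the Tkinter canvas like this is also what makes the logic testable, which is exactly where the small model's incomplete output fell short.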
Finally, streaming output was demonstrated across tiny, small, and medium models, with chunked text arriving quickly. Overall takeaway: Mistral’s API feels easy to integrate, and its Mixtral/Mistral models show notable strengths on reasoning tasks where GPT-3.5 can fail, while GPT-4 remains the benchmark for the toughest coding correctness. The tester ends by planning broader API comparisons beyond OpenAI, while also highlighting interest in the undocumented medium model and future benchmarks.
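When `stream` is enabled, chunks arrive as server-sent events in the OpenAI-style delta format Mistral's endpoint uses. A hedged sketch of reassembling them (the sample chunk shape below is an assumption, not output captured from the video):

```python
import json

def parse_sse_chunk(line: str):
    """Return the text delta from one server-sent-events line, or None
    for non-content lines and the terminating [DONE] sentinel."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return None
    event = json.loads(line[len("data: "):])
    return event["choices"][0]["delta"].get("content")

# Reassemble a streamed answer from sample chunks.
chunks = [
    'data: {"choices": [{"delta": {"content": "10 shirts "}}]}',
    'data: {"choices": [{"delta": {"content": "still take 10 hours."}}]}',
    'data: [DONE]',
]
text = "".join(c for c in map(parse_sse_chunk, chunks) if c)
print(text)  # 10 shirts still take 10 hours.
```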
Cornell Notes
Mistral AI’s API proved easy to use and competitively priced, with chat-completion endpoints that support streaming and a “safe mode.” In reasoning tests, Mixtral 8x7B (the “small” tier) and the Mistral “medium” tier both solved the “drying shirts” problem correctly, while GPT-3.5 gave a wrong linear time answer. On a shipping “world model” puzzle, GPT-4 delivered the most precise conclusion (ball left behind in the sender’s office), and Mistral’s medium improved but didn’t fully match GPT-4’s clarity. For coding a Tkinter Snake game, GPT-4 produced the most reliable full program; Mistral small was incomplete and medium was closer but still imperfect. Streaming worked well across tiers, making the API feel responsive for interactive use.
- Why did GPT-3.5 fail the “drying shirts” problem, and what did the Mistral models do differently?
- In the ball-and-bag shipping puzzle, what conclusion did GPT-4 reach that GPT-3.5 missed?
- How did Mistral small and Mistral medium perform on the shipping puzzle compared with GPT-4?
- What went wrong when generating the Snake game with Mistral small, and what improved with Mistral medium?
- What does the streaming test reveal about the API’s developer experience?
Review Questions
- Which assumption about drying conditions makes the correct “10 shirts take 10 hours” answer possible, and how did GPT-3.5 violate it?
- In the shipping puzzle, what role does the hole-in-the-bag play in determining where the ball ends up?
- Compare the failure modes of Mistral small vs Mistral medium on the Snake game: what was incomplete, and what became functional?
Key Points
1. Mistral AI’s API setup is straightforward, with chat-completion examples in Python and support for both streaming and non-streaming responses.
2. Model tiers are organized as tiny (Mistral 7B), small (Mixtral 8x7B), and medium (internal prototype), with the medium tier’s details largely undocumented.
3. On the drying-shirts logic problem, Mixtral 8x7B and Mistral medium returned the correct parallel-drying result (10 hours for 10 shirts), while GPT-3.5 returned an incorrect linear-scaling answer (20 hours).
4. On the ball-and-bag world-model puzzle, GPT-4 produced the most precise conclusion (ball left behind in the sender’s office), while Mistral small was less specific and Mistral medium improved but still didn’t fully match GPT-4’s clarity.
5. For coding a Tkinter Snake game, GPT-4 produced the most reliable full program; Mistral small returned incomplete code, and Mistral medium produced a closer but still imperfect version.
6. Streaming responses across the tiny, small, and medium tiers appear fast enough to support interactive use cases.
7. Pricing for the small tier appears competitive with ChatGPT 3.5 Turbo after token normalization, while the medium tier is higher but still positioned as usable for testing and iteration.