Comparing LLMs with LangChain
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Choosing a “good for production” large language model isn’t about picking the biggest name—it’s about matching model behavior to the task. A practical way to do that is to run the same prompts across multiple models (with controlled settings like low temperature) and compare outputs side-by-side. The results show that some models handle structured reasoning and fact extraction reliably, while others drift into verbosity, arithmetic mistakes, or even endless loops—differences that matter when you’re trying to reduce cost and risk in real deployments.
The comparison uses LangChain’s Model Laboratory to test about seven models from OpenAI, Cohere, and Hugging Face. Temperature is kept low to limit randomness, and the same prompt templates are reused so performance differences reflect model capability rather than prompt variation. In one simple “opposite of up” question, ChatGPT-style responses produce long step-by-step reasoning, while older completion-style models answer more directly. Some models get the right answer but add excessive or strange reasoning, which can be a problem if you need concise, dependable outputs.
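The comparison loop itself is simple to picture. Below is a library-free Python sketch of the pattern Model Laboratory automates: one prompt, several candidate models, outputs collected side by side. The stub functions are hypothetical stand-ins for real LLM clients (which would be called with low temperature), not actual API calls.

```python
# Library-free sketch of the side-by-side comparison pattern.
# The stub "models" below are hypothetical stand-ins for real
# LLM clients; in practice each would be invoked with a low
# temperature to limit randomness.

def chat_style_model(prompt: str) -> str:
    # Mimics a chat-tuned model: verbose, step-by-step answer.
    return "Let's think step by step. The opposite of up is down."

def completion_style_model(prompt: str) -> str:
    # Mimics an older completion-style model: terse, direct answer.
    return "down"

def compare(prompt: str, models: dict) -> dict:
    """Run the same prompt through every candidate model and
    return {model_name: output} for side-by-side inspection."""
    return {name: model(prompt) for name, model in models.items()}

results = compare(
    "What is the opposite of up?",
    {"chat": chat_style_model, "completion": completion_style_model},
)
for name, output in results.items():
    print(f"{name}: {output}")
```

Reusing one `compare` call per prompt keeps the prompt text and decoding settings fixed, so output differences reflect the models rather than the harness.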
More telling gaps appear in multi-step reasoning and math word problems. For a cafeteria apples problem (starting from 23 apples, using 20 for lunch, then adding 6), the FLAN instruction-tuned models perform best, producing the correct final number with coherent intermediate steps. Other models either skip critical steps (leading to wrong arithmetic), make logic errors, or spiral into repetitive output. The smaller FLAN variant is weaker than the larger one, suggesting that instruction tuning helps, but capacity still matters.
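The apples problem is a two-step calculation, and the common failure is collapsing it into one step. A quick check of the chain a model must reproduce:

```python
# The cafeteria apples problem decomposed into the two steps a
# model must not skip: subtract the apples used, then add the
# newly bought ones.
start, used, bought = 23, 20, 6

after_lunch = start - used    # 23 - 20 = 3 (the step models often skip)
final = after_lunch + bought  # 3 + 6 = 9

print(final)  # 9
```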
A third test probes knowledge-and-logic constraints using a question about whether Geoffrey Hinton can have a conversation with George Washington. Some models reason through living/deceased status and time overlap more effectively than others, but even strong performers can fail when they rely on incorrect factual assumptions (e.g., wrong birth/death years). The takeaway is that “reasoning” isn’t enough if the underlying facts are wrong; prompt design and model selection both affect reliability.
When the prompts shift toward creativity (telling a story) and common-sense physics (a bicycle mirror scenario), behavior diverges again. Some models generate story-like content but get stuck in loops or lose key plot constraints. For the bicycle/mirror question, larger instruction-tuned models do better at identifying a stationary-bike scenario, while smaller or completion-style models miss the intended physical setup.
Finally, fact extraction from a document shows a different pattern: many models can extract a clearly stated entity (like “OnePlus COO”) accurately. But when the task becomes vaguer—extracting a specific concept such as “foldables supply chain innovation”—models often drift into definitions or hallucinated expansions rather than pulling tightly from the provided context. Across these tests, the most cost-effective approach emerges: use cheaper models for tasks they handle well (like straightforward extraction), and reserve more capable models for reasoning-heavy or constraint-sensitive workflows.
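One cheap production guardrail for extraction tasks is to verify that the returned span actually appears in the source document, which catches the “generic definition” drift described above. A sketch of that check follows; the context string and both candidate answers are invented for illustration.

```python
# Guardrail sketch: accept an extraction only if it is grounded in
# the provided context. This catches answers that drift into generic
# definitions or hallucinated expansions. The context and answers
# below are invented examples, not content from the video.

def is_grounded(answer: str, context: str) -> bool:
    """True if every word of the answer occurs in the context."""
    ctx = context.lower()
    return all(word in ctx for word in answer.lower().split())

context = ("Jane Doe, COO of OnePlus, discussed innovation "
           "in the foldables supply chain.")

good = "Jane Doe"                        # explicit entity, stated verbatim
bad = "Foldables are phones that fold."  # generic definition, not grounded

print(is_grounded(good, context))  # True
print(is_grounded(bad, context))   # False
```

Word-level substring matching is deliberately crude; a production version might match token spans or require an exact quote, but even this simple check separates grounded extraction from definitional drift.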
Overall, the comparison framework turns model selection into an empirical process: plug in candidate models, run a suite of representative prompts, and choose based on which failure modes (verbosity, arithmetic errors, hallucinations, repetition loops) are acceptable for the target production use case.
Cornell Notes
Model choice for production should be task-specific, not brand-specific. Using LangChain’s Model Laboratory, the same prompts were sent to multiple OpenAI, Cohere, and Hugging Face models with low temperature, revealing consistent patterns: instruction-tuned FLAN models tended to handle multi-step reasoning and math word problems best, while older completion-style models often skipped steps or produced incorrect arithmetic. In knowledge-and-logic questions, some models reasoned well about living/deceased constraints but still failed when factual details were wrong. Fact extraction was easier for many models when the target entity was explicit, but vaguer extraction requests led to definitions and hallucinated content instead of grounded answers. The practical result: cheaper models can work for simpler extraction, while reasoning-heavy tasks benefit from stronger instruction-tuned models.
Why does the comparison keep temperature low, and what does that change in the results?
What failure mode shows up clearly in the cafeteria apples math problem?
How do models differ on the Geoffrey Hinton vs. George Washington conversation question?
Why does story generation produce inconsistent results across models?
What distinguishes easy fact extraction from hard extraction in the tests?
What practical production strategy emerges from these tests?
Review Questions
- Which specific prompt categories in the tests best reveal differences in reasoning reliability (math, constraint logic, creativity, common-sense physics, fact extraction)?
- In the cafeteria apples problem, what exact intermediate step is required to avoid the common arithmetic mistake?
- For vaguer extraction tasks like “foldables supply chain innovation,” what kinds of incorrect outputs do models tend to produce, and why does that matter for production?
Key Points
1. Use LangChain’s Model Laboratory to run the same prompt set across multiple candidate models with controlled decoding (e.g., low temperature) before deciding what’s production-ready.
2. Instruction-tuned models (notably FLAN variants) tend to outperform completion-style models on multi-step reasoning and arithmetic word problems.
3. Correct reasoning can still fail if the model’s underlying factual assumptions are wrong, as seen in the Geoffrey Hinton vs. George Washington constraint question.
4. Creative and open-ended prompts can trigger repetition loops; decoding controls like repetition penalties may be necessary for stable story generation.
5. Fact extraction is often easy when the target entity is explicit, but vaguer concept extraction increases hallucination risk and leads to generic definitions instead of context-grounded answers.
6. Model selection should be driven by the specific failure modes that are acceptable for the target workflow, not by overall reputation alone.
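The repetition-loop failure mode noted above can also be caught after generation, as a complement to decoding-time penalties. A small sketch of an n-gram repetition check; the n-gram size and threshold are illustrative choices, not values from the video.

```python
# Post-generation guardrail sketch: flag outputs stuck in a
# repetition loop by checking whether any n-word sequence repeats
# too many times. The n-gram size and threshold are illustrative.

def looks_looped(text: str, n: int = 3, max_repeats: int = 3) -> bool:
    """True if any n-word sequence appears more than max_repeats times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False

looping = ("and then the fox ran and then the fox ran "
           "and then the fox ran and then the fox ran")
normal = "The fox ran into the woods and the story ended happily."

print(looks_looped(looping))  # True
print(looks_looped(normal))   # False
```

In a comparison harness, flagged outputs can simply be scored as failures for that model, making the “repetition loop” failure mode measurable rather than anecdotal.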