Determinism in the AI Tech Stack (LLMs): Temperature, Seeds, and Tools
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Determinism in LLM outputs is achievable—but only in specific ways, and often only when randomness controls are paired with “deterministic” software scaffolding. A hands-on set of experiments shows that lowering temperature from 1 to 0 can dramatically stabilize an open-source model’s responses, yet it may still produce wrong answers. The most reliable path in practice comes from combining an LLM with conventional code execution: even when the open model struggles to solve a math equation directly, it can generate executable Python that then produces the correct result.
The tests begin with a baseline “software stack” approach: a Python program computes a target equation and returns a consistent numeric answer (11 16.95). Running the same task through an open-source LLM with temperature set to 1 produces wildly different outputs across repeated runs—72.34, 117.0, 147.450—making the result unusable for anything requiring repeatability. Switching temperature to 0 reduces variability sharply: repeated runs cluster around ~115.8–115.9, showing much stronger consistency even though the answer remains incorrect.
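The baseline can be sketched in a few lines of Python. The transcript does not state the actual equation, so the formula below is a placeholder; the point is only that plain code returns the identical number on every run:

```python
# Hypothetical stand-in for the video's "software stack" baseline.
# The real equation is not given in the transcript; this placeholder
# arithmetic just demonstrates run-to-run determinism.
def solve_equation(x: float) -> float:
    return 3 * x**2 + 0.95  # swap in the actual equation

# Repeated runs always land in a single-element set: zero variance.
results = {solve_equation(6.0) for _ in range(5)}
assert len(results) == 1
```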
To interpret why, the transcript ties temperature to how token probabilities are sampled. At higher temperature, the model spreads probability mass across multiple next-token options, increasing diversity and creativity but also randomness. At temperature 0, sampling collapses toward the most likely token choices, yielding repeatable outputs—though potentially “boring” and still not guaranteed correct.
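The sampling behavior described above can be illustrated with a toy next-token sampler. This is a sketch of the general temperature-scaled softmax mechanism, not any particular model's implementation; the logits are made up:

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from logits after temperature scaling.

    temperature == 0 collapses to greedy argmax decoding;
    higher temperatures spread probability mass across options.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.5, 0.2]  # hypothetical next-token scores
rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(10)}    # one index only
varied = {sample_token(logits, 1.0, rng) for _ in range(50)}  # several indices
```

At temperature 0 the set `greedy` contains a single index regardless of how many times we sample, mirroring the repeatable-but-possibly-wrong behavior seen in the experiments; at temperature 1 the sampler visits multiple tokens.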
A key turning point arrives when a stronger foundation model (GPT-4) is used with temperature set to 0. Unlike the smaller open model, GPT-4 returns the correct equation result (11 16.95) consistently across repeated runs. That contrast frames the core tradeoff: temperature control improves stability, model capability determines correctness.
For the smaller open model (referred to as “f3”), the transcript demonstrates a hybrid strategy. Instead of asking the model to directly compute the equation, it prompts the model to generate Python code for the problem, then executes that code deterministically. With this tool-augmented setup, the system reaches the correct answer even when direct LLM-only attempts fail.
The final section shifts from temperature to OpenAI’s “seed” parameter in chat completions. In a beta feature, repeated requests using the same seed and parameters are intended to sample deterministically “with best effort,” while acknowledging determinism is not guaranteed and backend changes can affect results (tracked via a system fingerprint). In a simple prompt (“best pizza in New York” with a cheese constraint), seeded calls return the same answer repeatedly (attributed to “FAR Pizza in Brooklyn”), while unseeded calls vary.
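The seeded request described above can be sketched as a Chat Completions payload. The model name and prompt wording are illustrative, and no request is sent here; in a real call, the response's `system_fingerprint` field is what you compare across runs to detect backend changes:

```python
# Shape of a seeded chat-completions request (illustrative values).
payload = {
    "model": "gpt-4",
    "messages": [
        {"role": "user",
         "content": "Best pizza in New York? It must have cheese."}
    ],
    "seed": 42,        # same seed + same params -> best-effort determinism
    "temperature": 0,  # further narrows sampling variance
}
# After each response, record system_fingerprint: if it differs between
# runs, the backend changed and outputs may differ despite the seed.
```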
Overall, the transcript lands on a practical conclusion: LLMs are inherently probabilistic and not designed to behave like traditional deterministic software. But stability improves with temperature=0, and repeatability can be further strengthened with seeds and—most effectively—by adding deterministic tools such as code execution and other grounding techniques to reduce hallucinations. The “hallucination” question is left open as a feature-versus-bug tension: creativity and novelty require some uncertainty, yet many real use cases demand repeatable, precise outputs.
Cornell Notes
LLM outputs can be made more repeatable, but not in the same way as traditional software. In experiments, an open-source model produced highly inconsistent math results at temperature=1, while temperature=0 greatly reduced variation—though the answer was still wrong. GPT-4 with temperature=0 solved the equation correctly and consistently, highlighting that determinism controls don’t replace model capability. The most dependable approach for the weaker model was tool augmentation: prompt it to generate executable Python, then run the code deterministically. The transcript also tests OpenAI’s beta “seed” parameter, finding that seeded requests can return the same answer repeatedly for a simple prompt, even though determinism is not guaranteed and backend changes may alter outputs.
Why does changing temperature from 1 to 0 affect determinism so strongly?
What did the math experiments show about “consistency” versus “correctness”?
How does combining an LLM with deterministic software change the outcome?
What is OpenAI’s “seed” parameter intended to do, and what limitation remains?
In the pizza example, how did seeded vs unseeded requests behave?
Review Questions
- In the experiments, which factor most directly reduced output variance: temperature, model choice, or tool execution? Explain with one concrete example.
- Why can temperature=0 produce stable outputs that are still incorrect?
- How does the seed parameter differ from temperature in its effect on repeatability?
Key Points
1. Temperature is a primary lever for repeatability: temperature=0 collapses sampling toward the most likely tokens and reduces run-to-run variation.
2. Lowering temperature improves consistency but does not guarantee correctness; model capability still determines whether the right answer is produced.
3. Hybrid systems—prompting an LLM to generate code and then executing it—can turn probabilistic generation into deterministic computation.
4. Stronger foundation models (e.g., GPT-4) can solve tasks correctly under temperature=0, making LLM-only determinism more practical for some problems.
5. OpenAI’s beta seed parameter can increase repeatability for identical prompts and parameters, but determinism is not guaranteed and backend changes may alter results (track via system fingerprint).
6. Hallucinations are treated as a tradeoff: generative uncertainty can enable creativity, but many applications require grounding and deterministic tooling to reduce errors.