Many discussions on LinkedIn this week have conflated two distinct issues relating to language models: determinism (reproducibility) and stochastic token sampling (response diversity). They are not the same thing and should not be treated as if they were.

Neural networks, including language models, are algorithmically deterministic: given fixed weights, the same input, and identical arithmetic operations, the forward pass will always produce the same output. Stochastic decoding methods such as temperature or top-k sampling introduce controlled randomness after this forward pass to generate varied responses. If you fix the random seed, those sampling draws are repeatable, so you can set a higher temperature, say 0.7, and still obtain diverse yet reproducible outputs.

It's important to distinguish between sampling parameters and the seed itself. Setting the random seed to a fixed value simply initialises the pseudorandom number generator so that the sequence of draws is reproducible; it does not disable sampling or force greedy decoding. By contrast, setting temperature to 0 eliminates randomness in token selection and yields greedy decoding, which always chooses the highest-probability token. These two settings, seed and temperature, control different aspects of generation and are frequently misunderstood.

The true source of irreproducibility at scale lies in the hardware and software stacks used for inference. GPU kernels, parallel reductions, and mixed-precision arithmetic can introduce tiny floating-point differences from run to run. When workloads are batched or distributed across devices, these minor variations can alter the logits slightly and change the sampled token even with a fixed seed. This is hardware-level non-determinism, not a property of the model’s algorithms.

These are separate concerns: seeding and token sampling control algorithmic randomness, while hardware determinism governs whether the underlying probability distribution is stable. Progress toward fully deterministic inference at scale, which the Thinking Machines Lab paper presents, is therefore a positive step forward. It ensures the forward pass is exactly reproducible, yet it places no limits on stochastic sampling. You can still use temperature, nucleus sampling, or any other decoding strategy, now with the added confidence that fixed seeds will behave predictably. Recognising this distinction matters.

#artificialintelligence #machinelearning #languagemodels
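To make the seed-versus-temperature point concrete, here is a minimal sketch, assuming PyTorch. The toy logits stand in for a real model's forward-pass output; nothing here is from any particular model or API.

```python
# Minimal sketch (PyTorch assumed; toy logits stand in for a real forward pass).
# Shows that a fixed seed makes temperature sampling repeatable, while
# temperature 0 means greedy decoding (argmax) with no randomness at all.
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -1.0])  # pretend output over a 4-token vocabulary

def sample(logits, temperature, seed=None):
    if temperature == 0.0:
        # Greedy decoding: always the highest-probability token, no seed needed.
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    gen = torch.Generator()
    if seed is not None:
        gen.manual_seed(seed)  # fixes the PRNG state; it does not disable sampling
    return int(torch.multinomial(probs, num_samples=1, generator=gen))

print(sample(logits, temperature=0.7, seed=42))  # same token on every run
print(sample(logits, temperature=0.7, seed=42))  # identical to the line above
print(sample(logits, temperature=0.0))           # greedy: always the argmax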
I still question whether the small floating-point differences are really down to the GPUs' VRAM. I take your meaning about mixed precision, and my understanding from the Thinking Machines blog is that their goal is solvable if everything is a double, or double-double, and so on. I'm not going to be awful about it; I will await the results. I'm just massively surprised if no one has already tried that, because A) as a C++ programmer you know this comes up a lot, and B) the IDE in most cases tells you, every moment of every day, where you have mismatched precision. After that it's a find-and-replace task through the header files. So I just want to add: if Thinking Machines are correct, it is a bit foolish of the other companies not to have fixed the precision issue, and kind of startling that no one has already tried this. If, though, it is related to GPU architecture and the inner workings of CUDA squashing precision during GPU calculations, I could see why that would be an issue, though then I don't see how Thinking Machines can fix this in NVIDIA's code. I'd also point out this would not be an unalloyed good, as VRAM is the training bottleneck, so it would lower training speeds. I really do not see the 2B opportunity. But I might be being a d**** here so…
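On the precision point, a toy sketch (NumPy assumed, not from either the post or the blog) of why wider types alone may not settle it: floating-point addition is not associative, so summing the same values in a different order, as a different kernel or batch split might, can change the last bits of the result even before precision width comes into play.

```python
# Toy illustration (NumPy assumed): same values, different reduction order,
# slightly different sum. This is the "parallel reductions" effect the post
# mentions, and it exists in float32 and, to a smaller degree, in float64 too.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

forward = np.sum(x)                     # one summation order
shuffled = np.sum(rng.permutation(x))   # same numbers, different order
print(forward, shuffled, forward - shuffled)  # typically differs in the last bits
```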
Great post and beautifully laid out.
Nicely laid out
But this has no effect on accuracy or error rates. It only fixes the reproducibility of right, wrong, or “I don’t know” answers. The underlying problem, that LLMs are not knowledge models, is fundamental and cannot be fixed. Nor are they reasoning engines.
Good points. But decoding still plays a big role because of the autoregressive nature of generation. If greedy decoding has a clear argmax, there's no issue; that changes if the probabilities are on a knife-edge. In stochastic decoding every step introduces sampling variance. The RNG seed controls reproducibility, but the autoregressive loop means that once a single token diverges, the rest of the sequence evolves down a different trajectory. Send a model the same prompt several times to get a sense of this firsthand.
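A toy sketch of that divergence point, using the Python standard library only; the "model" here is a fake, hand-written next-token distribution over a three-token vocabulary, not a real LLM.

```python
# Toy autoregressive loop: the fake "model" is deterministic given the previous
# token, and the distribution is nearly a knife-edge. With the same seed the
# trajectory is reproducible; change the seed and a single early token flip
# conditions every later step on a different prefix, so the sequences diverge.
import random

def fake_next_token_probs(prev_token):
    # Deterministic stand-in for a forward pass over a 3-token vocabulary.
    return {0: [0.51, 0.49, 0.00], 1: [0.10, 0.45, 0.45], 2: [0.30, 0.30, 0.40]}[prev_token]

def generate(seed, length=10):
    rng = random.Random(seed)          # fixed seed -> reproducible sampling draws
    seq = [0]
    for _ in range(length):
        probs = fake_next_token_probs(seq[-1])
        seq.append(rng.choices([0, 1, 2], weights=probs)[0])
    return seq

print(generate(seed=7))   # identical on every run: same seed, same trajectory
print(generate(seed=7))
print(generate(seed=8))   # typically flips an early token, then follows a different path
```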
Thank you for sharing this! Very insightful
This assumes we have the ability to fix the seed. Can we?
A system is deterministic if it consistently produces the same output for a given input, regardless of whether that output is right or wrong. For example, a language model can be a deterministic system, consistently generating a hallucinated answer like "23×47=400." It is fully reproducible, but entirely incorrect. This is the key challenge of reproducibility in probabilistic models. While a batch-invariant kernel (the Thinking Machines Lab paper) ensures we can replicate an LLM's behavior across runs, we are simply reproducing a probabilistic outcome. The model, when it fails, fails consistently. A true computational engine is fundamentally different; it guarantees correctness by its very design, not merely consistency.

This presents an architectural dilemma. If we rely on external computational engines for every task requiring absolute accuracy, the LLM is reduced to little more than a semantic wrapper. Its role shifts from being a reasoning system in its own right to a mere interface for a deterministic solver. That subverts the core purpose of a generative model and reduces it to a less flexible, more expensive API call.
People are latching on to the announcements from Mira Murati
Since you mentioned this is their side project, I’ll wait for the presumptive Thing that Really Fixes Things [TM]