Many discussions on LinkedIn this week have conflated two distinct issues relating to language models: determinism (reproducibility) and stochastic token sampling (response diversity). They are not the same thing and should not be treated as if they were.

Neural networks, including language models, are algorithmically deterministic: given fixed weights, the same input, and identical arithmetic operations, the forward pass will always produce the same output. Stochastic decoding methods such as temperature or top-k sampling introduce controlled randomness after this forward pass to generate varied responses. If you fix the random seed, those sampling draws are repeatable, so you can set a higher temperature, say 0.7, and still obtain diverse yet reproducible outputs.

It's important to distinguish between sampling parameters and the seed itself. Setting the random seed to a fixed value simply initialises the pseudorandom number generator so that the sequence of draws is reproducible; it does not disable sampling or force greedy decoding. By contrast, setting temperature to 0 eliminates randomness in token selection and yields greedy decoding, which always chooses the highest-probability token. These two settings, seed and temperature, control different aspects of generation and are frequently misunderstood.

The true source of irreproducibility at scale lies in the hardware and software stacks used for inference. GPU kernels, parallel reductions, and mixed-precision arithmetic can introduce tiny floating-point differences from run to run. When workloads are batched or distributed across devices, these minor variations can alter the logits slightly and change the sampled token even with a fixed seed. This is hardware-level non-determinism, not a property of the model’s algorithms.

These are separate concerns: seeding and token sampling control algorithmic randomness, while hardware determinism governs whether the underlying probability distribution is stable. Progress toward fully deterministic inference at scale, which the Thinking Machines Lab paper presents, is therefore a positive step forward. It ensures the forward pass is exactly reproducible, yet it places no limits on stochastic sampling. You can still use temperature, nucleus sampling, or any other decoding strategy, now with the added confidence that fixed seeds will behave predictably. Recognising this distinction matters.

#artificialintelligence #machinelearning #languagemodels
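To make the seed-versus-temperature point concrete, here is a minimal sketch, assuming PyTorch. The toy logits stand in for a real model's forward-pass output; nothing here is from any particular model or API.

```python
# Minimal sketch (PyTorch assumed; toy logits stand in for a real forward pass).
# Shows that a fixed seed makes temperature sampling repeatable, while
# temperature 0 means greedy decoding (argmax) with no randomness at all.
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -1.0])  # pretend output over a 4-token vocabulary

def sample(logits, temperature, seed=None):
    if temperature == 0.0:
        # Greedy decoding: always the highest-probability token, no seed needed.
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    gen = torch.Generator()
    if seed is not None:
        gen.manual_seed(seed)  # fixes the PRNG state; it does not disable sampling
    return int(torch.multinomial(probs, num_samples=1, generator=gen))

print(sample(logits, temperature=0.7, seed=42))  # same token on every run
print(sample(logits, temperature=0.7, seed=42))  # identical to the line above
print(sample(logits, temperature=0.0))           # greedy: always the argmax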
I still question whether the small floating-point differences are really down to the GPUs' VRAM. I take your meaning about mixed precision, and my understanding from the Thinking Machines blog is that their goal is solvable if everything is a double, or double-double, and so on. I'm not going to be awful about it; I will await the results. I'm just massively surprised if no one has already tried that, because A) as a C++ programmer you know this comes up a lot, and B) the IDE in most cases tells you, every moment of every day, where you have mismatched precision. After that it's a find-and-replace task through the header files. So I just want to add: if Thinking Machines are correct, it is a bit foolish of the other companies not to have fixed the precision issue, and kind of startling that no one has already tried this. If, though, it is related to GPU architecture and the inner workings of CUDA squashing precision during GPU calculations, I could see why that would be an issue, though then I don't see how Thinking Machines can fix this in NVIDIA's code. I'd also point out this would not be an unalloyed good, as VRAM is the training bottleneck, so it would lower training speeds. I really do not see the 2B opportunity. But I might be being a d**** here so…
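On the precision point, a toy sketch (NumPy assumed, not from either the post or the blog) of why wider types alone may not settle it: floating-point addition is not associative, so summing the same values in a different order, as a different kernel or batch split might, can change the last bits of the result even before precision width comes into play.

```python
# Toy illustration (NumPy assumed): same values, different reduction order,
# slightly different sum. This is the "parallel reductions" effect the post
# mentions, and it exists in float32 and, to a smaller degree, in float64 too.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

forward = np.sum(x)                     # one summation order
shuffled = np.sum(rng.permutation(x))   # same numbers, different order
print(forward, shuffled, forward - shuffled)  # typically differs in the last bits
```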
Great post and beautifully laid out.
Nicely laid out
But this has no effect on accuracy or error rates. It only fixes the reproducibility of right, wrong, or “I don’t know” answers. The underlying problem, that LLMs are not knowledge models, is fundamental and cannot be fixed. Nor are they reasoning engines.
Good points. But decoding still plays a big role because of the autoregressive nature of generation. If greedy decoding has a clear argmax, there's no issue; that changes if the probabilities are on a knife-edge. In stochastic decoding every step introduces sampling variance. The RNG seed controls reproducibility, but the autoregressive loop means that once a single token diverges, the rest of the sequence evolves down a different trajectory. Send a model the same prompt several times to get a sense of this firsthand.
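A toy sketch of that divergence point, using the Python standard library only; the "model" here is a fake, hand-written next-token distribution over a three-token vocabulary, not a real LLM.

```python
# Toy autoregressive loop: the fake "model" is deterministic given the previous
# token, and the distribution is nearly a knife-edge. With the same seed the
# trajectory is reproducible; change the seed and a single early token flip
# conditions every later step on a different prefix, so the sequences diverge.
import random

def fake_next_token_probs(prev_token):
    # Deterministic stand-in for a forward pass over a 3-token vocabulary.
    return {0: [0.51, 0.49, 0.00], 1: [0.10, 0.45, 0.45], 2: [0.30, 0.30, 0.40]}[prev_token]

def generate(seed, length=10):
    rng = random.Random(seed)          # fixed seed -> reproducible sampling draws
    seq = [0]
    for _ in range(length):
        probs = fake_next_token_probs(seq[-1])
        seq.append(rng.choices([0, 1, 2], weights=probs)[0])
    return seq

print(generate(seed=7))   # identical on every run: same seed, same trajectory
print(generate(seed=7))
print(generate(seed=8))   # typically flips an early token, then follows a different path
```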
Thank you for sharing this! Very insightful
This assumes we have the ability to fix the seed. Can we?
A system is deterministic if it consistently produces the same output for a given input, regardless of whether that output is right or wrong. For example, a language model can be a deterministic system, consistently generating a hallucinated answer like "23×47=400." It is fully reproducible, but entirely incorrect. This is the key challenge of reproducibility in probabilistic models. While a batch-invariant kernel (the Thinking Machines Lab paper) ensures we can replicate an LLM's behavior across runs, we are simply reproducing a probabilistic outcome. The model, when it fails, fails consistently. A true computational engine is fundamentally different; it guarantees correctness by its very design, not merely consistency.

This presents an architectural dilemma. If we rely on external computational engines for every task requiring absolute accuracy, the LLM is reduced to little more than a semantic wrapper. Its role shifts from being a reasoning system in its own right to a mere interface for a deterministic solver. That subverts the core purpose of a generative model and reduces it to a less flexible, more expensive API call.
People are latching on to the announcements from Mira Murati
Since you mentioned this is their side project, I’ll wait for the presumptive Thing that Really Fixes Things [TM]