Anthropic's post-mortem is a critical analysis of the challenges of productionizing large-scale models. The latent XLA:TPU compiler bug that distorted token selection, surfacing alongside a bf16/fp32 precision mismatch in the sampling path, is particularly telling. It signals that our compiler toolchains, often built on MLIR dialects, lack the robust formal verification needed for low-precision formats like FP8 (E4M3/E5M2). Ensuring gradient fidelity post-quantization isn't just a software problem; it necessitates rethinking ALU microarchitectures and using SAT/SMT solvers to provably guarantee numerical stability in fused kernels.
The tokenizer and canary issues highlight a failure in managing IID assumptions. The subtle padding difference created a covariate shift between training and inference, a classic data-centric failure mode. Furthermore, canarying stochastic models with high Kolmogorov complexity using simple A/B tests becomes pointless: it fails to detect latent issues like representational collapse. True observability requires more sophisticated techniques, such as monitoring the Mahalanobis distance in the embedding space to detect semantic drift in real time.
These incidents prove that a siloed approach is obsolete. We need a new discipline of full-stack AI engineering, where optimizations are co-designed from the gate level of the silicon and its on-chip network (NoC) topology all the way up to the distributed serving stack. https://lnkd.in/gmh22yRV
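To make the last suggestion concrete, here is a minimal sketch (my own illustration, not Anthropic's tooling) of a Mahalanobis-distance drift monitor: fit a mean and covariance on embeddings from a known-good deployment, then flag production embeddings that fall far outside that reference distribution. The dimensions, data, and alert threshold are all placeholder assumptions.

```python
import numpy as np

def fit_reference(embeddings: np.ndarray, eps: float = 1e-6):
    """Fit mean and regularized inverse covariance on known-good embeddings."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + eps * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Distance of one embedding from the reference distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Synthetic stand-ins for real embeddings: a reference set from a known-good
# deployment and a "live" embedding produced after a serving-stack change.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10_000, 64))
mu, cov_inv = fit_reference(reference)

# Alert threshold taken from the tail of the reference distances themselves.
ref_dists = np.array([mahalanobis(e, mu, cov_inv) for e in reference[:2_000]])
threshold = float(np.quantile(ref_dists, 0.999))

live = rng.normal(loc=1.5, size=64)  # shifted on purpose to simulate drift
if mahalanobis(live, mu, cov_inv) > threshold:
    print("possible semantic drift: investigate recent serving changes")
```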
Tal Stavi’s Post
More Relevant Posts
-
Anthropic's Postmortem: 3 Infra Bugs That Broke Claude
Anthropic published a postmortem on three infra bugs that degraded Claude's quality (Aug–Sep).
The bugs:
1️⃣ Context window routing error – Sonnet 4 requests were misrouted after a load-balancing change.
2️⃣ Output corruption – a TPU misconfiguration broke a runtime optimization, producing garbled tokens.
3️⃣ XLA:TPU miscompilation – a latent compiler bug surfaced, distorting token selection.
Because the bugs overlapped, users saw different failures at different rates, which made diagnosis hard. The gaps exposed limits in their evaluation pipeline. Next steps: continuous production evals + faster privacy-safe debugging tools.
A good read for anyone in MLOps, SRE, or AI infra on the real challenges of keeping models reliable across GPUs and TPUs. https://lnkd.in/dWdG-nKQ
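As a rough illustration of what a "continuous production eval" can look like in its simplest form: run a small probe set through the production serving path on a schedule and alert when the pass rate drops. This is my own sketch, not Anthropic's pipeline; `generate` is a hypothetical stand-in for the production client, stubbed here so the snippet runs.

```python
import random

# Small fixed probe set with cheap, deterministic pass/fail checks.
PROBES = [
    ("What is 2 + 2? Answer with just the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def generate(prompt: str) -> str:
    """Hypothetical call into the production serving path; stubbed for the sketch."""
    return "4" if "2 + 2" in prompt else "Paris"

def eval_round(n: int = 200) -> float:
    """Run n probes through production and return the pass rate."""
    passed = 0
    for _ in range(n):
        prompt, expected = random.choice(PROBES)
        passed += expected in generate(prompt)
    return passed / n

# Run this on a schedule against the *production* path (not an offline replica),
# and alert when the pass rate drops below a rolling baseline.
if eval_round() < 0.97:
    print("quality regression suspected on the production serving path")
```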
-
Here's what's so crucial about evals and monitoring in AI: software and/or infrastructure changes can have a meaningful impact on the ACCURACY of the models, not just on latency or other performance metrics. I find this a bit mind-blowing, especially when you start to think about how complex it is for enterprises that decide to run their own models. One of the problems Anthropic had occurred after a runtime optimization was deployed, and the result was that some people saw Thai or Chinese words inserted into their English responses. This postmortem is really fascinating. It's a snapshot in time of how model providers are figuring out monitoring best practices as they go. #AIEvals
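For the specific symptom mentioned above (Thai or Chinese characters appearing in English responses), even a very simple monitor would have raised a flag. A toy sketch, assuming an English-only prompt set; the 5% threshold is an arbitrary placeholder.

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are not Latin-script letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for c in letters if not unicodedata.name(c, "").startswith("LATIN"))
    return foreign / len(letters)

# Hypothetical usage inside a serving-side quality check on English traffic.
response = "The quick brown fox สวัสดี jumps over the lazy dog"
if foreign_script_ratio(response) > 0.05:
    print("alert: unexpected non-Latin characters in an English response")
```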
Yesterday we (Anthropic) published an engineering blog post which is a public discussion of some of the correctness/quality issues that have affected our models for the past several weeks. If you're curious about this topic, please do read the whole post.
For many years I have talked (https://lnkd.in/ep8fMRTu, for example, from back in 2022) about the way that infrastructure and software problems can manifest as quality problems in complex ML systems. Previously, most of my public examples were for training systems, with my favorite examples being ones about systematically biased skipping of training data. This set of failures shows how these problems can manifest in serving systems as well.
There are a bunch of big takeaway lessons here about testing and monitoring, and we're working on those, of course. But during this period our models were routinely showing no quality problems in benchmarking (full SWEBench repeated, for example) because the problems were intermittent and, in one case, caused by traffic routing in the production system. Detecting these problems in production continuously is very, very hard. We don't purposefully degrade models in production. But the serving software systems are continuously maintained and updated, and in this case we can see the ways that this can impact users.
I want to say publicly: it's been a rough summer for us, reliability-wise. Prior to this set of issues we had capacity and reliability problems throughout much of July and August (those are getting somewhat better now due to focused work by dozens of people). I'm very sorry for the problems, and we're working hard to bring you the best models at the highest level of quality and availability we can. https://lnkd.in/eDaRFujP
-
Incredibly insightful post. The deep dive on the bf16/fp32 floating point mismatch and how the latent XLA:TPU compiler bug was masking precision issues is a huge infrastructure lesson. Balancing performance (approximate top-k) with non-negotiable model quality is clearly the real scaling challenge. Thank you Todd Underwood for the detailed post!
-
Postmortem by Anthropic explaining why Claude produced some particularly erroneous results in August and early September and, in useful detail, what it took to find and correct these particular bugs. It puts some scope around the 'all responses should be regarded as unreliable' expectation.
-
A very candid share and a great source of learning on model-serving reliability. Feedback, especially well-structured feedback, is key, as correctly pointed out. What deserves more emphasis as a corrective action is the role of observational analytics, using LLMs themselves for causal inference. Imagine neural pathways forming and activating as certain failure signals begin to connect.
-
The transparency in sharing root causes and fixes is the gold standard for AI product reliability. Kudos to the Anthropic team! It's a fantastic read. Some core technical lessons stood out:
• Infra and software issues are quality problems in complex ML systems. Load-balancing issues or compiler flags can directly degrade model outputs, not just latency.
• Sampling matters: top-k, top-p, and temperature are not "set-and-forget". Approximate implementations or precision mismatches (bf16 vs fp32) can drop high-probability tokens, producing worse outputs (a toy illustration follows below).
• Optimization is not always improvement. Runtime or hardware optimizations must be tested rigorously; a small efficiency improvement can corrupt outputs if not validated end-to-end.
• User privacy is non-negotiable, but it complicates debugging. Limited access to production interactions means issues are harder to reproduce, so strong monitoring and synthetic benchmarks become critical.
• Consistency across environments is key. The same model should behave identically regardless of platform (GCP, AWS, Anthropic native), chip type (AWS Trainium, GPUs, TPUs), or server configuration.
👉 As ML systems scale, 'observability for quality' becomes critical. #MLEngineering #Reliability #Infrastructure
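The sampling bullet above is easy to demonstrate with a toy example (my own illustration, not Anthropic's code, using float16 as a stand-in for bf16): two logits that are distinct in float32 can round to the same value in a lower-precision format, so where the cast happens changes which token wins.

```python
import numpy as np

# Two near-tied logits plus some distractors.
logits_fp32 = np.array([10.1234, 10.1236, 3.0, 1.0], dtype=np.float32)

# Greedy selection in full precision picks token 1 (10.1236 > 10.1234) ...
print("argmax in fp32:", int(np.argmax(logits_fp32)))                     # -> 1

# ... but after casting, both leading logits round to 10.125 in float16,
# and the tie is broken toward token 0 instead.
print("argmax in fp16:", int(np.argmax(logits_fp32.astype(np.float16))))  # -> 0
```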
-
Anthropic shared a detailed breakdown of three recent production incidents while serving LLMs at massive scale (billions of requests). https://lnkd.in/gNBfBEMM
The issues highlight how difficult and unpredictable ML infrastructure can be when models are both stochastic and resource-intensive.
Key issues encountered:
1. Load-balancing misrouting.
2. Precision mismatch on TPUs.
3. An unexpected TPU bug that surfaced only after fixing (2), impacting the top-k sampling function.
Resolutions: (1) fix the routing logic, (2) apply a workaround, (3) roll back.
Impact and cost: time from initial report to resolution was over one month.
Lessons learned: More comprehensive testing is essential (a minimal consistency-check sketch follows below). Smoother and faster channels for user feedback are critical. Despite advances, production ML infra still relies heavily on experienced human engineers; intuition and judgment remain irreplaceable. AI systems can assist, but we are still far from fully automated operations.
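On the "more comprehensive testing" lesson, one concrete shape such a test can take is a consistency check between an approximate top-k routine and an exact reference. A sketch under assumptions: `approx_top_k` below is a hypothetical placeholder for whatever fused or approximate kernel the serving stack really uses.

```python
import numpy as np

def exact_top_k(logits: np.ndarray, k: int) -> set[int]:
    """Reference implementation: indices of the k largest logits."""
    return set(np.argsort(logits)[-k:].tolist())

def approx_top_k(logits: np.ndarray, k: int) -> set[int]:
    # Placeholder: a real implementation would be a bucketed/partial-sort kernel
    # running on the same device and dtype as production.
    return exact_top_k(logits, k)

def check_top_k_consistency(n_trials: int = 200, vocab: int = 32_000, k: int = 40) -> None:
    rng = np.random.default_rng(42)
    for _ in range(n_trials):
        logits = rng.gumbel(size=vocab).astype(np.float32)
        missing = exact_top_k(logits, k) - approx_top_k(logits, k)
        assert not missing, f"approximate top-k dropped tokens: {sorted(missing)}"

check_top_k_consistency()
```

Because these defects are silent in aggregate metrics, the check asserts on exact set membership rather than on downstream quality scores.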
-
A case study in transparency and root-cause analysis, stated plainly by Anthropic. Between August and early September, three infrastructure bugs affected Claude's response quality; at peak impact, 16% of Sonnet 4 requests were degraded. Anthropic's response demonstrates what accountability looks like in AI. The root causes were thoroughly identified:
• Routing errors sending requests to the wrong server configurations
• TPU misconfigurations causing random character corruption
• Compiler bugs affecting token generation during text output
Their transparency extends to sharing specific technical details and prevention measures:
• Enhanced evaluations to distinguish working vs broken implementations
• Continuous quality monitoring on production systems
• Improved debugging tools that preserve user privacy
As Anthropic stated: "We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone." This level of technical transparency and effective root-cause analysis sets the standard. Anthropic has consistently demonstrated this approach across its models and operations. Keep it up. https://lnkd.in/gtMndT4g
-
Anthropic has published a technical report on how they resolved three bugs that intermittently degraded Claude's responses. There is a lot to learn from a close read of the report, which covers areas like routing logic, output corruption, and a challenging XLA compiler bug. It's understandable that the public write-up has to stay at a high level; their internal report should cover more detailed questions, such as whether the routing issue was faulty service discovery, a stale routing table, or a load-balancer misconfig, and whether the output corruption was due to object serialisation or something else. The XLA compiler bug is less straightforward, since precision can change during bf16 <-> fp32 casting. No doubt their internal, deep SRE-level RCA captured all of this, and guardrails have been applied as corrective actions to avoid such failures in the future. #Anthropic #RCA #AI #ML #Claude #AiAgent https://lnkd.in/dVT3ssrS
Lead Application Engineer at Rakuten Pay
Good share. Approximate top-k defects in distributed systems are super silent.