Anthropic's post-mortem is a critical analysis of the challenges of productionizing large-scale models. The latent XLA:TPU compiler bug that distorted token selection, surfacing alongside a bf16/fp32 precision mismatch in the sampling path, is particularly telling. It signals that our compiler toolchains, often built on MLIR dialects, lack the robust formal verification needed for low-precision formats like FP8 (E4M3/E5M2). Ensuring gradient fidelity post-quantization isn't just a software problem; it necessitates rethinking ALU microarchitectures and using SAT/SMT solvers to provably guarantee numerical stability in fused kernels.
The tokenizer and canary issues highlight a failure in managing IID assumptions. The subtle padding difference created a covariate shift between training and inference, a classic data-centric failure mode. Furthermore, canarying stochastic models with high Kolmogorov complexity using simple A/B tests becomes pointless: it fails to detect latent issues like representational collapse. True observability requires more sophisticated techniques, such as monitoring the Mahalanobis distance in the embedding space to detect semantic drift in real time.
These incidents prove that a siloed approach is obsolete. We need a new discipline of full-stack AI engineering, where optimizations are co-designed from the gate level of the silicon and its on-chip network (NoC) topology all the way up to the distributed serving stack. https://lnkd.in/gmh22yRV
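To make the last suggestion concrete, here is a minimal sketch (my own illustration, not Anthropic's tooling) of a Mahalanobis-distance drift monitor: fit a mean and covariance on embeddings from a known-good deployment, then flag production embeddings that fall far outside that reference distribution. The dimensions, data, and alert threshold are all placeholder assumptions.

```python
import numpy as np

def fit_reference(embeddings: np.ndarray, eps: float = 1e-6):
    """Fit mean and regularized inverse covariance on known-good embeddings."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + eps * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Distance of one embedding from the reference distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Synthetic stand-ins for real embeddings: a reference set from a known-good
# deployment and a "live" embedding produced after a serving-stack change.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10_000, 64))
mu, cov_inv = fit_reference(reference)

# Alert threshold taken from the tail of the reference distances themselves.
ref_dists = np.array([mahalanobis(e, mu, cov_inv) for e in reference[:2_000]])
threshold = float(np.quantile(ref_dists, 0.999))

live = rng.normal(loc=1.5, size=64)  # shifted on purpose to simulate drift
if mahalanobis(live, mu, cov_inv) > threshold:
    print("possible semantic drift: investigate recent serving changes")
```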
Tal Stavi’s Post
More Relevant Posts
-
Anthropic's Postmortem: 3 Infra Bugs That Broke Claude
Anthropic published a postmortem on three infra bugs that degraded Claude's quality (Aug–Sep).
The bugs:
1️⃣ Context window routing error – Sonnet 4 requests were misrouted after a load-balancing change.
2️⃣ Output corruption – a TPU misconfiguration broke a runtime optimization, producing garbled tokens.
3️⃣ XLA:TPU miscompilation – a latent compiler bug surfaced, distorting token selection.
Because the bugs overlapped, users saw different failures at different rates, which made diagnosis hard. The gaps exposed limits in their evaluation pipeline. Next steps: continuous production evals + faster privacy-safe debugging tools.
A good read for anyone in MLOps, SRE, or AI infra on the real challenges of keeping models reliable across GPUs and TPUs. https://lnkd.in/dWdG-nKQ
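As a rough illustration of what a "continuous production eval" can look like in its simplest form: run a small probe set through the production serving path on a schedule and alert when the pass rate drops. This is my own sketch, not Anthropic's pipeline; `generate` is a hypothetical stand-in for the production client, stubbed here so the snippet runs.

```python
import random

# Small fixed probe set with cheap, deterministic pass/fail checks.
PROBES = [
    ("What is 2 + 2? Answer with just the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def generate(prompt: str) -> str:
    """Hypothetical call into the production serving path; stubbed for the sketch."""
    return "4" if "2 + 2" in prompt else "Paris"

def eval_round(n: int = 200) -> float:
    """Run n probes through production and return the pass rate."""
    passed = 0
    for _ in range(n):
        prompt, expected = random.choice(PROBES)
        passed += expected in generate(prompt)
    return passed / n

# Run this on a schedule against the *production* path (not an offline replica),
# and alert when the pass rate drops below a rolling baseline.
if eval_round() < 0.97:
    print("quality regression suspected on the production serving path")
```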
-
Here's what's so crucial about evals and monitoring in AI: software and/or infrastructure changes can have a meaningful impact on the ACCURACY of the models, not just on latency or other performance metrics. I find this a bit mind-blowing, especially when you start to think about how complex it is for enterprises that decide to run their own models. One of the problems Anthropic had occurred after a runtime optimization was deployed, and the result was that some people saw Thai or Chinese words inserted into their English responses. This postmortem is really fascinating. It's a snapshot in time of how model providers are figuring out monitoring best practices as they go. #AIEvals
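For the specific symptom mentioned above (Thai or Chinese characters appearing in English responses), even a very simple monitor would have raised a flag. A toy sketch, assuming an English-only prompt set; the 5% threshold is an arbitrary placeholder.

```python
import unicodedata

def foreign_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are not Latin-script letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for c in letters if not unicodedata.name(c, "").startswith("LATIN"))
    return foreign / len(letters)

# Hypothetical usage inside a serving-side quality check on English traffic.
response = "The quick brown fox สวัสดี jumps over the lazy dog"
if foreign_script_ratio(response) > 0.05:
    print("alert: unexpected non-Latin characters in an English response")
```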
Yesterday we (Anthropic) published an engineering blog post which is a public discussion of some of the correctness/quality issues that have affected our models for the past several weeks. If you're curious about this topic, please do read the whole post.
For many years I have talked (https://lnkd.in/ep8fMRTu, for example, from back in 2022) about the way that infrastructure and software problems can manifest as quality problems in complex ML systems. Previously, most of my public examples were for training systems, with my favorite examples being ones about systematically biased skipping of training data. This set of failures shows how these problems can manifest in serving systems as well.
There are a bunch of big takeaway lessons here about testing and monitoring, and we're working on those, of course. But during this period our models were routinely showing no quality problems in benchmarking (full SWEBench repeated, for example) because the problems were intermittent and, in one case, caused by traffic routing in the production system. Detecting these problems in production continuously is very, very hard. We don't purposefully degrade models in production. But the serving software systems are continuously maintained and updated, and in this case we can see the ways that this can impact users.
I want to say publicly: it's been a rough summer for us, reliability-wise. Prior to this set of issues we had capacity and reliability problems throughout much of July and August (those are getting somewhat better now due to focused work by dozens of people). I'm very sorry for the problems, and we're working hard to bring you the best models at the highest level of quality and availability we can. https://lnkd.in/eDaRFujP
-
Incredibly insightful post. The deep dive on the bf16/fp32 floating point mismatch and how the latent XLA:TPU compiler bug was masking precision issues is a huge infrastructure lesson. Balancing performance (approximate top-k) with non-negotiable model quality is clearly the real scaling challenge. Thank you Todd Underwood for the detailed post!
-
Postmortem by Anthropic explaining why Claude produced some particularly erroneous results in August and early September and, in useful detail, what it took to find and correct these particular bugs. It puts some scope around the 'all responses should be regarded as unreliable' expectation.
-
A very candid share and a great source of learning on model-serving reliability. Feedback, especially well-structured feedback, is key, as correctly pointed out. What deserves more emphasis as a corrective action is the role of observational analytics, using LLMs themselves for causal inference. Imagine neural pathways forming and activating as certain failure signals begin to connect.
-
The transparency in sharing root causes and fixes is the gold standard for AI product reliability. Kudos to the Anthropic team! It's a fantastic read. Some core technical lessons stood out:
• Infra and software issues are quality problems in complex ML systems. Load-balancing issues or compiler flags can directly degrade model outputs, not just latency.
• Sampling matters: top-k, top-p, and temperature are not "set-and-forget". Approximate implementations or precision mismatches (bf16 vs fp32) can drop high-probability tokens, producing worse outputs (a toy illustration follows below).
• Optimization is not always improvement. Runtime or hardware optimizations must be tested rigorously; a small efficiency improvement can corrupt outputs if not validated end-to-end.
• User privacy is non-negotiable, but it complicates debugging. Limited access to production interactions means issues are harder to reproduce, so strong monitoring and synthetic benchmarks become critical.
• Consistency across environments is key. The same model should behave identically regardless of platform (GCP, AWS, Anthropic native), chip type (AWS Trainium, GPUs, TPUs), or server configuration.
👉 As ML systems scale, 'observability for quality' becomes critical. #MLEngineering #Reliability #Infrastructure
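The sampling bullet above is easy to demonstrate with a toy example (my own illustration, not Anthropic's code, using float16 as a stand-in for bf16): two logits that are distinct in float32 can round to the same value in a lower-precision format, so where the cast happens changes which token wins.

```python
import numpy as np

# Two near-tied logits plus some distractors.
logits_fp32 = np.array([10.1234, 10.1236, 3.0, 1.0], dtype=np.float32)

# Greedy selection in full precision picks token 1 (10.1236 > 10.1234) ...
print("argmax in fp32:", int(np.argmax(logits_fp32)))                     # -> 1

# ... but after casting, both leading logits round to 10.125 in float16,
# and the tie is broken toward token 0 instead.
print("argmax in fp16:", int(np.argmax(logits_fp32.astype(np.float16))))  # -> 0
```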
-
Anthropic shared a detailed breakdown of three recent production incidents while serving LLMs at massive scale (billions of requests). https://lnkd.in/gNBfBEMM
The issues highlight how difficult and unpredictable ML infrastructure can be when models are both stochastic and resource-intensive.
Key issues encountered:
1. Load-balancing misrouting.
2. Precision mismatch on TPUs.
3. An unexpected TPU bug that surfaced only after fixing (2), impacting the top-k sampling function.
Resolutions: (1) fix the routing logic, (2) apply a workaround, (3) roll back.
Impact and cost: time from initial report to resolution was over one month.
Lessons learned: More comprehensive testing is essential (a minimal consistency-check sketch follows below). Smoother and faster channels for user feedback are critical. Despite advances, production ML infra still relies heavily on experienced human engineers; intuition and judgment remain irreplaceable. AI systems can assist, but we are still far from fully automated operations.
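On the "more comprehensive testing" lesson, one concrete shape such a test can take is a consistency check between an approximate top-k routine and an exact reference. A sketch under assumptions: `approx_top_k` below is a hypothetical placeholder for whatever fused or approximate kernel the serving stack really uses.

```python
import numpy as np

def exact_top_k(logits: np.ndarray, k: int) -> set[int]:
    """Reference implementation: indices of the k largest logits."""
    return set(np.argsort(logits)[-k:].tolist())

def approx_top_k(logits: np.ndarray, k: int) -> set[int]:
    # Placeholder: a real implementation would be a bucketed/partial-sort kernel
    # running on the same device and dtype as production.
    return exact_top_k(logits, k)

def check_top_k_consistency(n_trials: int = 200, vocab: int = 32_000, k: int = 40) -> None:
    rng = np.random.default_rng(42)
    for _ in range(n_trials):
        logits = rng.gumbel(size=vocab).astype(np.float32)
        missing = exact_top_k(logits, k) - approx_top_k(logits, k)
        assert not missing, f"approximate top-k dropped tokens: {sorted(missing)}"

check_top_k_consistency()
```

Because these defects are silent in aggregate metrics, the check asserts on exact set membership rather than on downstream quality scores.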
-
A case study in transparency and root-cause analysis, stated plainly by Anthropic. Between August and early September, three infrastructure bugs affected Claude's response quality; at peak impact, 16% of Sonnet 4 requests were degraded. Anthropic's response demonstrates what accountability looks like in AI. The root causes were thoroughly identified:
• Routing errors sending requests to the wrong server configurations
• TPU misconfigurations causing random character corruption
• Compiler bugs affecting token generation during text output
Their transparency extends to sharing specific technical details and prevention measures:
• Enhanced evaluations to distinguish working vs broken implementations
• Continuous quality monitoring on production systems
• Improved debugging tools that preserve user privacy
As Anthropic stated: "We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone." This level of technical transparency and effective root-cause analysis sets the standard. Anthropic has consistently demonstrated this approach across its models and operations. Keep it up. https://lnkd.in/gtMndT4g
-
Anthropic has published a technical report on how they resolved three bugs that intermittently degraded Claude's responses. There is a lot to learn from a close read of the report, which covers areas like routing logic, output corruption, and a challenging XLA compiler bug. It's understandable that the public write-up has to stay at a high level; their internal report should cover more detailed questions, such as whether the routing issue was faulty service discovery, a stale routing table, or a load-balancer misconfig, and whether the output corruption was due to object serialisation or something else. The XLA compiler bug is less straightforward, since precision can change during bf16 <-> fp32 casting. No doubt their internal, deep SRE-level RCA captured all of this, and guardrails have been applied as corrective actions to avoid such failures in the future. #Anthropic #RCA #AI #ML #Claude #AiAgent https://lnkd.in/dVT3ssrS
Lead Application Engineer at Rakuten Pay
Good share. Approximate top-k defects in distributed systems are super silent.