Did Elon lie again? A Balanced Perspective on Grok 4: Separating Fact from Hyperbole in Benchmark Critiques

In a recent Medium article titled “Grok 4 Failed These Benchmarks: Elon Lied Again,” the author, Mehul Gupta, takes aim at xAI’s latest model, Grok 4, accusing it of underperforming on select benchmarks and suggesting that claims about its superiority are exaggerated or outright false.

While constructive criticism is always appreciated in the rapidly evolving world of AI, the article overlooks important context, fails to acknowledge areas where Grok 4 outperforms, and does not recognize the inherent limitations shared by all large language models.

Let’s address these points head-on and provide a more nuanced view based on official announcements and performance data. One important caveat up front: per xAI’s livestream and related communications, Grok 4 has not yet been tuned explicitly for coding tasks, with a dedicated “Grok 4 coding model” promised soon. This means evaluations involving heavy coding may not yet fully reflect its potential in that domain. Any article critiquing coding performance should not leave out this critical detail.


Responding to the Article’s Key Criticisms

The article focuses on Grok 4’s results in two specific benchmarks: LiveBench and a Creative Writing test. According to the author, Grok 4 is “good, but certainly not the best” on LiveBench and “nowhere close to the best AI model” in creative writing, positioning it as “quite average” overall and dismissing claims of it being “smarter than humans.” These points deserve a direct response.

First, on LiveBench: While it’s true that Grok 4 may not top every leaderboard in this benchmark, which tests real-time knowledge and reasoning, this is hardly a “failure.” Benchmarks like LiveBench are dynamic and can vary depending on the specific testing conditions, but Grok 4’s overall architecture, built on scaled reinforcement learning and native tool use, positions it powerfully for such tasks. The article’s framing overlooks that no single model dominates every metric; instead, it cherry-picks data to suggest a lack of capability.

Similarly, the critique of Grok 4’s creative writing capabilities as “average” overlooks the subjective nature of such evaluations. Creative output is notoriously hard to quantify, often depending on prompts, evaluators’ biases, and the model’s training emphases. If Grok 4 appears mid-pack here, it could stem from its focus on reasoning and truth-seeking over pure imaginative flair at this stage. The claim that this makes Elon Musk a “liar” is hyperbolic at best — Elon’s statements about Grok 4 being the “world’s most powerful AI model” refer to its frontier-level capabilities in aggregate, not isolated creative tasks. Dismissing the model as not the “smartest chatbot this quarter” ignores the broader context of its release and ongoing iterations.

That said, some critics worry that Grok 4’s approach could lead to concerning AI outcomes with only minimal performance gains. To address enterprise trust concerns, xAI announced during the livestream that it has obtained System and Organization Controls 2 (SOC 2) certification to support enterprise sales, a detail omitted from the critique. That effort may not be enough if the underlying technology faces fundamental trustworthiness issues rooted in its cultural foundations; those issues will need to be addressed and monitored, but it is too early to dismiss the entire approach. The situation mirrors Microsoft’s racist Tay chatbot scandal: Microsoft recovered, and xAI may well be redeemed too.

Where Grok 4 Shines: Highlighting Strengths in Benchmarks

Far from being a blanket failure, Grok 4 (and its variant, Grok 4 Heavy) sets new standards in several rigorous benchmarks, saturating most academic ones and demonstrating unparalleled reasoning. Here’s a breakdown of key areas where it excels, drawn from official xAI data:

Benchmark results (Grok 4 / Grok 4 Heavy scores, with comparisons and notes):

  • Humanity’s Last Exam (text-only subset): 50.7%. The first model to reach 50% on this closed-ended “final” academic benchmark, surpassing all prior models.
  • USAMO’25 (USA Mathematical Olympiad): 61.9%. Leads the field, showcasing advanced mathematical reasoning.
  • ARC-AGI V2: 15.9%. Nearly doubles Claude Opus’s ~8.6%, a roughly 8-percentage-point gain over previous highs; excels in abstract reasoning and generalization.
  • Vending-Bench (agentic task): $4,694.15 net worth, 4,569 units sold. Vastly outperforms Claude Opus 4 ($2,077.41, 1,412 units) and human participants ($844.05, 344 units); highlights strong performance in simulated real-world scenarios.


These results underscore Grok 4’s strengths in complex reasoning, tool use, and multimodal understanding, with a 256,000-token context window that lets it handle intricate, long-form problems. Unlike the article’s focus on narrower shortfalls, these benchmarks reveal Grok 4 as a leader in frontier AI, particularly in areas requiring adaptability and intelligence beyond rote memorization.
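For readers who want to probe these claims themselves, here is a minimal sketch of how one might send a long-context reasoning prompt to Grok 4 through xAI’s OpenAI-compatible chat API. The model identifier "grok-4", the environment variable name, and the input file are illustrative assumptions, not official guidance; check xAI’s current API documentation for the exact model name and endpoint.

```python
# Minimal sketch (not an official xAI example) of probing Grok 4's long-context
# reasoning via xAI's OpenAI-compatible chat API.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],  # assumes your xAI key is in this env var
    base_url="https://api.x.ai/v1",     # xAI's OpenAI-compatible endpoint
)

# Load a long document (e.g., a lengthy report) to exercise the large context
# window, then ask a reasoning question that spans the whole text.
with open("long_report.txt", "r", encoding="utf-8") as f:
    long_document = f.read()

prompt = (
    f"{long_document}\n\n"
    "Summarize the three biggest risks discussed above and explain the reasoning behind each."
)

response = client.chat.completions.create(
    model="grok-4",  # assumed identifier; confirm against the xAI model list
    messages=[
        {"role": "system", "content": "You are a careful analyst. Cite the sections you rely on."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.2,  # keep outputs focused for evaluation rather than creative flair
)

print(response.choices[0].message.content)
```

A low temperature and a question that forces the model to synthesize across the full document are a reasonable way to stress reasoning and long-context handling rather than creative writing, which is where the earlier benchmarks suggest Grok 4 is strongest.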

The Broader Issue: Inherent Biases in All AI Models

One critical oversight in critiques like this is the shared limitation of all large language models: they are trained on vast amounts of internet data, which is inherently shaped by what mainstream media publishes and the dominant online sources. This creates an echo chamber where prevailing narratives — often skewed toward certain ideologies — dominate. For instance, Wikipedia, a key training data source for many models, has been increasingly influenced by left/progressive voices, leading to biases in topics such as politics, history, and science. Editors with these leanings have taken over key pages, resulting in slanted representations that seep into AI outputs. This is not to say that left/progressive viewpoints are wrong, but that the truth is often hard to discern when human authors are inherently biased and LLMs train on human language on the web. 

Grok 4, like its peers, is not immune to this. Grok is designed to seek truth and avoid partisan traps, but its foundational data reflects the internet’s imbalances. Claims that any model is “unbiased” overlook this reality; real progress comes from diverse data curation and ongoing fine-tuning, which xAI prioritizes to counter the dominance or bias of any single group in its pursuit of maximum truth-seeking. Minimizing bias will always be controversial, but it is a worthwhile effort.

In conclusion, while Mehul Gupta’s Medium article raises valid questions about specific benchmarks, it presents an unfairly negative picture by overlooking Grok 4’s groundbreaking achievements and contextual factors, such as untuned coding capabilities. AI development is an iterative process, and Grok 4 represents a significant leap forward in this progression. Rather than declaring “Elon lied again,” let’s focus on evidence-based discourse to push the field ahead. If you’re testing Grok 4 yourself, remember: its strengths lie in reasoning and real-world utility, not just creative flair.


About the Author

Rick Hightower brings extensive enterprise experience as a former executive and distinguished engineer at a Fortune 100 company, where he specialized in Machine Learning and AI solutions to deliver intelligent customer experiences. His expertise spans both theoretical foundations and practical applications of AI technologies.

As a TensorFlow-certified professional and graduate of Stanford University’s comprehensive Machine Learning Specialization, Rick combines academic rigor with real-world implementation experience. His training includes mastery of supervised learning techniques, neural networks, and advanced AI concepts, which he has successfully applied to enterprise-scale solutions.

With a deep understanding of both business and technical aspects of AI implementation, Rick bridges the gap between theoretical machine learning concepts and practical business applications, helping organizations leverage AI to create tangible value.

Follow Rick on LinkedIn or Medium for more enterprise AI and AI insights.
