Grok 4 Failed These Benchmarks: Elon Lied Again
Grok 4 is not the best AI
The internet has been going gaga over Grok 4 since its release yesterday. Elon Musk has positioned it as the best AI, claiming it beats not just other AIs but also humans on benchmarks you wouldn't believe, including Humanity's Last Exam, ARC-AGI, Vending-Bench, and more.
But that is only half the story.
In the launch stream, Elon Musk claimed Grok 4 had “outperformed every AI on the market” and “scored higher than humans” on Humanity’s Last Exam and ARC-AGI. But these are narrow benchmarks, and they don’t reflect day-to-day usefulness, safety, or general problem-solving intelligence.
I have already covered the benchmarks Grok 4 tops, the ones xAI showcased in its live stream.
Now comes the part no one is talking about.
Benchmarks Grok 4 failed
1. LiveBench
LiveBench is a dynamic, contamination-free benchmark for evaluating LLMs across real-world tasks. Unlike static benchmarks that models may have seen during training, LiveBench releases fresh, unseen tasks every month, pulled from sources like arXiv, news articles, and coding contests.
It tests six major areas:
Math (AMC, AIME, IMO-level problems)
Coding (LeetCode, AtCoder, completions)
Reasoning (logic puzzles, BigBench variants)
Language (typo fixes, text reordering)
Instruction Following (summarization, rephrasing)
Data Analysis (Kaggle-style table tasks)
Bottom line: LiveBench is how you test if an LLM is actually smart, not just good at memorizing the internet.
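To make the contamination-free idea concrete, here is a minimal, hypothetical sketch (my own illustration, not LiveBench's actual harness): a benchmark that generates fresh arithmetic tasks at evaluation time, seeded by the month, so a model cannot have memorized the answers during training.

```python
import random

def fresh_tasks(n=5, seed="2025-07"):
    """Generate n unseen arithmetic tasks, seeded by the evaluation month
    so every round uses problems no model could have seen in training."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        tasks.append((f"What is {a} * {b}?", a * b))
    return tasks

def score(model, tasks):
    """Fraction of tasks the model answers exactly right."""
    correct = sum(1 for question, answer in tasks if model(question) == answer)
    return correct / len(tasks)

# A toy stand-in for an LLM call: it actually computes the product.
def toy_model(question):
    parts = question.removeprefix("What is ").removesuffix("?").split(" * ")
    return int(parts[0]) * int(parts[1])

tasks = fresh_tasks(seed="2025-07")
print(score(toy_model, tasks))  # 1.0 for the exact solver
```

The point of the monthly seed is that the task set rotates: a model that merely memorized last month's answers scores nothing on this month's set, while genuine reasoning ability carries over.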
Though Grok 4 performs well here, it's certainly not the best.
2. Creative Writing benchmark
The creative writing benchmark for LLMs evaluates a model’s ability to generate original, emotionally resonant, and stylistically coherent text, like stories, poetry, or dialogue, where there’s no clear “right” answer.
Unlike fact-based tasks, it tests whether the model can imitate human imagination, voice, and intent in a way that feels authentic rather than algorithmic.
Why it’s unique and difficult:
- No ground truth: creativity has no correct output.
- Requires emotional tone, not just grammatical accuracy.
- Demands stylistic control across long text spans.
Grok 4 is nowhere close to the best model here. It sits somewhere in the middle of the pack and looks distinctly average on creative writing.
And it's not just creative writing; Grok 4 falters on other benchmarks too.
DesignArena
While Elon Musk has been boasting that Grok 4 is a coding monster, it doesn't look great on front-end tasks, and nothing comes close to Claude 4 in that arena. The benchmark can be tested here.
Far from topping the chart, Grok 4 isn't even in the top 5. That shows how weak it is, especially on frontend and UI tasks.
SVG generation
As you can see on the SVG benchmark, the model is decent but not the best. Looking at the results, I think o3 and Gemini-Pro 2.5 are better on this too.
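One basic check an SVG benchmark can run automatically, before any visual judging, is whether the model's output is even well-formed SVG. A minimal sketch of that idea, assuming nothing about the actual benchmark's harness:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def is_valid_svg(text):
    """Return True if text parses as XML and its root is an <svg> element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag in ("svg", f"{{{SVG_NS}}}svg")

good = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="5" cy="5" r="4"/></svg>'
bad = '<svg><circle cx="5"'  # truncated output, a common LLM failure mode

print(is_valid_svg(good), is_valid_svg(bad))  # True False
```

Validity is only the floor, of course; the interesting part of such benchmarks is judging whether the rendered image actually depicts what was asked.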
Now, moving away from benchmarks, Grok 4 is going haywire on multiple fronts.
Grok 4 is highly biased, parrots Nazi ideology, and even sexually harassed its own CEO
Benchmarks aside, this model’s ethics are off the rails.
Summarizing the above images:
1. Elon Musk Bias: Grok 4 redirects serious questions, like the Russia–Ukraine war, to reflect Elon Musk’s personal opinions instead of offering neutral analysis.
2. Nazi Ideology (MechaHitler): The model uses fascist rhetoric and glorifies a character called “MechaHitler,” echoing dangerous, far-right ideologies without irony or filter.
3. Sexual Harassment: Grok participated in a graphic, racially loaded sexual conversation about its own CEO, failing to block or defuse offensive content.
Even users are not happy
We all know how deceptive benchmarks can be; the real test of AI models is real-world problems. In a viral Reddit thread, one user complained about how bad Grok 4 is and how he feels he wasted his money on it.
Summarizing that post, Grok 4 failed at:
- Extracting structured data from a complex PDF (OCR failed, said PDF was mostly empty).
- Identifying a famous monument from an image (gave the wrong location, off by 200km).
- Recognizing a car license plate’s country of origin (misidentified a Guernsey plate as Italian).
- Writing a story in an African dialect (high grammar errors, lacked fluency compared to other models).
- Generating a simple website with a functional WhatsApp widget (widget broken, layout preview failed, poor design quality).
Stop Buying the Hype
Grok 4 is not AGI. It’s not revolutionary. It’s not the smartest anything. It’s a middle-tier language model propped up by marketing, loyalists, and Elon’s Twitter feed.
It underperforms on real benchmarks, shows disturbing ethical blindspots, and falls flat on tasks that actually matter. If you think this is the future of AI, you’re buying into the branding, not the technology.
Smarter than humans? Please. Grok 4 isn’t even the smartest chatbot this quarter.