You ask an advanced model to count the B’s in Blueberry. It hesitates, it bluffs, or it’s just wrong. How can something that drafts legalese and writes Python fail a primary-school puzzle?
Because large language models (LLMs) don’t “run rules.” They predict text.
They’re astonishing at shaping language; they’re not built to count letters—unless they’ve learned a reliable pattern for doing so in context. That simple difference unlocks most of the mystery (and the hype) around LLMs.
The short version
- Traditional programs: You write explicit step-by-step rules. Same input → same output.
- LLMs: We train a neural network on massive text to estimate “What token likely comes next?” Outputs are probabilistic, guided by statistics, not by a hardcoded algorithm.
This shift—sometimes called “Software 2.0”—moves power from rules to data + optimisation. When we scale data, compute, and model size (hello, Transformers), we get capabilities that look like reasoning and understanding, but are really pattern mastery at an industrial scale.
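To make the contrast concrete, here is a toy sketch in Python. The probability numbers are invented purely for illustration (they come from no real model): the rule-based function gives the same answer every time, while the "LLM-style" function samples from a distribution and can occasionally be wrong.

```python
import random

# Traditional program: an explicit rule. Same input, same output, every time.
def count_b(word: str) -> int:
    return word.lower().count("b")

# LLM-style answer (toy): sample the next token from a learned distribution.
# These probabilities are made up purely for illustration.
next_token_probs = {"2": 0.6, "3": 0.3, "1": 0.1}

def predict_answer() -> str:
    tokens, weights = zip(*next_token_probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(count_b("Blueberry"))  # always 2
print(predict_answer())      # usually "2", sometimes not
```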
How an LLM actually works (in plain English)
- Tokenisation: Text gets chopped into tiny pieces called tokens (often sub-words).
- Embeddings: Each token becomes a vector—a point in a high-dimensional space where “nearby” often means “related.”
- Attention: The model learns what to focus on in the context (“pay more attention to this clause, ignore that aside”).
- Training objective: Minimise the surprise of the next token across billions of examples (maximum likelihood).
- Decoding: At run time, we sample the next token. Controls like temperature, top-k, and top-p decide how adventurous or conservative the model is. (A toy sampler is sketched just below.)
Think of it as a statistical autocomplete on steroids, running through a map of vector spaces where meaning is approximated by geometry.
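Here is that toy sampler: it shows what temperature, top-k, and top-p actually do to a next-token distribution before we pick a token. The logits are invented for illustration; a real model scores tens of thousands of candidate tokens, but the mechanics are the same.

```python
import math
import random

# Invented logits for a few candidate next tokens (not from a real model).
logits = {" blue": 2.1, " straw": 1.4, " rasp": 0.9, " goose": 0.2, " cran": -0.5}

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature rescales logits: below 1.0 sharpens the distribution, above 1.0 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax turns scores into probabilities.
    max_l = max(scaled.values())
    exps = {tok: math.exp(v - max_l) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}

    # top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # top-p (nucleus): keep the smallest set of tokens whose probability mass reaches p.
    if top_p is not None:
        kept, running = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            running += p
            if running >= top_p:
                break
        ranked = kept

    # Renormalise over the survivors and sample one token.
    toks, weights = zip(*ranked)
    return random.choices(toks, weights=weights, k=1)[0]

print(sample_next(logits, temperature=0.3))             # conservative: almost always " blue"
print(sample_next(logits, temperature=1.5, top_p=0.9))  # adventurous: far more variety
```

Lower the temperature and the top choice dominates; raise it (or widen top-p) and the tail tokens get real probability mass.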
[Figure: Tokenisation, how LLMs chop text into subword pieces]
So…why the “Blueberry” face-plant?
Letter-counting is a symbolic problem with an exact algorithm; LLMs are statistical engines. They can simulate counting with patterns they’ve seen (“count characters one by one”), but they don’t execute a guaranteed algorithm internally. Unless the prompt or toolchain forces a step-by-step, verifiable method (e.g., having the model write and run code), errors happen, especially with quirky casing, repeated letters, or tokenisation edge cases.
This is the same reason they sometimes fumble arithmetic, dates, or long chains of logic: the core objective is plausible continuation, not correctness.
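A quick illustration of both halves of this, assuming the tiktoken library is installed (the exact token split varies by tokenizer and model): the model sees subword chunks rather than letters, while one line of ordinary code counts the letters with a guarantee.

```python
import tiktoken  # assumes the tiktoken library is installed; splits vary by tokenizer

enc = tiktoken.get_encoding("cl100k_base")
word = "Blueberry"

# The model never sees individual letters; it sees subword tokens like these.
token_ids = enc.encode(word)
print([enc.decode([t]) for t in token_ids])  # e.g. ['Blue', 'berry'] (split may differ)

# The symbolic fix: a one-line, guaranteed-correct algorithm.
print(word.lower().count("b"))  # 2
```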
Six myths I hear every week (and the reality)
- “LLMs know facts and think like humans.”
- They don’t know; they’ve learned patterns of how facts are written. Fluent ≠ true.
- “They can count perfectly.”
- Not reliably. They weren’t designed as symbolic calculators. They can imitate counting; that’s different from doing it with guarantees.
- “Their goal is to be correct.”
- The base goal is to predict the next token. We align them (instruction-tuning, RLHF) to be helpful and honest, but the underlying objective doesn’t change.
- “They’re bad at math because they’re not smart.”
- They’re just the wrong tool. Give them a calculator or a Python runtime and watch accuracy jump. Tools add symbolic rigour to a statistical brain.
- “They understand the world like we do.”
- No senses, no experience, no consciousness—just text associations. That’s powerful, but different from human understanding.
- “They keep learning after deployment.”
- Not by default. Weights are frozen until retrained/fine-tuned. Some products add memory or retrieval, but that’s system design, not magical self-learning.
Hallucinations, bias, and other “gotchas” (the candid section)
- Hallucinations: Confident, detailed wrongness. When the context is thin or the question nudges beyond training data, the model still must output something, so it “fills in” with statistically plausible but ungrounded text.
- Mitigate with: retrieval (ground answers in sources), asking for citations, letting the model say “I don’t know,” or routing to tools. A minimal retrieval sketch follows this list.
- Bias: Models inherit patterns—including undesirable ones—from data.
- Mitigate with: careful data curation, post-training filters, evaluation on fairness/toxicity benchmarks, and human oversight for high-stakes uses.
- Temperature & randomness:
- Higher temperature → more creative, more error-prone. Lower temperature → steadier, sometimes bland. Tune it to the job.
- Knowledge cutoffs:
- A model’s internal knowledge is frozen at its last training date. For “What happened today?”, you need search or an external data connector.
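Here is a minimal sketch of the retrieval idea mentioned above. The documents and the keyword-overlap scoring are stand-ins for illustration; production systems use embedding similarity and a vector index, but the grounding principle is identical: fetch relevant sources first, then constrain the model to them.

```python
# Toy retrieval-augmented prompt: ground the answer in your own sources.
documents = {
    "pricing.md": "The Pro plan costs $49 per seat per month, billed annually.",
    "security.md": "All customer data is encrypted at rest with AES-256.",
    "roadmap.md": "Dark mode is planned for the Q3 release.",
}

def retrieve(question: str, docs: dict, k: int = 1) -> list[str]:
    # Naive relevance score: how many question words appear in each document.
    q_words = set(question.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

question = "How much does the Pro plan cost?"
context = "\n".join(retrieve(question, documents))

prompt = (
    "Answer using ONLY the sources below. If the answer is not there, say 'I don't know.'\n"
    f"Sources:\n{context}\n\nQuestion: {question}"
)
print(prompt)
```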
How to get reliably better answers
- Be specific. Give constraints, formats, and success criteria (“Return a table with columns X/Y/Z”). Several of these tips come together in the sketch after this list.
- Make it think in steps. Ask for reasoning or an outline before the final answer.
- Ask for sources. Or better, attach them and say, “Only cite from these.”
- Use tools. Let the model call search, code, calculators, or company systems.
- Control randomness. Lower the temperature for accuracy; raise it for brainstorming.
- Validate. For facts, require citations. For numbers, have it compute (or compute externally). For long tasks, break them into verifiable chunks.
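Here is a sketch that combines several of these tips in one request, assuming the openai Python SDK with an API key in the environment; the model name is a placeholder and the prompt wording is just one way to phrase the constraints.

```python
from openai import OpenAI  # assumes the openai Python SDK and an API key in the environment

client = OpenAI()

# Be specific (format + success criteria), ask for grounded citations, keep temperature low.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; use whatever your account offers
    temperature=0.2,      # low temperature: accuracy over creativity
    messages=[
        {
            "role": "system",
            "content": (
                "Answer only from the provided sources. Cite the source name for every claim. "
                "If the sources do not cover the question, say 'I don't know.'"
            ),
        },
        {
            "role": "user",
            "content": (
                "Sources:\n[pasted excerpts here]\n\n"
                "Question: Summarise our refund policy as a table with columns "
                "Scenario / Eligible? / Time limit."
            ),
        },
    ],
)
print(response.choices[0].message.content)
```

The same request at a higher temperature is a better fit for brainstorming, where variety matters more than precision.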
Where LLMs do shine (and where to pair them with tools)
- Language-shaped work: drafting, summarising, translating, rewriting, brainstorming.
- Knowledge navigation: Q&A over docs—with retrieval for grounding.
- Coding assistance: generate scaffolds, tests, and refactors; execute to verify.
- Customer & market insight: cluster themes from interviews, tickets, or reviews; then a human validates takeaways.
- Ops copilots: turn plain-English intent into workflows—provided a system or a person-in-the-loop checks each step.
When the task demands truth, math, or compliance, couple the model with retrieval, rules, or execution environments. That’s the winning pattern: LLM + Tools + Guardrails.
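As a final sketch of that pattern, here is a hypothetical guardrail: the model is asked to return JSON, and the surrounding code, not the model, checks the structure and recomputes the number. The schema and the reply are invented for illustration.

```python
import json

# Guardrail sketch: never trust free-form model output for numbers or structure.
# 'model_reply' stands in for whatever the LLM returned; the schema is hypothetical.
model_reply = '{"line_items": [120.50, 89.99, 15.00], "claimed_total": 225.49}'

def validate(reply: str) -> dict:
    data = json.loads(reply)  # guardrail 1: must be well-formed JSON

    # Guardrail 2: schema check for required fields.
    missing = [k for k in ("line_items", "claimed_total") if k not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")

    # Guardrail 3: recompute the number symbolically instead of trusting the model.
    recomputed = round(sum(data["line_items"]), 2)
    if recomputed != data["claimed_total"]:
        raise ValueError(
            f"Total mismatch: model said {data['claimed_total']}, math says {recomputed}"
        )

    return data

print(validate(model_reply))
```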
A quick mental model to take with you
LLM = probabilistic text engine that becomes dependable when you add structure (clear prompts), add grounding (your data), and add execution (tools & checks).
If you remember that, you won’t be surprised when it flubs the B’s in Blueberry—and you’ll know exactly how to make it perform when it really matters.