Peeking Inside the AI's Mind

What Attribution Graphs Reveal About Machine Thinking

Every morning, my dog Galway tilts her head watching me make coffee.

Her deep brown eyes follow my movements with curiosity, and I often wonder what's happening behind them.

What neural pathways light up as she anticipates the sound of the grinder or recognizes the scent of freshly roasted beans?

Understanding how another mind thinks, even one as familiar as my dog's, is surprisingly difficult.

We can observe inputs (the sound of the grinder) and outputs (her excited tail wag), but the processing in between remains a mystery.

Your brain does a staggering amount of invisible processing every day. An AI, meanwhile, can write an entire essay in seconds.

How does this actually work?

Until recently, AI systems were black boxes. We could see what went in and what came out, but had no visibility into the reasoning process in between.

The Black Box Problem

My first encounter with a proper espresso machine was at D2, the pub in Ireland where I pulled pints during my internship.

The owner had installed this imposing Italian beast with copper pipes and pressure gauges—a contraption that looked more suited to a Victorian laboratory than our humble bar.

I would fiddle with its knobs and levers, somehow producing the perfect cappuccino—ideal temperature, exquisite crema, velvety milk foam.

Yet whenever it malfunctioned, I'd stand there utterly bewildered. The machine worked brilliantly when I followed the steps, but I had absolutely no idea how it achieved such magic.

Did it truly understand the principles of coffee extraction, milk texturing, and flavor balance, or did it just follow preset patterns and execute predefined steps when I turned the right valves?

This familiar mystery mirrors the problem we've had with large language models (LLMs) like Claude, GPT, and others.

We provide a prompt (input) and receive a response (output), but the process in between, how the model actually "thinks" through the task, has remained as inscrutable as the inner workings of that copper-plated espresso maker.

Is the AI truly understanding the principles of language, reasoning through problems step-by-step, or just mimicking patterns it has seen before?

Without seeing inside, we could only guess. At least until now.


The X-Ray for AI Thoughts

This is where attribution graphs come in.

They're essentially MRI scans for AI, showing which "neural pathways" activate when the AI thinks.

Developed by researchers at Anthropic, these visualization tools reveal the internal chains of reasoning that transform a prompt into a response.

An attribution graph traces the connections between different "features" (similar to concepts) inside the model, showing us which ones influence the final output and how they connect to each other.

Just as a detective connects evidence on a board with red strings, attribution graphs map out the reasoning path the AI follows from question to answer.

What the Science Reveals


When researchers examined these attribution graphs, they found something remarkable: large language models don't just match patterns—they perform multi-step reasoning that sometimes parallels human thought processes.

For example, when asked "What's the capital of the state containing Dallas?", the model doesn't just mechanically spit out "Austin." Instead, the attribution graph reveals a clear chain of reasoning:

It identifies "Dallas" as being in "Texas"

It recognizes "Texas" as a state

It connects "Texas" with its capital "Austin"
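To make that chain concrete, here is a minimal sketch in Python of how such a path can be represented as a directed, weighted graph of features. The feature names and weights are invented for illustration; they are not taken from Claude or from Anthropic's actual tooling.

```python
# Toy attribution graph for "What's the capital of the state containing Dallas?"
# Feature names and weights are invented for illustration only.

attribution_graph = {
    # (source feature, target feature): attribution weight
    ("token: Dallas", "concept: Texas"): 0.82,
    ("concept: Texas", "concept: state capital"): 0.64,
    ("concept: state capital", "output: Austin"): 0.91,
}

def trace_path(graph, start, end):
    """Follow the strongest outgoing edge from `start` until `end` (or a dead end)."""
    path, current = [start], start
    while current != end:
        # All edges leaving the current feature, keyed by their target.
        candidates = {tgt: w for (src, tgt), w in graph.items() if src == current}
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        path.append(current)
    return path

print(" -> ".join(trace_path(attribution_graph, "token: Dallas", "output: Austin")))
# token: Dallas -> concept: Texas -> concept: state capital -> output: Austin
```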

Researchers can even manipulate these thought paths.

By suppressing the "Texas" feature and injecting "California," they can change the output from "Austin" to "Sacramento," essentially redirecting the AI's chain of thought in predictable ways.
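In spirit, that kind of intervention looks like the sketch below: silence one feature's activation, boost another, and let the downstream step produce a different answer. The activations, the capital lookup table, and the function names are all invented for illustration; real interventions act on learned features inside the network, not on a hand-written dictionary.

```python
# Conceptual sketch of a feature intervention; everything here is hand-written
# for illustration and is not Anthropic's actual method or API.

activations = {"Texas": 0.9, "California": 0.0}
capitals = {"Texas": "Austin", "California": "Sacramento"}

def intervene(acts, suppress, inject, strength=0.9):
    """Return a copy of the activations with one feature silenced and another injected."""
    edited = dict(acts)
    edited[suppress] = 0.0       # switch the original state feature off
    edited[inject] = strength    # switch the substitute state feature on
    return edited

def predict_capital(acts):
    """Answer with the capital of whichever state feature is most active."""
    state = max(acts, key=acts.get)
    return capitals[state]

print(predict_capital(activations))                                    # Austin
print(predict_capital(intervene(activations, "Texas", "California")))  # Sacramento
```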

The Planning Poet

Even more surprising is what happens when these models write poetry.

Contrary to the assumption that AI works purely token-by-token with no foresight, attribution graphs reveal that these systems actually plan ahead.

When writing a rhyming couplet ending with "rabbit," the model activates rhyming features that consider words like "habit" before even beginning to write the line.

This is a form of advance planning that, until recently, many assumed these models simply couldn't do.

It's as if the espresso machine didn't just make a perfect cappuccino but anticipated that you might want a biscotti on the side and started preparing it before you even asked. All my dreams could finally become reality.
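For a more concrete picture of that planning step, here is a toy sketch: the rhyme target is chosen before the line is written, and the line is then composed toward it. The candidate words and scores are invented; a real model holds this in internal features, not a Python dictionary.

```python
# Toy illustration of planning ahead: pick the ending word first,
# then write the line toward it. Candidates and scores are invented.

rhyme_candidates = {"habit": 0.6, "rabbit": 0.9}   # active before the line begins

planned_ending = max(rhyme_candidates, key=rhyme_candidates.get)
line = f"...and out of the garden hopped a {planned_ending}"

print(planned_ending)  # rabbit  (chosen first)
print(line)            # the rest of the line is built around that choice
```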


Human vs. Machine Cognition

There are fascinating parallels between AI "thinking" and human cognition, but the two also diverge in crucial ways:

🍄 Embodied vs. Disembodied Knowledge: My dog Galway understands "walk" because she's physically experienced walks—the excitement, the smells, the feeling of grass under her paws.

An LLM has never felt gravity, seen a sunset with its own "eyes," or experienced physical sensations. Its understanding of "walking" comes exclusively from textual descriptions.

🍄 Network Structure: Both are massive, interconnected networks processing information through distributed activity rather than in a single central location.

🍄 Pattern Recognition: Both excel at identifying complex patterns, though they learn these patterns through very different experiences.

🍄 Planning and Abstraction: Both can plan ahead and work with abstract concepts, but humans do this with the benefit of embodied experience while AI works solely with statistical patterns.

This disembodied nature of AI cognition creates both limitations and unique capabilities that differ from human (or canine) thinking.


Not Quite a Brain

It's important to note that despite these similarities, attribution graphs aren't perfect windows into AI cognition:

They simplify extremely complex processes

The features they identify don't map perfectly to human concepts

They can't capture all the nuances of the model's operations

Additionally, making too direct a comparison to human thinking risks anthropomorphizing these systems in misleading ways. While we might see processes that resemble planning or reasoning, they emerge from fundamentally different mechanisms than human thought.


Why This Matters

Understanding these attribution graphs has profound practical implications beyond academic curiosity:

Debugging AI Behavior: When models make mistakes, attribution graphs help pinpoint precisely where their reasoning went wrong.

Improving Safety: By identifying how models decide whether to refuse harmful requests, we can strengthen these mechanisms.

Enhancing Trust: As AI systems become integrated into critical decisions, understanding their reasoning builds confidence in their outputs.

Scientific Discovery: Analyzing how AI connects concepts can uncover new perspectives that human researchers might overlook.

Manipulating AI Thought: Researchers can now conduct experiments by directly altering internal features to change outputs predictably.

From Black Box to Glass Box

Returning to my morning coffee ritual, with Galway watching attentively: I may never know exactly what connections form in her mind as she recognizes the sound of the coffee grinder and anticipates the morning walk that follows.

But with attribution graphs, the mysterious black box of AI is gradually becoming transparent.

Just as an X-ray transformed medicine by allowing doctors to see inside the body, attribution graphs let us see inside the AI "mind," turning these systems from inscrutable oracles into transparent thinking partners.

While my dog Galway's thought processes might remain a mystery, the internal workings of our AI assistants are becoming clearer every day.

And that clarity opens up whole new horizons for human-AI collaboration.

The Generous Transparency

We've spent decades treating AI as mystical, unexplainable wizardry.

That ends now.

Attribution graphs don't just reveal how machines think.

They expose how we've been thinking about machines. We've accepted black boxes because we've forgotten that transparency isn't just possible, it's necessary.

The real magic happens not when systems become more complex, but when complexity becomes legible.

What if every AI system came with a "show me your work" button?

The most valuable AI won't be the most intelligent. It will be the one that brings you along for the ride, that trades mystery for clarity, that treats you as a partner rather than a passenger.

This isn't just about technology. It's about choice.

The choice to demand systems we can understand, rather than being asked to simply trust.

The attribution graph isn't the end of AI development. It's just the beginning of AI accountability.

This level of transparency could transform how we collaborate with AI systems, especially in high-stakes fields like medicine, law, and scientific research.

P.S. For those who want to explore deeper, Anthropic's research video shows these graphs in action: Understanding AI Reasoning
