What a Maze-Trotting Robot Can Teach You About the Bellman Equation
Imagine you're a robot lost in a maze.
You start moving randomly. Every time you hit a green square, you get a +1 reward. Red squares? They hurt: they cost you −1. All the other squares give you nothing.
At first, you're clueless. You stumble into walls, repeat mistakes, and sometimes, by pure chance, hit a green square.
Eventually, you realize something:
“Taking this action from that square led to something good!”
This is the foundation of Reinforcement Learning—an agent (like you, the robot) learning through trial and error, gradually figuring out which states (positions in the maze) are valuable.
Step 1: Learning Values
In our robot-maze story, each square or state has a hidden value: a number that tells the robot how good it is to be there.
Green square? High value.
Red square? Avoid it.
Empty squares? Neutral, until you learn which ones lead to rewards.
The robot doesn’t know the values right away. It learns them through experience. Every time it gets a reward, it updates its understanding of the states that led to it.
This is where the concept of a value function comes in:
It assigns a number to every state, representing the expected total reward from that point onward.
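In code, you can picture a value function as nothing more than a lookup table from states to numbers. Here is a minimal sketch, assuming a toy 3x3 maze layout; the numbers are illustrative of what the robot might eventually learn with γ = 0.9, not anything computed in this article:

```python
# A value function as a plain lookup table: (row, col) position -> value.
# Hypothetical 3x3 maze; the green square sits at (0, 2), the red one at (1, 2).
# Numbers are illustrative: squares closer to green carry higher values.
value_of = {
    (0, 0): 0.90, (0, 1): 1.00,   # (0, 2) is the green square
    (1, 0): 0.81, (1, 1): 0.90,   # (1, 2) is the red square
    (2, 0): 0.73, (2, 1): 0.81, (2, 2): 0.73,
}

# "How good is it to be standing here?" becomes a simple lookup.
print(value_of[(1, 1)])   # 0.9
```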
But here’s the catch: how does the robot update these values?
Enter the Bellman Equation
The robot starts thinking:
“If I’m standing in this square, what’s the best reward I can hope to get?”
And that’s where the Bellman Equation steps in. It’s like a mathematical crystal ball that helps the robot estimate future rewards.
Let’s understand the idea before we touch the formula.
The value of a state isn’t just based on what you get now—it includes:
The immediate reward from taking an action, plus
The value of the next state you’ll land in.
But since the future is uncertain, we use something called a discount factor (γ) to reduce the importance of far-away rewards.
Bellman Equation (the actual formula):
V(s) = maxₐ [ R + γ * V(s') ]
Let’s break it down:
V(s): Value of the current state
a: Possible action the agent can take
R: Immediate reward from taking action a
s’: Next state the agent ends up in
V(s’): Value of that next state
γ (gamma): Discount factor between 0 and 1 that tells us how much we care about future rewards
The agent considers every possible action, evaluates where each one would lead, adds the immediate reward to the discounted value of that next state, and picks the action with the highest total. (In a stochastic environment you take an expectation over the possible next states instead of assuming a single s', but the intuition is the same.)
Refer to this article for a worked calculation: https://medium.com/analytics-vidhya/reinforcement-learning-4dcd139f82bc
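If you want to see the update running in code, here is a minimal Python sketch (not from the article above) that applies the Bellman update repeatedly, i.e. value iteration, on a hypothetical 3x3 maze: deterministic moves, +1 for stepping onto the green square, −1 for the red one, the episode ending on either, and γ = 0.9.

```python
# Minimal sketch (illustrative, not the article's code): value iteration on a
# hypothetical 3x3 maze. Moves are deterministic, stepping onto the green square
# pays +1, the red square -1, everything else 0, and the episode ends on either.
GAMMA = 0.9
ROWS, COLS = 3, 3
GREEN, RED = (0, 2), (1, 2)                     # assumed layout of terminal squares
ENTRY_REWARD = {GREEN: 1.0, RED: -1.0}          # R for an action that lands here
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic transition; bumping into a wall leaves you where you are."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state

# Start with V(s) = 0 everywhere, then sweep the Bellman update until it stabilizes.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(100):
    new_V = {}
    for s in V:
        if s in ENTRY_REWARD:
            new_V[s] = 0.0                      # episode ends here: no future value
            continue
        # V(s) = max over actions a of [ R + gamma * V(s') ]
        new_V[s] = max(
            ENTRY_REWARD.get(step(s, a), 0.0) + GAMMA * V[step(s, a)]
            for a in ACTIONS
        )
    if max(abs(new_V[s] - V[s]) for s in V) < 1e-9:
        V = new_V
        break
    V = new_V

# Squares closer to the green goal end up with higher values.
for r in range(ROWS):
    cells = []
    for c in range(COLS):
        if (r, c) == GREEN:
            cells.append("  G  ")
        elif (r, c) == RED:
            cells.append("  R  ")
        else:
            cells.append(f"{V[(r, c)]:+.2f}")
    print(" ".join(cells))
```

The printed grid shows values rising as squares get closer to the green goal, which is exactly the "gradient" the robot follows once the values have converged.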
A Real-World Analogy
Let’s say you’re a product manager deciding between two projects:
Project A: Launches fast, gives a small boost in KPIs immediately.
Project B: Takes longer, but has the potential for compounding value in future quarters.
If you only look at immediate benefits, you’ll always choose Project A. But if you apply the Bellman logic:
Total Value = Immediate Impact + (γ * Long-Term Impact)
Suddenly, Project B might make more sense—even if it’s slower—because you’re valuing the future state.
This is exactly how intelligent agents (and smart PMs!) make better decisions.
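To make the Project A vs Project B comparison concrete, here is the same back-of-the-envelope math as a tiny Python sketch. The impact numbers and the γ of 0.8 are invented purely for illustration:

```python
# Back-of-the-envelope version of the Project A vs Project B comparison.
# The impact numbers and gamma are made up purely for illustration.
GAMMA = 0.8                                      # how much we value future quarters

project_a = {"immediate": 10, "long_term": 2}    # quick win, little compounding
project_b = {"immediate": 3,  "long_term": 15}   # slow start, bigger future payoff

def total_value(project, gamma=GAMMA):
    # Total Value = Immediate Impact + gamma * Long-Term Impact
    return project["immediate"] + gamma * project["long_term"]

print(total_value(project_a))   # 10 + 0.8 * 2  = 11.6
print(total_value(project_b))   # 3  + 0.8 * 15 = 15.0
```

Lower γ toward 0 (caring only about this quarter) and Project A wins again; raise it and Project B pulls ahead. That trade-off is exactly what the discount factor encodes.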
Bellman in Action: Behind the Scenes of AI
The Bellman equation powers decision-making in:
Self-driving cars: “What route minimizes risk and time?”
Game AIs (like AlphaGo): “What move gives me the best long-term chance of winning?”
Recommendation systems: “Which product should I show now to maximize future conversions?”
Even finance models and robotics use it to make optimal decisions under uncertainty.
Key Takeaways
The Bellman equation is how agents estimate the true value of being in a state.
It balances short-term rewards with long-term gains using a discount factor.
It’s recursive—each state’s value depends on the values of other future states.
It’s the foundation for value iteration, Q-learning, and modern reinforcement learning.
PMs, Why Should You Care?
If you’re working on:
Products with ML/AI-powered decision-making
Dynamic systems that learn from user behavior
Long-term optimization strategies (retention, LTV, etc.)
Understanding the Bellman equation can help you:
Ask smarter questions in AI conversations
Design better feedback loops
Think like a machine—and make data-backed decisions