Paper Review: Visual Planning: Let’s Think Only with Images

Paper

Code

Visual Planning is a new approach in which reasoning and planning are carried out through sequences of images rather than text, particularly for tasks involving spatial or geometric reasoning. The authors argue that language is not always the most natural or effective medium for such tasks. They propose Visual Planning via Reinforcement Learning (VPRL), which uses Group Relative Policy Optimization (GRPO) to fine-tune large vision models, and show that it outperforms traditional text-based reasoning methods on visual navigation tasks.

The approach

Most previous visual reasoning methods convert visual input into text (object names or relations) and then perform the reasoning with language models. In contrast, the visual planning paradigm keeps reasoning entirely within the visual domain: the model generates a sequence of images (a visual trajectory) that represents the step-by-step planning process. The trajectory is produced autoregressively, with each image conditioned on the initial input image and the previously generated images.
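
A minimal sketch of this autoregressive rollout, assuming a placeholder `next_state` callable that stands in for the large vision model (the real system operates on visual tokens rather than raw images):

```python
from typing import Any, Callable, List

Image = Any  # placeholder for whatever visual-state representation the model uses


def rollout(next_state: Callable[[List[Image]], Image],
            initial_image: Image,
            max_steps: int) -> List[Image]:
    """Autoregressively generate a visual trajectory v_1, ..., v_n.

    `next_state` stands in for the large vision model: it maps the prefix
    (initial image plus previously generated images) to the next visual state.
    """
    trajectory = [initial_image]
    for _ in range(max_steps):
        trajectory.append(next_state(trajectory))  # condition on the full prefix
    return trajectory
```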

Reinforcement Learning for Large Vision Models

Stage 1: Policy Initialization. The model is first trained with supervised learning on random-walk trajectories through the environment, which consist of sequences of visual states. From each trajectory, the model learns to predict the next visual state given a prefix; when several next steps are possible, a single candidate is sampled at random as the target. This stage ensures the model can generate visually coherent sequences and serves as a warm-up for RL.
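
A rough sketch of this Stage-1 objective, assuming the visual states have already been tokenized into discrete visual tokens and that `model` exposes a HuggingFace-style causal interface returning `.logits` (both are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F


def policy_init_loss(model, prefix_tokens, next_state_tokens):
    """Stage-1 supervised objective (sketch): given a trajectory prefix,
    maximize the likelihood of the visual tokens of the sampled next state.

    prefix_tokens:     (batch, prefix_len)  visual tokens of v_0..v_t
    next_state_tokens: (batch, target_len)  visual tokens of the randomly
                                            sampled valid next state v_{t+1}
    """
    inputs = torch.cat([prefix_tokens, next_state_tokens], dim=1)
    logits = model(inputs).logits                # (batch, seq_len, vocab)
    # Only the positions that generate the next state contribute to the loss.
    target_len = next_state_tokens.size(1)
    pred = logits[:, -target_len - 1:-1, :]      # shift by one for next-token prediction
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           next_state_tokens.reshape(-1))
```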

Stage 2: Reinforcement Learning for Visual Planning. The model transitions from supervised learning to RL to improve its ability to generate effective visual plans. At each step, the model produces a group of candidate next images. These are interpreted by a rule-based parser that infers which action each image transition represents. A reward function then scores each transition by how much closer the resulting state is to the goal, using a predefined progress map: actions that bring the agent closer to the goal receive a positive reward, those that make no progress receive zero, and invalid transitions (for example, ones that violate physical constraints) are penalized heavily.
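
A toy version of such a reward, with illustrative reward values and a hypothetical `progress_map` giving each state's remaining distance to the goal (the paper's exact parser and reward constants may differ):

```python
def progress_reward(state, next_state, progress_map, is_valid_transition):
    """Toy progress reward (illustrative values, not the paper's exact constants).

    progress_map[s]: remaining distance from state s to the goal.
    is_valid_transition(s, s2): rule-based check for walls, holes, multi-step jumps, etc.
    """
    if not is_valid_transition(state, next_state):
        return -5.0  # heavy penalty for invalid transitions
    if progress_map[next_state] < progress_map[state]:
        return 1.0   # moved closer to the goal
    return 0.0       # valid move, but no progress
```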

GRPO computes relative advantages by comparing rewards within each candidate group. This allows the model to focus on higher-quality planning decisions while maintaining diversity and stability through importance sampling and KL regularization.
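
A compact sketch of the GRPO pieces mentioned above: group-relative advantages plus a clipped, KL-regularized surrogate loss (coefficient values are illustrative):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each candidate's reward against its own group (shape: groups x group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_loss(logp_new, logp_old, advantages, kl, clip_eps=0.2, kl_coef=0.01):
    """Clipped importance-sampling surrogate with KL regularization (illustrative coefficients)."""
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean() + kl_coef * kl.mean()
```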

System Variants

The authors introduce two baseline methods to compare different training strategies and modalities for planning (a data-format sketch follows the list):

  • VPFT (Visual Planning via Fine-Tuning) is a simplified version of the main framework. It uses optimal planning trajectories instead of random ones and applies supervised learning to predict the next visual state at each step.
  • Supervised Fine-Tuning (SFT) in Text formulates planning as a language task. Given an image and a textual prompt, the model generates a textual action sequence instead of visual states. It is trained using standard cross-entropy loss to predict the correct sequence of actions.
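
An illustrative contrast of what the two supervised baselines are trained to produce; the concrete data formats below are assumptions for illustration only:

```python
# VPFT: supervision target is the next *visual state* along an optimal trajectory.
vpft_example = {
    "input":  "<image: current FrozenLake state>",
    "target": "<image: next FrozenLake state on the optimal path>",
}

# Text SFT: supervision target is a *textual action sequence* for the same task.
text_sft_example = {
    "input":  ["<image: initial FrozenLake state>",
               "Describe the sequence of moves that reaches the goal."],
    "target": "down down right right",
}
```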

Results

The authors evaluate their approach on three visual navigation environments:

  • FrozenLake: a grid-based frozen lake where the agent must travel from its designated start position to the destination without falling into the "holes".
  • Maze: given an initial image describing the maze layout, the agent must navigate from the starting point to the destination.
  • MiniBehavior: the agent must first reach the printer from the starting point and pick it up, then carry it to the table and drop it there.

Visual planning methods (VPFT and VPRL) outperform all language-based baselines across multiple tasks. Even with the same supervised training setup, VPFT achieves over 22% higher Exact Match than text-based SFT. VPRL, which adds reinforcement learning, performs even better, especially on complex tasks, and achieves near-perfect scores on the simpler ones.

Reinforcement learning proves highly effective: after Stage 2, VPRL improves planning accuracy by over 20% compared to VPFT. Unlike supervised methods that merely imitate demonstrations, RL lets the model explore and learn from outcomes.

Finally, visual planners, especially VPRL, show strong robustness to task complexity. While models like Gemini 2.5 Pro drop sharply in accuracy as task difficulty increases (for example, when the grid size grows from 3x3 to 6x6), VPRL maintains high performance with minimal degradation, confirming its scalability and stability.

Discussions and Analysis

The error analysis shows that visual planners sometimes take non-optimal paths, but they are better at avoiding invalid actions such as walking through walls or executing multiple moves at once. In contrast, language-based models such as Gemini 2.5 Pro and text-based SFT often misinterpret the environment or lose track of state, leading to cascading errors.

VPRL shows greater flexibility, taking detours when needed and still reaching the goal, while VPFT tends to get stuck. This is because VPRL starts RL from the random-walk policy initialization, which encourages broader exploration; initializing RL from VPFT instead limits exploration, since the model keeps repeating similar patterns and provides little learning signal during RL training. The entropy analysis supports this: VPFT quickly collapses to low-entropy (repetitive) behavior, while VPRL maintains high entropy with fewer invalid actions.

Finally, VPRL significantly reduces the invalid-failure ratio (the proportion of failed plans due to invalid actions) by at least 24% across tasks, showing it not only succeeds more often but also adheres better to environmental constraints.
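
A tiny sketch of how such a metric can be computed, assuming a hypothetical per-episode log with `success` and `invalid_action` flags:

```python
def invalid_failure_ratio(episodes):
    """Among *failed* plans, the fraction that failed because of an invalid action.

    episodes: iterable of dicts like {"success": bool, "invalid_action": bool}.
    """
    failures = [e for e in episodes if not e["success"]]
    if not failures:
        return 0.0
    return sum(e["invalid_action"] for e in failures) / len(failures)
```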
