Paper Review: Visual Planning: Let’s Think Only with Images

Paper

Code

Visual Planning is a new approach in which reasoning and planning are carried out through sequences of images rather than text, particularly for tasks involving spatial or geometric reasoning. The authors argue that language is not always the most natural or effective medium for such tasks. They propose Visual Planning via Reinforcement Learning (VPRL), which uses Group Relative Policy Optimization (GRPO) to fine-tune large vision models, and show that it outperforms traditional text-based reasoning methods on visual navigation tasks.

The approach

Most previous visual reasoning methods convert visual input into text (object names or relations) and then perform the reasoning with language models. In contrast, the visual planning paradigm keeps reasoning entirely within the visual domain: the model generates a sequence of images (a visual trajectory) that represents the step-by-step planning process. The trajectory is produced autoregressively, with each image conditioned on the initial input image and the previously generated images.
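
A minimal sketch of this autoregressive rollout, assuming a placeholder `next_state` callable that stands in for the large vision model (the real system operates on visual tokens rather than raw images):

```python
from typing import Any, Callable, List

Image = Any  # placeholder for whatever visual-state representation the model uses


def rollout(next_state: Callable[[List[Image]], Image],
            initial_image: Image,
            max_steps: int) -> List[Image]:
    """Autoregressively generate a visual trajectory v_1, ..., v_n.

    `next_state` stands in for the large vision model: it maps the prefix
    (initial image plus previously generated images) to the next visual state.
    """
    trajectory = [initial_image]
    for _ in range(max_steps):
        trajectory.append(next_state(trajectory))  # condition on the full prefix
    return trajectory
```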

Reinforcement Learning for Large Vision Models

Stage 1: Policy Initialization. The model is first trained with supervised learning on random-walk trajectories through the environment, which consist of sequences of visual states. From each trajectory, the model learns to predict the next visual state given a prefix; when several next steps are possible, a single candidate is sampled at random as the target. This stage ensures the model can generate visually coherent sequences and serves as a warm-up for RL.
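
A rough sketch of this Stage-1 objective, assuming the visual states have already been tokenized into discrete visual tokens and that `model` exposes a HuggingFace-style causal interface returning `.logits` (both are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F


def policy_init_loss(model, prefix_tokens, next_state_tokens):
    """Stage-1 supervised objective (sketch): given a trajectory prefix,
    maximize the likelihood of the visual tokens of the sampled next state.

    prefix_tokens:     (batch, prefix_len)  visual tokens of v_0..v_t
    next_state_tokens: (batch, target_len)  visual tokens of the randomly
                                            sampled valid next state v_{t+1}
    """
    inputs = torch.cat([prefix_tokens, next_state_tokens], dim=1)
    logits = model(inputs).logits                # (batch, seq_len, vocab)
    # Only the positions that generate the next state contribute to the loss.
    target_len = next_state_tokens.size(1)
    pred = logits[:, -target_len - 1:-1, :]      # shift by one for next-token prediction
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           next_state_tokens.reshape(-1))
```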

Stage 2: Reinforcement Learning for Visual Planning. The model transitions from supervised learning to RL to improve its ability to generate effective visual plans. At each step, the model produces a group of candidate next images. These are interpreted by a rule-based parser that infers which action each image transition represents. A reward function then scores each transition by how much closer the resulting state is to the goal, using a predefined progress map: actions that bring the agent closer to the goal receive a positive reward, those that make no progress receive zero, and invalid transitions (for example, ones that violate physical constraints) are penalized heavily.
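
A toy version of such a reward, with illustrative reward values and a hypothetical `progress_map` giving each state's remaining distance to the goal (the paper's exact parser and reward constants may differ):

```python
def progress_reward(state, next_state, progress_map, is_valid_transition):
    """Toy progress reward (illustrative values, not the paper's exact constants).

    progress_map[s]: remaining distance from state s to the goal.
    is_valid_transition(s, s2): rule-based check for walls, holes, multi-step jumps, etc.
    """
    if not is_valid_transition(state, next_state):
        return -5.0  # heavy penalty for invalid transitions
    if progress_map[next_state] < progress_map[state]:
        return 1.0   # moved closer to the goal
    return 0.0       # valid move, but no progress
```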

GRPO computes relative advantages by comparing rewards within each candidate group. This allows the model to focus on higher-quality planning decisions while maintaining diversity and stability through importance sampling and KL regularization.
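
A compact sketch of the GRPO pieces mentioned above: group-relative advantages plus a clipped, KL-regularized surrogate loss (coefficient values are illustrative):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each candidate's reward against its own group (shape: groups x group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_loss(logp_new, logp_old, advantages, kl, clip_eps=0.2, kl_coef=0.01):
    """Clipped importance-sampling surrogate with KL regularization (illustrative coefficients)."""
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean() + kl_coef * kl.mean()
```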

System Variants

The authors introduce two baseline methods to compare different training strategies and modalities for planning (a data-format sketch follows the list):

  • VPFT (Visual Planning via Fine-Tuning) is a simplified version of the main framework. It uses optimal planning trajectories instead of random ones and applies supervised learning to predict the next visual state at each step.
  • Supervised Fine-Tuning (SFT) in Text formulates planning as a language task. Given an image and a textual prompt, the model generates a textual action sequence instead of visual states. It is trained using standard cross-entropy loss to predict the correct sequence of actions.
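
An illustrative contrast of what the two supervised baselines are trained to produce; the concrete data formats below are assumptions for illustration only:

```python
# VPFT: supervision target is the next *visual state* along an optimal trajectory.
vpft_example = {
    "input":  "<image: current FrozenLake state>",
    "target": "<image: next FrozenLake state on the optimal path>",
}

# Text SFT: supervision target is a *textual action sequence* for the same task.
text_sft_example = {
    "input":  ["<image: initial FrozenLake state>",
               "Describe the sequence of moves that reaches the goal."],
    "target": "down down right right",
}
```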

Results

The authors evaluate their approach on three visual navigation environments:

  • FrozenLake: a grid-based frozen lake where the agent must travel from its designated start position to the destination without falling into the "holes".
  • Maze: given an initial image describing the maze layout, the agent must navigate from the starting point to the destination.
  • MiniBehavior: the agent must first reach the printer from the starting point and pick it up, then carry it to the table and drop it there.

Visual planning methods (VPFT and VPRL) outperform all language-based baselines across multiple tasks. Even with the same supervised training setup, VPFT achieves over 22% higher Exact Match than text-based SFT. VPRL, which adds reinforcement learning, performs even better, especially on complex tasks, and achieves near-perfect scores on the simpler ones.

Reinforcement learning proves highly effective: after Stage 2, VPRL improves planning accuracy by over 20% compared to VPFT. Unlike supervised methods that merely imitate demonstrations, RL lets the model explore and learn from outcomes.

Finally, visual planners, especially VPRL, show strong robustness to task complexity. While models like Gemini 2.5 Pro drop sharply in accuracy as task difficulty increases (for example, when the grid size grows from 3x3 to 6x6), VPRL maintains high performance with minimal degradation, confirming its scalability and stability.

Discussions and Analysis

The error analysis shows that visual planners sometimes take non-optimal paths, but they are better at avoiding invalid actions such as walking through walls or executing multiple moves at once. In contrast, language-based models such as Gemini 2.5 Pro and text-based SFT often misinterpret the environment or lose track of state, leading to cascading errors.

VPRL shows greater flexibility, taking detours when needed and still reaching the goal, while VPFT tends to get stuck. This is because VPRL starts RL from the random-walk policy initialization, which encourages broader exploration; initializing RL from VPFT instead limits exploration, since the model keeps repeating similar patterns and provides little learning signal during RL training. The entropy analysis supports this: VPFT quickly collapses to low-entropy (repetitive) behavior, while VPRL maintains high entropy with fewer invalid actions.

Finally, VPRL significantly reduces the invalid-failure ratio (the proportion of failed plans due to invalid actions) by at least 24% across tasks, showing it not only succeeds more often but also adheres better to environmental constraints.
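
A tiny sketch of how such a metric can be computed, assuming a hypothetical per-episode log with `success` and `invalid_action` flags:

```python
def invalid_failure_ratio(episodes):
    """Among *failed* plans, the fraction that failed because of an invalid action.

    episodes: iterable of dicts like {"success": bool, "invalid_action": bool}.
    """
    failures = [e for e in episodes if not e["success"]]
    if not failures:
        return 0.0
    return sum(e["invalid_action"] for e in failures) / len(failures)
```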
