Thyme: Think Beyond Images
Today's paper introduces Thyme (Think Beyond Images), a novel approach that enables multimodal large language models to autonomously generate and execute code for diverse image processing operations and mathematical computations. Unlike existing "think with images" methods that are limited to simple cropping or image generation, Thyme provides rich functionality including rotation, contrast enhancement, zooming, and complex calculations while maintaining high autonomy in deciding when and how to apply these operations.
Method Overview
Thyme operates through a pipeline where the model first analyzes a given problem and determines whether code generation is necessary. If the problem is simple enough, it provides a direct answer. However, for complex scenarios requiring image manipulation or mathematical computation, the model autonomously generates Python code to perform operations such as cropping, rotation, contrast enhancement, or numerical calculations.
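To make this concrete, below is a hypothetical example of the kind of Python snippet the model might emit to zoom in on a hard-to-read region. The file name, coordinates, scale factor, and enhancement factor are illustrative assumptions, not values from the paper.

```python
# Hypothetical example of code Thyme might generate to inspect a region:
# crop it, upscale it, and boost contrast so fine detail becomes legible.
from PIL import Image, ImageEnhance

image = Image.open("input.png")

# Region of interest as (left, upper, right, lower); in Thyme these
# coordinates would be chosen by the model from its own analysis.
region = image.crop((120, 340, 520, 610))

# Upscale the crop and enhance contrast before re-examining it.
region = region.resize((region.width * 3, region.height * 3), Image.LANCZOS)
region = ImageEnhance.Contrast(region).enhance(1.8)

region.save("processed.png")
```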
The generated code is executed within a secure sandbox environment that handles formatting, error correction, and boundary conditions automatically. This sandbox reduces the model's coding burden by fixing minor issues like indentation problems or out-of-bounds cropping coordinates without affecting the code's functionality. The execution results are then fed back to the model for further analysis and final answer generation.
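As a minimal sketch of one such repair, assuming the sandbox clamps crop boxes to the image bounds (the function name and exact behavior are illustrative, not the paper's actual implementation):

```python
# Minimal sketch (not the paper's implementation) of one sandbox repair:
# clamping a model-proposed crop box to the image bounds so a slightly
# out-of-range coordinate does not crash the generated code.
def safe_crop_box(box, width, height):
    """Clamp a (left, upper, right, lower) box to an image of the given size."""
    left, upper, right, lower = box
    left = max(0, min(left, width - 1))
    upper = max(0, min(upper, height - 1))
    right = max(left + 1, min(right, width))
    lower = max(upper + 1, min(lower, height))
    return (left, upper, right, lower)

# Example: a box whose right edge overshoots a 1024x768 image is repaired
# rather than raising an error during execution.
print(safe_crop_box((900, 100, 1300, 400), 1024, 768))  # (900, 100, 1024, 400)
```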
The training process consists of two main stages. The first is Supervised Fine-Tuning (SFT) on a carefully curated dataset of 500,000 samples covering scenarios that range from simple direct answers to complex multi-turn interactions. The data construction pipeline starts from over 4 million raw samples, filtering and verifying code execution quality through both automated sandbox testing and expert review.
The second stage employs Reinforcement Learning with a novel algorithm called GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling). This algorithm addresses a key challenge in code generation by applying different sampling temperatures for text reasoning (temperature = 1.0 for exploration) versus code generation (temperature = 0.0 for precision). This prevents code execution failures due to random characters or formatting errors while still encouraging diverse reasoning approaches.
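Here is a minimal sketch of the adaptive-temperature idea at decoding time. The `<code>`/`</code>` delimiters, the `logits_fn` and `tokenizer` interfaces, and the sampling loop are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of adaptive temperature sampling: sample text tokens at
# temperature 1.0 for exploration, but decode greedily (temperature -> 0)
# while inside a code span, so emitted code stays syntactically clean.
import numpy as np

def sample_token(logits, temperature):
    """Greedy when temperature is ~0, otherwise stable softmax sampling."""
    if temperature < 1e-6:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def decode(logits_fn, tokenizer, max_len=512):
    """logits_fn maps a token prefix to next-token logits (assumed interface)."""
    tokens, in_code = [], False
    for _ in range(max_len):
        logits = logits_fn(tokens)
        tok = sample_token(logits, temperature=0.0 if in_code else 1.0)
        tokens.append(tok)
        text = tokenizer.decode(tokens)
        # Toggle precision mode on entering or leaving a code span.
        if text.endswith("<code>"):
            in_code = True
        elif text.endswith("</code>"):
            in_code = False
    return tokens
```

The design point is that exploration is confined to the natural-language reasoning spans, so reinforcement learning can still discover diverse strategies without corrupting the syntax of the code the model emits.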
Results
Comprehensive evaluations across nearly 20 benchmarks demonstrate significant and consistent performance improvements. In perception tasks, Thyme shows substantial gains over baseline models, particularly in challenging scenarios such as high-resolution image analysis, where it outperforms even larger models like Qwen2.5-VL-32B. For reasoning tasks, the method achieves notable improvements by converting complex mathematical computations into executable code. The approach also reduces hallucination rates and improves performance across general multimodal tasks. Notably, the SFT stage requires only about 200 GPU hours to activate the model's core capabilities, making the approach computationally efficient.
Conclusion
Thyme presents a comprehensive solution for enabling multimodal large language models to perform sophisticated image manipulations and computations through autonomous code generation. The two-stage training approach, combined with the adaptive temperature sampling algorithm and secure sandbox environment, creates a system that balances rich functionality with high autonomy. The method achieves consistent performance improvements across diverse benchmarks while maintaining computational efficiency.
For more information, please consult the full paper.
Congrats to the authors for their work!
Zhang, Yi-Fan, et al. "Thyme: Think Beyond Images." arXiv preprint arXiv:2508.11630, 2025.