Thyme: Think Beyond Images
Today's paper introduces Thyme (Think Beyond Images), a novel approach that enables multimodal large language models to autonomously generate and execute code for diverse image processing operations and mathematical computations. Unlike existing "think with images" methods that are limited to simple cropping or image generation, Thyme provides rich functionality including rotation, contrast enhancement, zooming, and complex calculations while maintaining high autonomy in deciding when and how to apply these operations.
Method Overview
Thyme operates through a pipeline where the model first analyzes a given problem and determines whether code generation is necessary. If the problem is simple enough, it provides a direct answer. However, for complex scenarios requiring image manipulation or mathematical computation, the model autonomously generates Python code to perform operations such as cropping, rotation, contrast enhancement, or numerical calculations.
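To make this concrete, below is a hypothetical example of the kind of Python snippet the model might emit to zoom in on a hard-to-read region. The file name, coordinates, scale factor, and enhancement factor are illustrative assumptions, not values from the paper.

```python
# Hypothetical example of code Thyme might generate to inspect a region:
# crop it, upscale it, and boost contrast so fine detail becomes legible.
from PIL import Image, ImageEnhance

image = Image.open("input.png")

# Region of interest as (left, upper, right, lower); in Thyme these
# coordinates would be chosen by the model from its own analysis.
region = image.crop((120, 340, 520, 610))

# Upscale the crop and enhance contrast before re-examining it.
region = region.resize((region.width * 3, region.height * 3), Image.LANCZOS)
region = ImageEnhance.Contrast(region).enhance(1.8)

region.save("processed.png")
```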
The generated code is executed within a secure sandbox environment that handles formatting, error correction, and boundary conditions automatically. This sandbox reduces the model's coding burden by fixing minor issues like indentation problems or out-of-bounds cropping coordinates without affecting the code's functionality. The execution results are then fed back to the model for further analysis and final answer generation.
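As a minimal sketch of one such repair, assuming the sandbox clamps crop boxes to the image bounds (the function name and exact behavior are illustrative, not the paper's actual implementation):

```python
# Minimal sketch (not the paper's implementation) of one sandbox repair:
# clamping a model-proposed crop box to the image bounds so a slightly
# out-of-range coordinate does not crash the generated code.
def safe_crop_box(box, width, height):
    """Clamp a (left, upper, right, lower) box to an image of the given size."""
    left, upper, right, lower = box
    left = max(0, min(left, width - 1))
    upper = max(0, min(upper, height - 1))
    right = max(left + 1, min(right, width))
    lower = max(upper + 1, min(lower, height))
    return (left, upper, right, lower)

# Example: a box whose right edge overshoots a 1024x768 image is repaired
# rather than raising an error during execution.
print(safe_crop_box((900, 100, 1300, 400), 1024, 768))  # (900, 100, 1024, 400)
```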
The training process consists of two main stages. The first is Supervised Fine-Tuning (SFT) on a carefully curated dataset of 500,000 samples covering scenarios that range from simple direct answers to complex multi-turn interactions. The data construction pipeline starts from over 4 million raw samples, filtering and verifying code execution quality through both automated sandbox testing and expert review.
The second stage employs Reinforcement Learning with a novel algorithm called GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling). This algorithm addresses a key challenge in code generation by applying different sampling temperatures for text reasoning (temperature = 1.0 for exploration) versus code generation (temperature = 0.0 for precision). This prevents code execution failures due to random characters or formatting errors while still encouraging diverse reasoning approaches.
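Here is a minimal sketch of the adaptive-temperature idea at decoding time. The `<code>`/`</code>` delimiters, the `logits_fn` and `tokenizer` interfaces, and the sampling loop are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of adaptive temperature sampling: sample text tokens at
# temperature 1.0 for exploration, but decode greedily (temperature -> 0)
# while inside a code span, so emitted code stays syntactically clean.
import numpy as np

def sample_token(logits, temperature):
    """Greedy when temperature is ~0, otherwise stable softmax sampling."""
    if temperature < 1e-6:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def decode(logits_fn, tokenizer, max_len=512):
    """logits_fn maps a token prefix to next-token logits (assumed interface)."""
    tokens, in_code = [], False
    for _ in range(max_len):
        logits = logits_fn(tokens)
        tok = sample_token(logits, temperature=0.0 if in_code else 1.0)
        tokens.append(tok)
        text = tokenizer.decode(tokens)
        # Toggle precision mode on entering or leaving a code span.
        if text.endswith("<code>"):
            in_code = True
        elif text.endswith("</code>"):
            in_code = False
    return tokens
```

The design point is that exploration is confined to the natural-language reasoning spans, so reinforcement learning can still discover diverse strategies without corrupting the syntax of the code the model emits.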
Results
Comprehensive evaluations across nearly 20 benchmarks demonstrate significant and consistent performance improvements. In perception tasks, Thyme shows substantial gains over baseline models, particularly in challenging scenarios such as high-resolution image analysis, where it outperforms even larger models like Qwen2.5-VL-32B. For reasoning tasks, the method achieves notable improvements by converting complex mathematical computations into executable code. The approach also reduces hallucination rates and improves performance across general multimodal tasks. Notably, the SFT stage requires only about 200 GPU hours to activate the model's core capabilities, making the approach computationally efficient.
Conclusion
Thyme presents a comprehensive solution for enabling multimodal large language models to perform sophisticated image manipulations and computations through autonomous code generation. The two-stage training approach, combined with the adaptive temperature sampling algorithm and secure sandbox environment, creates a system that balances rich functionality with high autonomy. The method achieves consistent performance improvements across diverse benchmarks while maintaining computational efficiency.
For more information, please consult the full paper.
Congrats to the authors for their work!
Zhang, Yi-Fan, et al. "Thyme: Think Beyond Images." arXiv preprint arXiv:2508.11630, 2025.