NewMind AI Journal #61
TTRL: Test-Time Reinforcement Learning
By Yuxin Zuo, Kaiyan Zhang, Shang Qu, et al.
📌 Large Language Models (LLMs) struggle to improve on reasoning tasks at inference time when ground-truth labels are unavailable.
📌 Test-Time Scaling (TTS) improves performance by increasing compute during inference, but applying Reinforcement Learning (RL) typically requires ground-truth labels.
📌 This article introduces Test-Time Reinforcement Learning (TTRL), a novel method enabling LLMs to train using RL directly on unlabeled test data by leveraging the model's own priors and a majority voting mechanism for reward estimation.
How It Works
TTRL operates without ground-truth labels. Given a prompt, the LLM generates multiple candidate outputs. A majority voting mechanism estimates a consensus answer, which serves as a pseudo-label. Rule-based rewards are then calculated based on how well each sampled output aligns with this estimated label. This reward signal is used to drive RL training, allowing the model parameters to adapt during inference and improve performance on distribution-shifted inputs.
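To make the reward construction concrete, here is a minimal sketch in Python. It assumes exact string match between extracted final answers; the paper's actual answer extraction and RL update are more involved, and `ttrl_rewards` is a hypothetical name for illustration.

```python
from collections import Counter

def ttrl_rewards(sampled_answers):
    """Majority-vote pseudo-labeling, the core of TTRL's reward signal:
    the most common answer among the model's own samples becomes the
    pseudo-label, and each sample is rewarded for matching it.
    Minimal sketch assuming exact-match answers."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
    return pseudo_label, rewards

# Hypothetical: 8 sampled final answers to one math prompt.
answers = ["42", "42", "17", "42", "39", "42", "17", "42"]
label, rewards = ttrl_rewards(answers)
print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

These per-sample rewards then drive the policy update, so the model is pushed toward its own consensus answer rather than an external label.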
Key Findings & Results
Experiments show TTRL consistently improves performance across tasks and models. Notably, Qwen-2.5-Math-7B saw a 159% pass@1 increase on AIME 2024 using only unlabeled test data. TTRL scales with model size and generalizes well to out-of-distribution tasks. Surprisingly, TTRL can surpass the performance ceiling of its own majority-voted training signal and approach the performance of models trained directly with ground-truth labels.
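For context, the gain is relative: a quick sanity check with illustrative numbers (hypothetical stand-ins, not the paper's reported scores) shows how a 159% increase in pass@1 is computed.

```python
# Illustrative arithmetic for a "159% pass@1 increase" (relative gain).
baseline = 0.167  # hypothetical pre-TTRL pass@1
final = 0.433     # hypothetical post-TTRL pass@1
print(f"{(final - baseline) / baseline:.0%} relative increase")  # 159% relative increase
```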
Why It Matters
TTRL represents a significant step towards enabling LLMs to self-evolve and learn continually on unlabeled data streams, substantially reducing reliance on expensive human annotations. This is crucial for handling complex, newly emerging real-world tasks. While effective, TTRL's success depends on sufficient prior knowledge in the backbone model and careful hyperparameter tuning, highlighting areas for future robustness improvements.
Our Insight
TTRL's ability to achieve significant performance gains through self-supervision on test data is quite impressive. The finding that it can exceed its own training signal suggests a powerful self-reinforcing loop. This work opens exciting possibilities for unbounded lifelong learning in LLMs, particularly for tasks where ground-truth labels are scarce or impossible to obtain.
Source: April 22, 2025, "TTRL: Test-Time Reinforcement Learning," Yuxin Zuo, Kaiyan Zhang, Shang Qu, et al., Tsinghua University, Shanghai AI Lab
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
By Guo Chen, Zhiqi Li, Shihao Wang, et al.
📌 While Vision-Language Models (VLMs) have advanced significantly, they often struggle with understanding long videos and high-resolution images due to limitations in processing extensive visual contexts.
📌 This article introduces Eagle 2.5, a family of VLMs specifically designed and trained for long-context multimodal learning, aiming to provide a generalist framework for tackling these challenging tasks efficiently.
How It Works
Eagle 2.5 employs a novel training framework featuring "information-first sampling" and "progressive training." Information-first sampling includes Image Area Preservation for high-resolution images and Automatic Degradation Sampling to dynamically balance visual and textual inputs. Progressive training incrementally increases the context length during training. The work also introduces Eagle-Video-110K, a new dataset with dual story-level and clip-level annotations to specifically enhance long video understanding.
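The paper's exact sampling rules aren't reproduced here, but a minimal Python sketch can convey both ideas; function names, token budgets, and stage lengths below are illustrative assumptions, not Eagle 2.5's implementation.

```python
def degradation_sampling(num_frames: int, text_tokens: int,
                         max_tokens: int, tokens_per_frame: int) -> int:
    """Sketch of degradation-style sampling: reserve room for the text,
    then keep as many video frames as the remaining visual budget allows
    (a real system might lower resolution or resample instead)."""
    visual_budget = max(0, max_tokens - text_tokens)
    return min(num_frames, max(1, visual_budget // tokens_per_frame))

def progressive_schedule(stage_lengths, steps_per_stage):
    """Yield (step, max_context_len) pairs that extend the context window
    stage by stage, mirroring the progressive-training idea."""
    step = 0
    for max_len in stage_lengths:
        for _ in range(steps_per_stage):
            yield step, max_len
            step += 1

# Hypothetical numbers: a 128-frame video, 2K text tokens, 32K budget.
print(degradation_sampling(num_frames=128, text_tokens=2_048,
                           max_tokens=32_768, tokens_per_frame=256))  # 120

# Hypothetical three-stage schedule: 32K -> 64K -> 128K tokens.
for step, max_len in progressive_schedule([32_768, 65_536, 131_072],
                                          steps_per_stage=2):
    print(step, max_len)
```

The design intuition is that text carries denser task-relevant information per token than raw frames, so visual input degrades first when the budget is tight, while the growing schedule lets the model stabilize on short contexts before seeing long ones.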
Key Findings & Results
Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks. The 8B-parameter model achieves 72.4% on Video-MME with 512 input frames, matching the performance of much larger models such as GPT-4o and 72B/78B-parameter open-source models. Ablation studies confirm that the proposed sampling strategies, progressive training, and the Eagle-Video-110K dataset are each crucial to these gains, and performance scales consistently as input length increases.
Why It Matters
This research provides a robust solution to a critical limitation of current VLMs: handling long videos and high-resolution images. By achieving state-of-the-art performance with a significantly smaller model, Eagle 2.5 paves the way for more efficient and accessible long-context VLMs. Its training strategies and new dataset lay a strong foundation for future research in scalable multimodal understanding for complex real-world scenarios.
Our Insight
Eagle 2.5's ability to compete with models many times its size on long-context video benchmarks is a compelling result. The focus on efficient training strategies and a tailored dataset highlights the importance of data and methodology alongside model scaling. This work is a valuable contribution towards making advanced multimodal AI more practical and widely deployable.
Source: April 22, 2025, "Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models," Guo Chen, Zhiqi Li, Shihao Wang, et al., Nanjing University, NVIDIA, The Hong Kong Polytechnic University, Rutgers University
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
By Minghao Wu, Weixuan Wang, Sinuo Liu, et al.
📌 As Large Language Models (LLMs) become increasingly multilingual, evaluating their capabilities across diverse languages and cultures is critical for equitable technological progress.
📌 This article presents a comprehensive analysis of over 2,000 non-English multilingual benchmarks published between 2021 and 2024 to understand the current state, identify limitations, and propose directions for more effective evaluation practices.
Methodology and Benchmark Analysis
The researchers collected and annotated 2,024 multilingual benchmarks drawn from arXiv papers. They analyzed historical trends (PAST) in language distribution, task types, translation methods, dataset sizes, domains, and the geographic centers of benchmark development. They then assessed the current state (PRESENT) by comparing LLM benchmark performance with human judgments across multiple languages and by evaluating translated versus localized benchmarks.
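As a toy illustration of the PRESENT-stage comparison, one can correlate model scores on a benchmark with human ratings; the snippet below uses Spearman rank correlation on made-up numbers (the paper's exact metric and data are not reproduced here).

```python
from scipy.stats import spearmanr  # assumes SciPy is installed

# Hypothetical per-language scores: benchmark accuracy vs. human judgment.
benchmark_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
human_ratings    = [0.60, 0.50, 0.75, 0.40, 0.70, 0.52]

rho, p = spearmanr(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```

A high correlation means the benchmark ranks systems the way humans would; the low correlations reported below are what make the paper's lesson "bitter."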
Key Findings & Results
The analysis reveals significant biases: English remains overrepresented, high-resource languages dominate, and most benchmarks rely on original-language content rather than translations. Although dataset sizes are growing and the aggregate cost of creating these benchmarks is estimated at over $11M, high-value domains such as healthcare and law remain underrepresented. Crucially, STEM-related tasks correlate strongly with human judgments (0.70-0.85), whereas traditional NLP tasks correlate poorly (0.11-0.30). Localized benchmarks (0.68 correlation) align far better with human judgment than translated ones (0.47-0.49).
The Bitter Lesson
This paper delivers a "bitter lesson," demonstrating that current multilingual benchmarks often fail to accurately reflect real-world performance and human perception, particularly when relying on translations. It underscores the urgent need for culturally authentic, practically relevant, and linguistically diverse benchmarks. The findings highlight critical gaps in evaluating Natural Language Generation and low-resource languages, calling for global collaboration to develop more robust and equitable evaluation frameworks.
Our Insight
The sheer scale of this analysis provides compelling evidence of the biases and shortcomings in current multilingual evaluation. The finding that translated benchmarks correlate poorly with human judgment is a stark reminder that language is deeply intertwined with culture and context. This work is essential reading for anyone involved in building or evaluating multilingual AI, offering a clear roadmap for moving towards more meaningful and equitable assessments.
Source: April 22, 2025, "The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks," Minghao Wu, Weixuan Wang, Sinuo Liu, et al., Alibaba International Digital Commerce, Monash University, The University of Edinburgh, Tsinghua University, Universität Hamburg