Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Vector-Valued Reproducing Kernel Banach Spaces for Neural Networks and Operators

Auto-ARGUE: LLM-Based Report Generation Evaluation

SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

Dolphin v1.0 Technical Report

Economic Competition, EU Regulation, and Executive Orders: A Framework for Discussing AI Policy Implications in CS Courses

Multi-modal Spatio-Temporal Transformer for High-resolution Land Subsidence Prediction

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development

The Sandbox Configurator: A Framework to Support Technical Assessment in AI Regulatory Sandboxes

DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Metaphor identification using large language models: A comparison of RAG, prompt engineering, and fine-tuning

An Agent-Based Framework for Automated Higher-Voice Harmony Generation

Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

LLM Watermark Evasion via Bias Inversion

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Explaining multimodal LLMs via intra-modal token interactions

Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments

Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset

Integrated Framework for LLM Evaluation with Answer Generation

On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection

RoVerFly: Robust and Versatile Implicit Hybrid Control of Quadrotor-Payload Systems

An Ethically Grounded LLM-Based Approach to Insider Threat Synthesis and Detection

HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation

Post Hoc Regression Refinement via Pairwise Rankings

On Task Vectors and Gradients

Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

Neural Logic Networks for Interpretable Classification

Stackelberg Coupling of Online Representation Learning and Reinforcement Learning

First Hallucination Tokens Are Different from Conditional Ones

Nonlinear Framework for Speech Bandwidth Extension

GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models

Fair CCA for Fair Representation Learning: An ADNI Study

Model Parallelism With Subnetwork Data Parallelism

Towards a Progress Bar for Reasoning: Progress Prediction in Large Reasoning Models

EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

MS-DFTVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Deformable Convolution

Training-free LLM Verification via Recycling Few-shot Examples

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

AS400-DET: Detection using Deep Learning Model for IBM i (AS/400)

LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

Estimating Visceral Adiposity from Wrist-Worn Accelerometry

MLLM-CL: Continual Learning for Multimodal Large Language Models

Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models

Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Object Centric Concept Bottlenecks

The challenge of hidden gifts in multi-agent reinforcement learning

MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Steering LLM Reasoning Through Bias-Only Adaptation

Federated Causal Inference from Multi-Site Observational Data via Propensity Score Aggregation

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Learning to Rank Chain-of-Thought: Using a Small Model

Choosing a Model, Shaping a Future: Comparing LLM Perspectives on Sustainability and its Relationship with AI

Learning Hierarchical Domain Models Through Environment-Grounded Interaction

A Physics-Inspired Optimizer: Velocity Regularized Adam

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models

BlobCtrl: Taming Controllable Blob for Element-level Image Editing

Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing

Addressing Moral Uncertainty using Large Language Models for Ethical Decision-Making

DISCOVER: Data-driven Identification of Sub-activities via Clustering and Visualization for Enhanced Activity Recognition in Smart Homes

Toward Foundational Model for Sleep Analysis Using a Multimodal Hybrid Self-Supervised Learning Framework

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Mitigating Domain Shift in Federated Learning via Intra- and Inter-Domain Prototypes

Distilling Calibration via Conformalized Credal Inference

Exploring and Controlling Diversity in LLM-Agent Conversation

Stability Bounds for the Unfolded Forward-Backward Algorithm

PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

LLM-guided Task and Motion Planning using Knowledge-based Reasoning

XRZoo: A Large-Scale and Versatile Dataset of Extended Reality (XR) Applications

3D Interaction Geometric Pre-training for Molecular Relational Learning

Balancing Multimodal Training Through Game-Theoretic Regularization

A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps

Graphon Particle Systems, Part II: Dynamics of Distributed Stochastic Continuum Optimization

A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions

Hot PATE: Private Aggregation of Distributions for Diverse Task

Adversarial Attacks to Latent Representations of Distributed Neural Networks in Split Computing

Learning Dynamic Graph Embeddings with Neural Controlled Differential Equations

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Communication-Efficient and Accurate Approach for Aggregation in Federated Low-Rank Adaptation

Interactive Learning for LLM Reasoning

ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning

Beyond the Algorithm: A Field Guide to Deploying AI Agents in Clinical Practice

It's Not That Simple. An Analysis of Simple Test-Time Scaling

Created by: Haebom

Author: Guojun Wu

Outline

This paper analyzes simple test-time scaling, a technique that reproduces the scaling behavior of o1-like models in models distilled from them by manually controlling test-time compute. The analysis shows that the observed scaling behavior comes mainly from scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data has no significant effect on the scaling behavior, and scaling up by appending "Wait" gives inconsistent results because the model can oscillate between solutions. Scaling down with a maximum length differs in an important way from how o1-like models (e.g., DeepSeek-R1) scale up test-time compute. Such models may use as much compute as they need, limited only by the maximum length the model supports. Because they learn to scale up test-time compute naturally during reinforcement learning, they exceed their own earlier peak performance as they scale up. Simple test-time scaling, by contrast, imposes a progressively lower upper bound on model performance as it scales down. Replicating the test-time scaling behavior of o1 models by scaling down is easy, but the goal of scaling test-time compute is to reach higher performance than the model could originally achieve, not merely to reproduce the appearance of scaling behavior.
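
The two manual controls described above, capping the length to scale down and appending "Wait" to scale up, can be illustrated with a toy sketch. This is not the paper's implementation: `END`, `toy_step`, `scale_down`, `scale_up`, and `hard_cap` are hypothetical names introduced here, and `toy_step` is a stand-in for a real model's next-token call.

```python
from typing import Callable, List

END = "<end>"  # hypothetical marker meaning "the model wants to stop"


def scale_down(step: Callable[[List[str]], str], max_tokens: int) -> List[str]:
    """Scale down: generate until the model stops or the length cap is reached."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        tok = step(tokens)
        if tok == END:
            break
        tokens.append(tok)
    return tokens


def scale_up(step: Callable[[List[str]], str], num_waits: int,
             hard_cap: int = 1000) -> List[str]:
    """Scale up: when the model tries to stop, append "Wait" instead,
    at most num_waits times. hard_cap is an assumed safety limit."""
    tokens: List[str] = []
    waits_used = 0
    while len(tokens) < hard_cap:
        tok = step(tokens)
        if tok == END:
            if waits_used == num_waits:
                break
            tokens.append("Wait")  # suppress the stop and force more reasoning
            waits_used += 1
        else:
            tokens.append(tok)
    return tokens


def toy_step(tokens: List[str]) -> str:
    """Toy model: emits three "think" tokens after the start or after
    each "Wait", then asks to stop."""
    since_wait = 0
    for t in reversed(tokens):
        if t == "Wait":
            break
        since_wait += 1
    return "think" if since_wait < 3 else END


if __name__ == "__main__":
    print(scale_down(toy_step, max_tokens=2))
    print(scale_up(toy_step, num_waits=1))
```

In this sketch the cap in `scale_down` can only shorten what the model would have produced anyway, which mirrors the summary's point that scaling down lowers the upper bound on performance instead of raising it.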

Takeaways, Limitations

• Takeaways: Deepens our understanding of how o1-like models improve performance by showing that scaling down through maximum length constraints is the main cause of the scaling behavior seen in simple test-time scaling. Emphasizes that the true goal of scaling test-time compute is performance improvement.

• Limitations: The paper identifies the inconsistency of scaling up by appending "Wait" but presents no specific measures to address it. The minimal effect of fine-tuning on long CoT data suggests that further research is needed. A detailed analysis of the scaling-up mechanism of o1-like models is lacking.
