Existing language model benchmarks provide conflicting model rankings, making model selection and comparison difficult. This paper compares model potential using a "train-before-test" approach, which applies identical benchmark-specific fine-tuning to each model before evaluation. Through extensive experiments on 24 benchmarks and 61 models, we demonstrate that rankings of model potential based on train-before-test are consistent across benchmarks. Furthermore, train-before-test restores the relationship between perplexity and downstream task performance, a relationship that is lost under conventional evaluation, and reveals that model potential is governed by a single latent factor. We recommend train-before-test as a fundamental element of LLM benchmarking.
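The procedure can be summarized as: fine-tune every model on the same benchmark-specific training data, score the fine-tuned models on the held-out test split, and compare the resulting rankings across benchmarks. The sketch below is not the paper's code; it is a minimal illustration that assumes user-supplied `fine_tune` and `evaluate` callables (hypothetical stand-ins for any fine-tuning and scoring pipeline) and uses Spearman correlation to quantify ranking agreement between two benchmarks.

```python
# Minimal sketch of train-before-test ranking (assumed interfaces:
# fine_tune(model, train_split) -> adapted model,
# evaluate(model, test_split) -> scalar score; both are hypothetical).
from scipy.stats import spearmanr


def rank_models(models, benchmark, fine_tune, evaluate, train_before_test=True):
    """Rank models on one benchmark, optionally fine-tuning each model first."""
    scores = {}
    for name, model in models.items():
        candidate = (
            fine_tune(model, benchmark["train"]) if train_before_test else model
        )
        scores[name] = evaluate(candidate, benchmark["test"])
    # Best score first.
    return sorted(scores, key=scores.get, reverse=True)


def ranking_agreement(models, bench_a, bench_b, fine_tune, evaluate):
    """Spearman correlation between the model rankings induced by two benchmarks."""
    order_a = rank_models(models, bench_a, fine_tune, evaluate)
    order_b = rank_models(models, bench_b, fine_tune, evaluate)
    ranks_a = [order_a.index(m) for m in models]
    ranks_b = [order_b.index(m) for m in models]
    return spearmanr(ranks_a, ranks_b).correlation
```

Under this framing, the paper's central claim is that `ranking_agreement` is substantially higher with `train_before_test=True` than without it.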