Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Created by
  • Haebom

Author

Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson

Outline

While LLM-judge benchmarks are widely used to evaluate complex model behavior, they introduce failure modes not present in traditional correct-response benchmarks. This paper argues that without rigorous objectives and verifiable constructs, such benchmarks can produce rankings that appear highly reliable yet are effectively noise. The authors propose two diagnostic mechanisms. Schema compliance quantifies how much of a judge's overall verdict is explained by its explicit evaluation schema, exposing unexplained variance when judges deviate from their own rubrics. Psychometric validity aggregates internal-consistency and discriminant-validity signals to quantify the irreducible uncertainty of a benchmarking exercise. Applying these tools to Arena-Hard Auto, the authors find substantial schema inconsistency and factor collapse across widely used judges; for example, DeepSeek-R1-32B exhibits over 90% unexplained variance and factor correlations greater than 0.93 for most criteria. They further show that Elo-style aggregation collapses these heterogeneous judgments and obscures the true uncertainty of the resulting rankings. These results highlight design flaws that compromise validity and provide actionable principles for building reliability-aware LLM-judge benchmarks with better coverage.
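As a rough illustration of the two diagnostics, here is a minimal sketch assuming per-item rubric criterion scores and overall verdicts from a single judge are available as arrays; the function names and synthetic data are ours, not the paper's. Schema compliance is approximated as the R² of a linear fit of the overall verdict on the rubric scores (its complement is the unexplained variance), and near-unit correlations between criteria are the kind of factor collapse the authors report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def unexplained_variance(rubric_scores: np.ndarray, overall_verdicts: np.ndarray) -> float:
    """Fraction of variance in the judge's overall verdicts NOT explained by a
    linear combination of its own rubric criterion scores. High values suggest
    the judge is deviating from its stated schema."""
    model = LinearRegression().fit(rubric_scores, overall_verdicts)
    return 1.0 - model.score(rubric_scores, overall_verdicts)  # 1 - R^2

def criterion_correlations(rubric_scores: np.ndarray) -> np.ndarray:
    """Pairwise correlations between rubric criteria. Correlations near 1 mean
    the criteria are not measuring distinct constructs (factor collapse)."""
    return np.corrcoef(rubric_scores, rowvar=False)

# Toy example with synthetic data (illustration only).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))                                   # 200 items, 4 criteria
verdict = scores @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=2.0, size=200)

print(f"Unexplained variance: {unexplained_variance(scores, verdict):.2f}")
print(np.round(criterion_correlations(scores), 2))
```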

Takeaways, Limitations

Highlights design issues with LLM-judge benchmarks: without rigorous objectives and verifiable constructs, their rankings can be effectively noise.
Proposes diagnostic mechanisms: schema compliance and psychometric validity for assessing benchmark reliability.
Analysis of Arena-Hard Auto: finds serious schema inconsistency and factor collapse, and shows that Elo-style aggregation obscures ranking uncertainty (see the sketch after this list).
Directions for improvement: proposes principles for building reliability-aware LLM-judge benchmarks with better coverage.
Limitations: the analysis focuses on a single benchmark (Arena-Hard Auto).
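The point about Elo-style aggregation can be made concrete with a small simulation of our own (not the paper's procedure): a single Elo fit over judge-decided pairwise battles yields one crisp-looking ranking, while bootstrap resampling of the same battles shows how wide the underlying rating intervals actually are.

```python
import numpy as np

def elo_ratings(battles, n_models=3, k=4.0, base=1000.0):
    """Sequential Elo update over (winner_index, loser_index) pairs."""
    r = np.full(n_models, base)
    for w, l in battles:
        expected_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / 400.0))
        r[w] += k * (1.0 - expected_w)
        r[l] -= k * (1.0 - expected_w)
    return r

# Synthetic pairwise battles between 3 models with slightly different strengths.
rng = np.random.default_rng(1)
strength = np.array([0.55, 0.50, 0.45])
battles = []
for _ in range(2000):
    a, b = rng.choice(3, size=2, replace=False)
    p_a = strength[a] / (strength[a] + strength[b])
    battles.append((a, b) if rng.random() < p_a else (b, a))

# One Elo fit gives a single ranking; bootstrap resampling of the battles
# reveals how much ranking uncertainty that single number hides.
point = elo_ratings(battles)
boot = np.array([
    elo_ratings([battles[i] for i in rng.integers(len(battles), size=len(battles))])
    for _ in range(200)
])
print("Point ratings:      ", np.round(point, 1))
print("Bootstrap 95% bounds:\n", np.round(np.percentile(boot, [2.5, 97.5], axis=0), 1))
```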