Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Decoding AI Judgment: How LLMs Assess News Credibility and Bias

Created by
  • Haebom

Authors

Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi

Outline

As large language models (LLMs) are increasingly integrated into workflows that involve evaluation, this paper investigates how those evaluations are constructed, what assumptions they rest on, and how they differ from human strategies. The study benchmarks six LLMs against expert ratings from NewsGuard and Media Bias/Fact Check (MBFC) and against human judgments collected in controlled experiments. It implements a structured, goal-oriented framework in which both models and non-expert participants follow the same evaluation procedure (criteria selection, content retrieval, and justification generation), enabling direct comparison. Although the models' outputs are consistent with expert ratings, they rely on different mechanisms: lexical associations and statistical priors stand in for contextual inference. This reliance produces systematic effects, including political asymmetry, opaque justifications, and a tendency to mistake linguistic form for epistemic validity. Delegating judgment to LLMs therefore does not simply automate evaluation; it redefines it, from normative reasoning to pattern-based approximation.
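Because the shared procedure is explicit (criteria selection, content retrieval, justification generation), it is easy to picture as a small benchmarking harness. The sketch below is a hypothetical Python illustration, not the authors' code: the `llm` callable, the prompts, the label parsing, and the `EXPERT_RATINGS` data are all invented for illustration.

```python
# Minimal sketch of the three-step evaluation procedure, scored against
# expert labels. Prompts, parsing, and data are illustrative assumptions;
# the paper benchmarks against NewsGuard and MBFC ratings.

# Hypothetical expert labels keyed by outlet.
EXPERT_RATINGS = {
    "example-outlet.com": {"newsguard": "credible"},
    "another-outlet.com": {"newsguard": "not credible"},
}

def evaluate_source(llm, outlet):
    """Run one outlet through criteria selection, content retrieval,
    and justification generation. `llm` is any callable that maps a
    prompt string to a text response."""
    criteria = llm(f"List the criteria you would use to assess the credibility of {outlet}.")
    evidence = llm(f"Summarize what you know about the content published by {outlet}.")
    justification = llm(
        "Using these criteria:\n" + criteria + "\n"
        "and this evidence:\n" + evidence + "\n"
        f"Rate {outlet} as 'credible' or 'not credible' and justify the rating."
    )
    # Crude label extraction; check the negated form first so that
    # "not credible" is not misread as "credible".
    text = justification.lower()
    verdict = "not credible" if "not credible" in text else "credible"
    return {"outlet": outlet, "criteria": criteria,
            "justification": justification, "verdict": verdict}

def agreement_with_experts(records, ratings):
    """Fraction of model verdicts matching the NewsGuard label."""
    hits = sum(1 for r in records if r["verdict"] == ratings[r["outlet"]]["newsguard"])
    return hits / len(records)

if __name__ == "__main__":
    # Stub model for demonstration; swap in a real LLM call to reproduce
    # the model side of the comparison. Human participants would follow
    # the same three steps, enabling the direct comparison the paper makes.
    stub = lambda prompt: "This outlet appears credible based on its sourcing."
    records = [evaluate_source(stub, outlet) for outlet in EXPERT_RATINGS]
    print(agreement_with_experts(records, EXPERT_RATINGS))
```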

Takeaways, Limitations

Takeaways: By clearly demonstrating the systematic biases and limitations that arise when LLMs are integrated into assessment processes, the work invites a deeper discussion of the reliability and ethical implications of LLM-based assessment systems. A better understanding of how LLMs form judgments can, in turn, inform the design of more accurate and fair assessment systems.
Limitations: The study covers a specific set of LLMs and assessment tools, so its findings may not generalize to other models or evaluation domains. The subjectivity and inconsistency of human judgment cannot be fully ruled out. And because the internal mechanisms of LLMs are not fully explained, further research is needed to identify the root causes of the observed systematic biases.