Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

Created by
  • Haebom

Authors

Haiquan Hu, Jiazhi Jiang, Shiyou Xu, Ruhan Zeng, Tian Wang

Outline

To address the growing challenge of evaluating large language models (LLMs), this paper proposes a novel evaluation framework, the Structured Transition Evaluation Method (STEM). STEM analyzes the performance variations of LLMs that share an architecture but differ in parameter size in order to identify significant transition samples (STS). These STS are then used to estimate the performance of unknown models efficiently and interpretably. Using the Qwen3 model family, the authors build a pool of STS across six diverse benchmarks. Experimental results demonstrate that STEM reliably captures model performance trends and matches ground-truth performance rankings, highlighting STEM as a practical and scalable method for fine-tuning- and architecture-independent evaluation of LLMs.
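The summary does not give implementation details, but the core idea it describes (find samples whose outcome flips from wrong to right as models of the same family grow, then rank an unknown model by how many of those samples it passes) can be illustrated with a minimal sketch. The Python below is only one possible reading, not the authors' actual method; the function names, the monotonic-flip criterion, and the toy Qwen3-style data are assumptions for illustration.

```python
# Minimal sketch of the "transition sample" idea (illustrative, not the paper's code).
# Assumes per-sample correctness (0/1) for models sharing an architecture but
# differing in parameter count, listed from smallest to largest.

from typing import Dict, List

def find_transition_samples(results: Dict[str, List[int]],
                            size_order: List[str]) -> List[int]:
    """Return indices of samples whose outcome flips 0 -> 1 monotonically with model size."""
    n_samples = len(results[size_order[0]])
    transitions = []
    for i in range(n_samples):
        outcomes = [results[m][i] for m in size_order]
        # Keep samples that are non-decreasing across sizes and actually flip.
        if all(a <= b for a, b in zip(outcomes, outcomes[1:])) \
                and outcomes[0] == 0 and outcomes[-1] == 1:
            transitions.append(i)
    return transitions

def relative_score(unknown_results: List[int], transition_idx: List[int]) -> float:
    """Fraction of transition samples the unknown model answers correctly (rank proxy)."""
    if not transition_idx:
        return 0.0
    return sum(unknown_results[i] for i in transition_idx) / len(transition_idx)

if __name__ == "__main__":
    # Toy data: three model sizes from one family plus one unknown model.
    results = {
        "qwen3-0.6b": [0, 0, 1, 0, 1],
        "qwen3-4b":   [0, 1, 1, 0, 1],
        "qwen3-32b":  [1, 1, 1, 0, 1],
    }
    sts = find_transition_samples(results, ["qwen3-0.6b", "qwen3-4b", "qwen3-32b"])
    print(sts)                                   # indices of transition samples
    print(relative_score([1, 1, 0, 0, 1], sts))  # estimated relative capability
```

In this reading, the transition samples act as a compact probe set: instead of re-running a full benchmark, an unknown model is scored only on the samples where capability is known to change with scale, which is where the efficiency claim in the summary would come from.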

Takeaways, Limitations

Takeaways:
Presents a novel method that can significantly improve the efficiency and interpretability of LLM evaluation.
Effectively solves the overfitting and high computational cost problems of existing benchmarks.
Enables performance comparisons that are independent of model architecture and fine-tuning.
Provides reliable evaluation results that closely match actual performance rankings.
Limitations:
The STS pool depends on the Qwen3 model family used to build it; generalization to LLMs with other architectures needs further verification.
Further research is needed on the objectivity and generalizability of the STS selection criteria.
More extensive experiments and validation on various types of LLMs are needed.