Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Created by
  • Haebom

Author

Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yair Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng

CritPt: Complex Research using Integrated Thinking - Physics Test

Outline

This paper asks whether LLMs can effectively reason about complex, open problems in cutting-edge physics research, and what kinds of reasoning tasks physicists actually want LLMs to support. To this end, the authors present CritPt (Complex Research using Integrated Thinking - Physics Test), the first benchmark designed to test unpublished, research-level reasoning tasks. CritPt consists of 71 complex research problems spanning modern physics, including condensed matter, quantum physics, atomic, molecular, and optical physics, astrophysics, high-energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics. Each problem was created by physicists and is evaluated through an automated scoring pipeline. While current SOTA LLMs show initial promise on individual checkpoints (smaller subtasks), they remain far from reliably solving full-scale research problems: even when equipped with a coding tool, GPT-5 (high) achieves an accuracy of only about 10%. CritPt highlights the significant gap between the capabilities of current models and the needs of real-world physics research, and provides a foundation for developing scientifically grounded AI tools.

Takeaways, Limitations

Takeaways:
CritPt provides a new benchmark for assessing LLMs' ability to tackle complex problems drawn from real-world physics research.
Current SOTA LLMs still struggle with full-scale physics research problems.
The benchmark lays a foundation for developing scientifically grounded AI tools and suggests directions for doing so.
Limitations:
Current LLM performance on the benchmark is low, limiting practical usefulness in research.
CritPt's problems rely on the knowledge of physics experts, so problem creation and evaluation require significant expertise.
Model accuracy remains low, and further research is needed to improve the models.