This paper addresses the limitations of vision-language models (VLMs) in understanding spatiotemporal interactions. Existing VLMs struggle to reason about object motion, rotation, and viewpoint changes, capabilities that are essential for understanding dynamic real-world situations. We therefore present VLM4D, a novel benchmark for evaluating the spatiotemporal reasoning capabilities of VLMs. VLM4D consists of diverse real-world and synthetic videos with carefully constructed question-answer pairs emphasizing translational and rotational motion, viewpoint awareness, and motion continuity. A comprehensive evaluation of state-of-the-art VLMs reveals significant performance gaps relative to human baselines, highlighting fundamental deficiencies in existing models. Our analysis shows that VLMs struggle to integrate multiple visual cues and to maintain temporal coherence. We also explore promising directions for improvement, including 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, and demonstrate their effectiveness in enhancing spatiotemporal understanding. This work aims to encourage further exploration of spatial and temporal grounding in VLMs, toward more capable and reliable visual intelligence for dynamic environments.