Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Created by
  • Haebom

Authors

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Outline

This paper presents a method that fuses LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation to improve the 3D perception of autonomous vehicles. Existing methods suffer from spatial misalignment between LiDAR and camera features, which corrupts the depth supervision of the camera branch and degrades cross-modal feature aggregation. The paper argues that the root causes of this misalignment are calibration inaccuracies and projection errors induced by the rolling shutter effect, and observes that these errors are predictably concentrated at object-background boundaries, which 2D detectors identify reliably. The core idea is therefore to leverage 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, the authors propose Prior-Guided Depth Calibration (PGDC), which uses 2D priors to correct misalignment and preserve accurate cross-modal feature pairs. To address global alignment errors, they introduce Discontinuity-Aware Geometric Fusion (DAGF), which suppresses residual noise from PGDC and explicitly sharpens depth discontinuities at object-background boundaries to produce structurally aware representations. To exploit the aligned representations, they integrate a Structural Guidance Depth Modulator (SGDM), which efficiently fuses aligned depth and image features using a gated attention mechanism. The proposed method achieves state-of-the-art performance (71.5% mAP, 73.6% NDS) on the nuScenes validation dataset.
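The summary describes SGDM as fusing aligned depth and image features through a gated attention mechanism. As a rough illustration of that idea only, and not the authors' implementation, the sketch below shows a minimal gated fusion block in PyTorch; the module name, layer choices, and shapes are assumptions for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code): gated fusion of
# spatially aligned depth features and image features.
import torch
import torch.nn as nn

class GatedDepthImageFusion(nn.Module):
    """SGDM-style idea: a learned gate decides, per channel and per location,
    how much depth-derived information to blend into the image features."""
    def __init__(self, channels: int):
        super().__init__()
        # Gate computed from the concatenated modalities, values in (0, 1).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.out_proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, img_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # img_feat, depth_feat: (B, C, H, W), assumed to be pre-aligned (e.g., by PGDC/DAGF).
        g = self.gate(torch.cat([img_feat, depth_feat], dim=1))
        fused = g * depth_feat + (1.0 - g) * img_feat  # gated blend of the two modalities
        return self.out_proj(fused)

if __name__ == "__main__":
    fusion = GatedDepthImageFusion(channels=64)
    img = torch.randn(2, 64, 32, 88)   # toy image features
    dep = torch.randn(2, 64, 32, 88)   # toy aligned depth features
    print(fusion(img, dep).shape)      # torch.Size([2, 64, 32, 88])
```

The gate here simply weights the two modalities before a final projection; the paper's actual SGDM may differ in structure and detail.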

Takeaways, Limitations

Takeaways:
  • Presents an effective solution to the spatial misalignment problem that arises when fusing LiDAR and camera data.
  • Improves the accuracy of cross-modal feature alignment by leveraging 2D object priors.
  • Improves the structural awareness and accuracy of the BEV representation through the PGDC, DAGF, and SGDM modules.
  • Achieves SOTA performance on the nuScenes dataset.
Limitations:
  • Performance has been validated only on a single dataset (nuScenes), so generalization to other datasets is unverified.
  • The method depends on the 2D object detector, so detector errors can propagate to the overall system.
  • Further verification of generalization to real-world autonomous driving environments is needed.
  • Further study of computational complexity and real-time processing capability is needed.