Daily Arxiv

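This page compiles artificial intelligence papers published around the world.
The summaries are produced with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.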

MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multilesion interactive segmentation of neurofibromas in whole-body MRI

Soft Tokens, Hard Truths

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

COLT: Enhancing Video Large Language Models with Continual Tool Usage

Do You Need Proprioceptive States in Visuomotor Policies?

CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection

Self-Evolving LLMs via Continual Instruction Tuning

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Equip Pre-ranking with Target Attention by Residual Quantization

Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

Patterns in the Transition From Founder-Leadership to Community Governance of Open Source

Synthetic bootstrapped pretraining

PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization

Do Code Semantics Help? A Comprehensive Study on Execution Trace-Based Information for Code Large Language Models

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis

Beyond the Pre-Service Horizon: Infusing In-Service Behavior for Improved Financial Risk Forecasting

HumAIne-Chatbot: Real-Time Personalized Conversational AI via Reinforcement Learning

EAI-Avatar: Emotion-Aware Interactive Talking Head Generation

SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs

Do AI Companies Make Good on Voluntary Commitments to the White House?

Embedding Alignment in Code Generation for Audio

Kron-LoRA: Hybrid Kronecker-LoRA Adapters for Scalable, Sustainable Fine-tuning

From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs

Measuring Harmfulness of Computer-Using Agents

Enhancing RAG Efficiency with Adaptive Context Compression

CANDLE: A Cross-Modal Agentic Knowledge Distillation Framework for Interpretable Sarcopenia Diagnosis

Assay2Mol: large language model-based drug design using BioAssay context

Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation

White-Basilisk: A Hybrid Model for Code Vulnerability Detection

Energy Management for Renewable-Colocated Artificial Intelligence Data Centers

VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization

HAZEMATCHING: Dehazing Light Microscopy Images with Guided Conditional Flow Matching

Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs

Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

CUPID: Curating Data your Robot Loves with Influence Functions

Quantum-Classical Hybrid Quantized Neural Network

SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Why Do Some Inputs Break Low-Bit LLM Quantization?

A Quad-Step Approach to Uncertainty-Aware Deep Learning for Skin Cancer Classification

CellCLIP -- Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning

Urania: Differentially Private Insights into AI Use

RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset

To Trust Or Not To Trust Your Vision-Language Model's Prediction

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

GSPRec: Temporal-Aware Graph Spectral Filtering for Recommendation

EDBench: Large-Scale Electron Density Data for Molecular Modeling

Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare

LEMUR Neural Network Dataset: Towards Seamless AutoML

Towards Visual Text Grounding of Multimodal Large Language Model

Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial approaches

DP-LET: An Efficient Spatio-Temporal Network Traffic Prediction Framework

Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Challenges and Trends in Egocentric Vision: A Survey

Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models

Language Models Fail to Introspect About Their Knowledge of Language

Learning to Drive by Imitating Surrounding Vehicles

A Transformer Model for Predicting Chemical Products from Generic SMARTS Templates with Data Augmentation

Anomaly Detection in Complex Dynamical Systems: A Systematic Framework Using Embedding Theory and Physics-Inspired Consistency

Bridging Information Gaps with Comprehensive Answers: Improving the Diversity and Informativeness of Follow-Up Questions

HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Compact Rule-Based Classifier Learning via Gradient Descent

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

Representation Convergence: Mutual Distillation is Secretly a Form of Regularization

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets

Stylus: Repurposing Stable Diffusion for Training-Free Music Style Transfer on Mel-Spectrograms

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

A GEN AI Framework for Medical Note Generation

Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems

Efficient Fine-Tuning of Large Language Models for Automated Medical Documentation

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

On the Integration of Spatial-Temporal Knowledge: A Lightweight Approach to Atmospheric Time Series Forecasting

DeNOTS: Stable Deep Neural ODEs for Time Series

TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

RealitySummary: Exploring On-Demand Mixed Reality Text Summarization and Question Answering using Large Language Models

CLIP Can Understand Depth

CueGCL: Cluster-aware Personalized Self-Training for Unsupervised Graph Contrastive Learning

Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity

Markov Decision Processes under External Temporal Processes

MAPO: Mixed Advantage Policy Optimization

Similarity Field Theory: A General Mathematical Framework for Intelligence

CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models

Plan Verification for LLM-Based Embodied Task Completion Agents

GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning

Compression Strategies for Efficient Multimodal LLMs in Medical Contexts

Emergent Risk Awareness in Rational Agents under Resource Constraints

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Created by Haebom

Authors

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi

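Overview

OmniSpatial is a comprehensive and challenging spatial reasoning benchmark grounded in cognitive psychology. It has four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective taking. These are divided into 50 fine-grained subcategories, covering more than 8,400 question-answer pairs. The authors' experiments show that existing open-source and closed-source VLMs have significant limitations in comprehensive spatial reasoning. The paper also explores two strategies for strengthening spatial reasoning: PointGraph, which supplies explicit scene-graph cues, and SpatialCoT, a novel-view chain-of-thought.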
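Results on a benchmark organized this way are usually reported per category as well as overall. The sketch below shows how such a breakdown could be computed for multiple-choice answers. The `QAItem` record layout, the subcategory labels, and the sample predictions are hypothetical and are not taken from the OmniSpatial release. Only the four top-level category names come from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    # Hypothetical record layout; the real OmniSpatial data format may differ.
    category: str      # one of the 4 top-level categories
    subcategory: str   # fine-grained subcategory (labels below are made up)
    answer: str        # ground-truth option letter, e.g. "B"
    prediction: str    # option letter produced by the VLM under test


def accuracy_by(items, key):
    """Return {group: accuracy}, grouping items by key(item)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = key(item)
        total[group] += 1
        # Normalize whitespace and case so " b" and "B" count as the same option.
        if item.prediction.strip().upper() == item.answer.strip().upper():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Illustrative predictions only; these are not real model results.
items = [
    QAItem("Dynamic Reasoning", "subcat-a", "A", "A"),
    QAItem("Dynamic Reasoning", "subcat-a", "C", "B"),
    QAItem("Perspective Taking", "subcat-b", "D", "d"),
    QAItem("Spatial Interaction", "subcat-c", "B", "A"),
]

per_category = accuracy_by(items, key=lambda it: it.category)
overall = accuracy_by(items, key=lambda it: "overall")["overall"]
print(per_category)
print(overall)
```

Grouping by `lambda it: it.subcategory` instead gives the fine-grained breakdown. A per-category view like this is what reveals where a model falls short, for example on perspective taking versus dynamic reasoning.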

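Implications and Limitations

• Implications:
  ◦ Introduces OmniSpatial, a new benchmark that clearly exposes the limits of existing VLMs' spatial reasoning.
  ◦ Proposes the PointGraph and SpatialCoT strategies for improving spatial reasoning.
  ◦ Presents more comprehensive and complex spatial reasoning tasks grounded in cognitive psychology.

• Limitations:
  ◦ OmniSpatial is still an early-stage benchmark, so more diverse and complex spatial reasoning tasks may need to be added.
  ◦ Further study is needed on the generalization and efficiency of the proposed PointGraph and SpatialCoT strategies.
  ◦ The benchmark may need to be scaled up further.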
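PointGraph is described only as supplying explicit scene-graph cues. One plausible reading is that detected objects and their pairwise relations are serialized into text before the question is asked. The sketch below illustrates that idea and is not the authors' implementation. The `SceneObject` record, the coordinates (assumed to come from some detector or depth model), and the simple left/right rule along the x axis are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical object record: a label plus an estimated 3D center in metres.
    name: str
    x: float
    y: float
    z: float


def relation(a, b):
    """Coarse left/right relation of b relative to a along the x axis (assumed camera frame)."""
    return "right of" if b.x > a.x else "left of"


def scene_graph_prompt(objects, question):
    """Serialize objects and pairwise cues into text placed before the question."""
    lines = ["Scene graph:"]
    for obj in objects:
        lines.append(f"- {obj.name} at ({obj.x:.1f}, {obj.y:.1f}, {obj.z:.1f})")
    # Add one relation line per unordered pair of objects.
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            dist = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
            lines.append(f"- {b.name} is {relation(a, b)} {a.name}, {dist:.1f} m apart")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


objects = [SceneObject("chair", 0.0, 0.0, 2.0), SceneObject("table", 3.0, 0.0, 6.0)]
prompt = scene_graph_prompt(objects, "Is the table to the left of the chair?")
print(prompt)
```

The resulting text would be passed to a VLM together with the image. The design bet is that explicit coordinates and relations are easier for the model to use than spatial layout it must infer from pixels alone.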
