Daily Arxiv
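This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.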
HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models
Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety
Emergent Social Dynamics of LLM Agents in the El Farol Bar Problem
The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors
Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare
DEXOP: A Device for Robotic Transfer of Dexterous Human Manipulation
Reinforcement Learning for Robust Ageing-Aware Control of Li-ion Battery Systems with Data-Driven Formal Verification
RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models
Gravity Well Echo Chamber Modeling With An LLM-Based Confirmation Bias Model
Insights from Gradient Dynamics: Gradient Autoscaled Normalization
Efficient Virtuoso: A Latent Diffusion Transformer Model for Goal-Conditioned Trajectory Planning
MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds
DCPO: Dynamic Clipping Policy Optimization
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Can AI be Auditable?
Robotic Fire Risk Detection based on Dynamic Knowledge Graph Reasoning: An LLM-Driven Approach with Graph Chain-of-Thought
Navigating the EU AI Act: Foreseeable Challenges in Qualifying Deep Learning-Based Automated Inspections of Class III Medical Devices
Complementary Learning System Empowers Online Continual Learning of Vehicle Motion Forecasting in Smart Cities
MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts
QuadKAN: KAN-Enhanced Quadruped Motion Control via End-to-End Reinforcement Learning
MovieCORE: COgnitive REasoning in Movies
Automatic Prompt Optimization with Prompt Distillation
Membership Inference Attacks on LLM-based Recommender Systems
Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning
Convergence and Generalization of Anti-Regularization for Parametric Models
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Bridging Generalization and Personalization in Human Activity Recognition via On-Device Few-Shot Learning
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering
Using Artificial Intuition in Distinct, Minimalist Classification of Scientific Abstracts for Management of Technology Portfolios
Semantic Discrepancy-aware Detector for Image Forgery Identification
Quantum-Efficient Reinforcement Learning Solutions for Last-Mile On-Demand Delivery
BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models
Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning
Real-Time Analysis of Unstructured Data with Machine Learning on Heterogeneous Architectures
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
An Efficient Continuous-Time MILP for Integrated Aircraft Hangar Scheduling and Layout
DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI Systems
COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning
Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA
Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning
LanternNet: A Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
RecPS: Privacy Risk Scoring for Recommender Systems
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)
Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias
PLAME: Lightweight MSA Design Advances Protein Folding From Evolutionary Embeddings
Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles
Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model
Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
Precise Bayesian Neural Networks
Transit for All: Mapping Equitable Bike2Subway Connection using Region Representation Learning
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems
SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations
Scaling Laws of Motion Forecasting and Planning - Technical Report
Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems
Unsupervised Evolutionary Cell Type Matching via Entropy-Minimized Optimal Transport
Multi-output Classification using a Cross-talk Architecture for Compound Fault Diagnosis of Motors in Partially Labeled Condition
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
Steering LLM Reasoning Through Bias-Only Adaptation
MetaSTH-Sleep: Towards Effective Few-Shot Sleep Stage Classification for Health Management with Spatial-Temporal Hypergraph Enhanced Meta-Learning
InterFeat: A Pipeline for Finding Interesting Scientific Features
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting
Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning
Action Flow Matching for Continual Robot Learning
Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Byzantine-Robust Federated Learning Using Generative Adversarial Networks
Beyond SHAP and Anchors: A large-scale experiment on how developers struggle to design meaningful end-user explanations
VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making
DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support
Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis
Kolmogorov-Arnold Fourier Networks
Position: LLMs Can be Good Tutors in English Education
Predicting Steady-State Behavior in Complex Networks with Graph Neural Networks
Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models
Motion-enhanced Cardiac Anatomy Segmentation via an Insertable Temporal Attention Module
Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion
Integrating Evidence into the Design of XAI and AI-based Decision Support Systems: A Means-End Framework for End-users in Construction
Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection
Neural Port-Hamiltonian Differential Algebraic Equations for Compositional Learning of Electrical Networks
Sequential Controlled Langevin Diffusions
Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives
Lessons from Studying Two-Hop Latent Reasoning
HierTOD: A Task-Oriented Dialogue System Driven by Hierarchical Goals
Flexible Coded Distributed Convolution Computing for Enhanced Straggler Resilience and Numerical Stability in Distributed CNNs
FACEGroup: Feasible and Actionable Counterfactual Explanations for Group Fairness
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
Created by
Haebom
Authors
Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah
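Abstract
This paper proposes Visual Input Structure for Enhanced Reasoning (VISER) to address limitations in the visual reasoning of vision-language models (VLMs). VLMs struggle to reliably bind perceptual features to the visual objects they refer to. VISER is a simple, effective method that augments the visual input with low-level spatial structure and adds a text prompt that guides sequential, spatially aware parsing. In experiments, VISER yields substantial gains across a range of visual reasoning tasks. In particular, it improves GPT-4o's visual search accuracy by 25.00% and counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and improves performance on spatial relationship tasks on a 2D synthetic dataset by 9.50%. The results underline the importance of visual input design over purely linguistic approaches and suggest that low-level visual structuring is a powerful, underexplored direction for improving compositional visual reasoning.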
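The two ingredients described above (a structured visual input plus a prompt that guides sequential parsing) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the summary only says "low-level spatial structure", so the evenly spaced horizontal lines, the in-memory image representation (a list of rows of RGB tuples), the function names `add_horizontal_lines` and `build_prompt`, and the prompt wording are all assumptions made for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a real image type


def add_horizontal_lines(image: Image, num_bands: int,
                         color: Pixel = (0, 0, 0)) -> Image:
    """Return a copy of `image` split into `num_bands` bands by horizontal lines.

    Assumption: horizontal lines are one possible "low-level spatial
    structure"; the summarized text does not specify the structure used.
    """
    if num_bands < 1:
        raise ValueError("num_bands must be at least 1")
    out = [list(row) for row in image]  # copy so the input is not modified
    height = len(out)
    if height == 0:
        return out
    for k in range(1, num_bands):
        y = k * height // num_bands  # row index of the k-th divider
        out[y] = [color] * len(out[y])
    return out


def build_prompt(question: str, num_bands: int) -> str:
    """Build a hypothetical prompt that asks for band-by-band parsing."""
    return (
        f"The image is divided into {num_bands} horizontal bands by lines. "
        "Scan the bands one at a time from top to bottom, note the objects "
        f"in each band, and then answer: {question}"
    )


if __name__ == "__main__":
    blank = [[(255, 255, 255)] * 8 for _ in range(9)]  # 8x9 white image
    structured = add_horizontal_lines(blank, num_bands=3)
    divider_rows = [y for y, row in enumerate(structured) if row[0] == (0, 0, 0)]
    print(divider_rows)
    print(build_prompt("How many red circles are there?", 3))
```

In a real pipeline the structured image and the prompt would be sent together to the VLM in a single query; that call is left out here because it depends on the model API in use.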
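Takeaways, Limitations
• Takeaways:
◦ Shows that low-level visual structuring is an effective way to improve the visual reasoning of VLMs.
◦ Emphasizes the importance of visual input design over purely language-based approaches.
◦ VISER mitigates the binding problem with a single query at inference time, which demonstrates its efficiency.
◦ Achieves performance gains across a range of visual reasoning tasks, including visual search, counting, scene description, and spatial relationship understanding.
• Limitations:
◦ Only results on 2D synthetic datasets are presented so far; further research is needed on generalization to real-world datasets.
◦ An analysis of the computational cost and scalability of the proposed method is missing.
◦ Further research is needed on generalization across different VLM architectures.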