Daily Arxiv
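This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.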
Interleaving Reasoning for Better Text-to-Image Generation
Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation
Signal-Based Malware Classification Using 1D CNNs
Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning
BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy
What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
HodgeFormer: Transformers for Learnable Operators on Triangular Meshes through Data-Driven Hodge Matrices
CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models
Pilot Study on Generative AI and Critical Thinking in Higher Education Classrooms
ZkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Ultra-Low-Latency Spiking Neural Networks with Temporal-Dependent Integrate-and-Fire Neuron Model for Objects Detection
Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models
A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems
Trust but Verify! A Survey on Verification Design for Test-time Scaling
Research on Conversational Recommender System Considering Consumer Types
A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control
Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
A Mixed User-Centered Approach to Enable Augmented Intelligence in Intelligent Tutoring Systems: The Case of MathAIde app
Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs
MoRPI-PINN: A Physics-Informed Framework for Mobile Robot Pure Inertial Navigation
Conditional Video Generation for High-Efficiency Video Compression
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models
Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
HueManity: Probing Fine-Grained Visual Perception in MLLMs
Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments
Localizing Persona Representations in LLMs
Multi-output Classification using a Cross-talk Architecture for Compound Fault Diagnosis of Motors in Partially Labeled Condition
SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
Visuospatial Cognitive Assistant
Overflow Prevention Enhances Long-Context Recurrent LLMs
GRADA: Graph-based Reranking against Adversarial Documents Attack
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
Llama-Nemotron: Efficient Reasoning Models
Tripartite-GraphRAG via Plugin Ontologies
DMS-Net:Dual-Modal Multi-Scale Siamese Network for Binocular Fundus Image Classification
Enhancing Traffic Incident Response through Sub-Second Temporal Localization with HybridMamba
Audio-centric Video Understanding Benchmark without Text Shortcut
The Model Hears You: Audio Language Model Deployments Should Consider the Principle of Least Privilege
Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution
DistJoin: A Decoupled Join Cardinality Estimator based on Adaptive Neural Predicate Modulation
MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
TrojanRobot: Physical-world Backdoor Attacks Against VLM-based Robotic Manipulation
Automatically Detecting Online Deceptive Patterns
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning
CTourLLM: Enhancing LLMs with Chinese Tourism Knowledge
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
MSRFormer: Road Network Representation Learning using Multi-scale Feature Fusion of Heterogeneous Spatial Interactions
Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts
EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation
AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning
MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes
Benchmarking for Domain-Specific LLMs: A Case Study on Academia and Beyond
CountQA: How Well Do MLLMs Count in the Wild?
ASP-FZN: A Translation-based Constraint Answer Set Solver
MedGellan: LLM-Generated Medical Guidance to Support Physicians
Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs
Addition in Four Movements: Mapping Layer-wise Information Trajectories in LLMs
GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
Automatic Reward Shaping from Confounded Offline Data
Visualizing Thought: Conceptual Diagrams Enable Robust Combinatorial Planning in LMMs
COMMA: A Communicative Multimodal Multi-Agent Benchmark
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism
Self-Emotion-Mediated Exploration in Artificial Intelligence Mirrors: Findings from Cognitive Psychology
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
ACE and Diverse Generalization via Selective Disagreement
Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges
ImportSnare: Directed "Code Manual" Hijacking in Retrieval-Augmented Code Generation
Breaking Android with AI: A Deep Dive into LLM-Powered Exploitation
Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s
GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models
Multimodal Contrastive Pretraining of CBCT and IOS for Enhanced Tooth Segmentation
Uncovering Scaling Laws for Large Language Models via Inverse Problems
Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets
Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
Forecasting Russian Equipment Losses Using Time Series and Deep Learning Models
Enhanced SegNet with Integrated Grad-CAM for Interpretable Retinal Layer Segmentation in OCT Images
Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment
XSRD-Net: EXplainable Stroke Relapse Detection
Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning
What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects
Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
Enhancing Online Learning by Integrating Biosensors and Multimodal Learning Analytics for Detecting and Predicting Student Behavior: A Review
Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems
Visualizing Thought: Conceptual Diagrams Enable Robust Combinatorial Planning in LMMs
Created by
Haebom
Authors
Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
Summary
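This paper proposes "Visual Thinking", a new framework that mimics human reasoning to improve the performance of large multimodal models (LMMs) on complex multi-step tasks. Visual Thinking overcomes the limits of text-only reasoning by having the LMM reason through conceptual diagrams that it generates itself. The approach is optimized by integrating beam search and deep backtracking into a graph-based reasoning framework, and it is zero-shot: it operates from the task description alone. In experiments on PDDL planning domains, it substantially outperformed prior methods on a range of complex planning problems, including Blocksworld and Floor Tiles. In particular, it raised GPT-4o's solve rate on Blocksworld from 35.5% to 90.2%, and on harder problems it surpassed the o1-preview model. These results indicate that conceptual diagrams play an important role as a reasoning medium for LMMs.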
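To make the search idea in the summary concrete, here is a minimal sketch of beam search with backtracking over a graph of Blocksworld states. It is not the authors' implementation: no LMM or diagram is involved, and the state encoding, the `misplaced` heuristic, the reserve-list form of backtracking, and every function and parameter name are assumptions chosen for illustration only.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Stack = Tuple[str, ...]
State = Tuple[Stack, ...]  # sorted tuple of non-empty stacks, each listed bottom-to-top


def canon(stacks: Iterable[Sequence[str]]) -> State:
    """Canonical form: drop empty stacks and sort, so equal layouts compare equal."""
    return tuple(sorted(tuple(s) for s in stacks if s))


def successors(state: State) -> List[State]:
    """States reachable by moving one top block onto the table or onto another stack."""
    result = set()
    for i, src in enumerate(state):
        block, rest = src[-1], src[:-1]
        others = [s for j, s in enumerate(state) if j != i]
        if rest:  # moving a lone block to the table would change nothing
            result.add(canon(others + [rest, (block,)]))
        for j in range(len(others)):
            moved = others[:j] + [others[j] + (block,)] + others[j + 1:]
            result.add(canon(moved + [rest]))
    return sorted(result)


def misplaced(state: State, goal: State) -> int:
    """Assumed heuristic: count blocks whose stack, from the bottom up, is not a goal prefix."""
    goal_prefixes = {g[:k] for g in goal for k in range(1, len(g) + 1)}
    return sum(1 for s in state for k in range(1, len(s) + 1) if s[:k] not in goal_prefixes)


def beam_search_with_backtracking(
    start: Iterable[Sequence[str]],
    goal: Iterable[Sequence[str]],
    beam_width: int = 2,
    max_expansions: int = 1000,
) -> Optional[List[State]]:
    """Return a list of states from start to goal, or None if the budget runs out."""
    start_state, goal_state = canon(start), canon(goal)
    parent: Dict[State, Optional[State]] = {start_state: None}
    beam: List[State] = [start_state]
    reserve: List[Tuple[int, State]] = []  # pruned candidates kept for backtracking
    expansions = 0
    while beam and expansions < max_expansions:
        candidates: List[Tuple[int, State]] = []
        for state in beam:
            if state == goal_state:
                path: List[State] = []
                node: Optional[State] = state
                while node is not None:  # walk parent links back to the start
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            expansions += 1
            for nxt in successors(state):
                if nxt not in parent:  # skip states that were already generated
                    parent[nxt] = state
                    candidates.append((misplaced(nxt, goal_state), nxt))
        candidates.sort()
        beam = [s for _, s in candidates[:beam_width]]
        reserve.extend(candidates[beam_width:])
        if not beam and reserve:
            # Backtrack: the beam hit a dead end, so revive the best pruned states.
            reserve.sort()
            beam = [s for _, s in reserve[:beam_width]]
            reserve = reserve[beam_width:]
    return None


if __name__ == "__main__":
    # C sits on A and B is on the table; the goal is a single stack C, B, A (bottom to top).
    plan = beam_search_with_backtracking([("A", "C"), ("B",)], [("C", "B", "A")])
    for step in plan or []:
        print(step)
```

The beam keeps only the `beam_width` most promising states per level, which bounds the work per step. The reserve list is what lets the search recover when every state in the beam leads to already-visited layouts. The returned plan is valid but not guaranteed to be the shortest one.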
Takeaways, Limitations
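• Takeaways:
◦ Presents a new approach to improving LMM reasoning: the Visual Thinking framework uses conceptual diagrams to overcome the limits of text-only reasoning and improves the model's ability to solve complex problems.
◦ Zero-shot operation: the framework works from a natural-language description alone, without human intervention, which makes it more practical.
◦ Strong performance across a range of complex planning problems: substantial gains over prior methods were demonstrated on multiple benchmarks.
◦ Highlights the importance of conceptual diagrams: the results show that conceptual diagrams are an effective medium for LMM reasoning.
• Limitations:
◦ Dependence on the accuracy of diagram generation and interpretation: performance may vary with the quality of the generated diagrams.
◦ Evaluation limited to specific problem types: the evaluation covers only PDDL planning domains, and generalization to other kinds of problems requires further study.
◦ Computational cost: the combination of beam search and backtracking may make the algorithm computationally expensive.
◦ Diagram interpretability: the interpretability of the generated diagrams needs further analysis.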