/
/
Daily Arxiv
Daily Arxiv
世界中で発行される人工知能関連の論文をまとめるページです。
このページはGoogle Geminiを活用して要約し、非営利で運営しています。
論文の著作権は著者および関連機関にあり、共有する際は出典を明記してください。
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Ada-TransGNN: An Air Quality Prediction Model Based On Adaptive Graph Convolutional Networks
Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery
Consistent Opponent Modeling of Static Opponents in Imperfect-Information Games
Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes
Agentic AI for Software: thoughts from Software Engineering community
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling
A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems
Generative Artificial Intelligence and Agents in Research and Teaching
CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression
Comparative Analysis of UAV Path Planning Algorithms for Efficient Navigation in Urban 3D Environments
Retrieval Enhanced Feedback via In-context Neural Error-book
From Confidence to Collapse in LLM Factual Robustness
On Task Vectors and Gradients
Learning in Repeated Multi-Objective Stackelberg Games with Payoff Manipulation
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
DLLMQuant: Quantizing Diffusion-based Large Language Models
LLM-Enhanced Linear Autoencoders for Recommendation
Leveraging GNN to Enhance MEF Method in Predicting ENSO
Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation
New Kid in the Classroom: Exploring Student Perceptions of AI Coding Assistants
Large Language Model-Based Framework for Explainable Cyberattack Detection in Automatic Generation Control Systems
SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
Apple Intelligence Foundation Language Models: Tech Report 2025
SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models
Demographic-aware fine-grained classification of pediatric wrist fractures
Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Solar Altitude Guided Scene Illumination
An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models
EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion
RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
Revisiting SSL for sound event detection: complementary fusion and adaptive post-processing
Concept-Guided Interpretability via Neural Chunking
Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study
An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
Video CLIP Model for Multi-View Echocardiography Interpretation
A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Noise-based reward-modulated learning
Faster Parameter-Efficient Tuning with Token Redundancy Reduction
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
TableTalk: Scaffolding Spreadsheet Development with a Language Agent
StagFormer: Time Staggering Transformer Decoding for RunningLayers In Parallel
Provably-Safe Neural Network Training Using Hybrid Zonotope Reachability Analysis
Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
Safe Multiagent Coordination via Entropic Exploration
TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use
Cultural Dimensions of AI Perception: Charting Expectations, Risks, Benefits, Tradeoffs, and Value in Germany and China
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI
Hierarchical Object-Oriented POMDP Planning for Object Rearrangement
From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification
Secure Reinforcement Learning via Shuffle Privacy Model
Overcoming label shift with target-aware federated learning
Benchmarking XAI Explanations with Human-Aligned Evaluations
HonestCyberEval: An AI Cyber Risk Benchmark for Automated Software Exploitation
Leveraging Multi-facet Paths for Heterogeneous Graph Representation Learning
GeNet: A Multimodal LLM-Based Co-Pilot for Network Topology and Configuration
ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL
Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis
Learning county from pixels: corn yield prediction with attention-weighted multiple instance learning
Memory augment is All You Need for image restoration
Rethinking Distribution Shifts: Empirical Analysis and Inductive Modeling for Tabular Data
DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models
Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles
Bayesian Deep Learning for Segmentation for Autonomous Safe Planetary Landing
ST-Raptor: LLM-Powered Semi-Structured Table Question Answering
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots
Profile-Aware Maneuvering: A Dynamic Multi-Agent System for Robust GAIA Problem Solving by AWorld
Multi-Agent LLMs as Ethics Advocates for AI-Based Systems
Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
MRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners
YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models
Consensus in Motion: A Case of Dynamic Rationality of Sequential Learning in Probability Aggregation
Can Large Language Models Act as Ensembler for Multi-GNNs?
Pessimistic Iterative Planning with RNNs for Robust POMDPs
Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding
Integrating Large Language Model for Improved Causal Discovery
A Survey on Causal Discovery: Theory and Practice
Generative Interfaces for Language Models
Interpolating Speaker Identities in Embedding Space for Data Expansion
VibeVoice Technical Report
LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding
Understanding Tool-Integrated Reasoning
Emotions as Ambiguity-aware Ordinal Representations
Real-Time Model Checking for Closed-Loop Robot Reactive Planning
Load more
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
Created by
Haebom
作者
Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu
概要
この論文では、ビジョン大規模言語モデル(VLLM)を使用した長時間の画像理解の難しさを解決するために、テキスト条件付き適応フレーム選択モジュールと時間フレームトークンマージ技術、空間トークンサンプリングモジュール、およびマージ戦略を活用したBalanced-VLLM(B-VLLM)フレームワークを紹介します。既存のVLLMは、ビデオダウンサンプリングまたは各フレームの視覚トークン数の減少によって時間的または空間的情報損失が発生する問題を解決するために、課題関連の時空間的手がかりを効果的に活用しながら、VLLMのコンテキストウィンドウの長さ内で視覚トークン数を制限する方法を提案します。実験の結果、B-VLLMがさまざまなイメージングベンチマークで優れた性能を示すことが確認されました。
Takeaways、Limitations
•
Takeaways:
◦
VLLMベースの長時間映像理解の効率性を大幅に向上させました。
◦
テキスト条件付き適応フレーム選択とトークンマージ戦略により、課題関連情報の損失を最小限に抑えました。
◦
さまざまなイメージングベンチマークで、従来の方法より優れたパフォーマンスを達成しました。
◦
公開されたコードで再現性を高めました。
•
Limitations:
◦
提案された方法の計算の複雑さの詳細な分析が不足しています。
◦
特定のタイプの画像データに対する性能偏向の可能性が存在する。
◦
より多様で複雑な画像理解の課題に対する追加の実験が必要です。
PDFを見る
Made with Slashpage