Daily Arxiv
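This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright in each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.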
PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
Swin-TUNA: A Novel PEFT Approach for Accurate Food Image Segmentation
EarthLink: A Self-Evolving AI Agent for Climate Science
Reality Proxy: Fluid Interactions with Real-World Objects in MR via Abstract Representations
Leveraging multi-source and heterogeneous signals for fatigue detection
Segmentation-free Goodness of Pronunciation
Adaptive Relative Pose Estimation Framework with Dual Noise Tuning for Safe Approaching Maneuvers
Compositional Coordination for Multi-Robot Teams with Large Language Models
Diffusion Beats Autoregressive in Data-Constrained Settings
The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts
EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control
Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation
Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards
GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks
SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning
Multilingual LLMs Are Not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation
Frequency-Dynamic Attention Modulation for Dense Prediction
A Survey of Deep Learning for Geometry Problem Solving
EEG Foundation Models: A Critical Review of Current Progress and Future Directions
Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
A PBN-RL-XAI Framework for Discovering a "Hit-and-Run" Therapeutic Strategy in Melanoma
Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization
A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Mechanistic Indicators of Understanding in Large Language Models
Scaling RL to Long Videos
Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model
Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses
SyncMapV2: Robust and Adaptive Unsupervised Segmentation
LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs
Why Do Class-Dependent Evaluation Effects Occur with Time Series Feature Attributions? A Synthetic Data Investigation
Diffuse and Disperse: Image Generation with Representation Regularization
LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models
MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models
Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits
Machine Learning Solutions Integrated in an IoT Healthcare Platform for Heart Failure Risk Stratification
Beyond Low-rank Decomposition: A Shortcut Approach for Efficient On-Device Learning
Vision Transformers in Precision Agriculture: A Comprehensive Survey
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Position: An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research
LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Trigger without Trace: Towards Stealthy Backdoor Attack on Text-to-Image Diffusion Models
Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS
Quantum Machine Learning in Precision Medicine and Drug Discovery - A Game Changer for Tailored Treatments?
A general language model for peptide identification
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
LLM Alignment as Retriever Optimization: An Information Retrieval Perspective
Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings
Online Housing Market
Integrated Learning and Optimization for Congestion Management and Profit Maximization in Real-Time Electricity Market
Integrating Evidence into the Design of XAI and AI-based Decision Support Systems: A Means-End Framework for End-users in Construction
Scalable Parameter Design for Superconducting Quantum Circuits with Graph Neural Networks
A Survey of Event Causality Identification: Taxonomy, Challenges, Assessment, and Prospects
Neural Corrective Machine Unranking
Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation
Differentiable Motion Manifold Primitives for Reactive Motion Generation under Kinodynamic Constraints
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
RUMI: Rummaging Using Mutual Information
Neural Machine Unranking
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models
DualXDA: Towards Sparse, Efficient and Explainable Data Attribution in Large AI Models
Quantifying the Uniqueness and Divisiveness of Presidential Discourse
DocTER: Evaluating Document-based Knowledge Editing
Learning Concepts Definable in First-Order Logic with Counting
Recognizing and Eliciting Weakly Single Crossing Profiles on Trees
Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments
Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs
When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems
An Integrated Framework of Prompt Engineering and Multidimensional Knowledge Graphs for Legal Dispute Analysis
DisMS-TS: Eliminating Redundant Multi-Scale Features for Time Series Classification
Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games
Beamforming and Resource Allocation for Delay Minimization in RIS-Assisted OFDM Systems
Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
SuperARC: An Agnostic Test for Narrow, General, and Super Intelligence Based On the Principles of Recursive Compression and Algorithmic Probability
IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation
OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM
Chemical reasoning in LLMs unlocks strategy-aware synthesis planning and reaction mechanism elucidation
BEARCUBS: A benchmark for computer-using web agents
From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
HPS: Hard Preference Sampling for Human Preference Alignment
A Differentiated Reward Method for Reinforcement Learning based Multi-Vehicle Cooperative Decision-Making Algorithms
Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases
On the Structure of Game Provenance and its Applications
I-CEE: Tailoring Explanations of Image Classification Models to User Expertise
SIDA: Synthetic Image Driven Zero-shot Domain Adaptation
3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation
Moving Out: Physically-grounded Human-AI Collaboration
SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
Approximate SMT Counting Beyond Discrete Domains
DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
Scaling RL to Long Videos
Created by
Haebom
Authors
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
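Abstract
This paper introduces a full-stack framework that uses reinforcement learning to scale reasoning in vision-language models (VLMs) to long videos. It integrates three key components. First, LongVideo-Reason, a large-scale dataset of 104,000 long-video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs. Second, a two-stage training pipeline that extends VLMs through chain-of-thought supervised fine-tuning (CoT-SFT) followed by reinforcement learning (RL). Third, MR-SP, a training infrastructure for long-video RL that combines sequence parallelism with a vLLM-based engine tailored to long videos and uses cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.0% accuracy on VideoMME without subtitles and 70.7% with subtitles, and consistently outperforms LongVILA-R1 across multiple benchmarks. LongVILA-R1's performance also improves steadily as the number of input video frames increases. The MR-SP system speeds up long-video RL training by up to 2.1x. Finally, the authors release a training system for RL that supports multiple modalities (video, text, audio), multiple models (the VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on videos up to one hour long (for example, 3,600 frames, or about 256,000 tokens).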
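To put the quoted capacity figures in perspective, the short calculation below derives the implied sampling rate and per-frame token budget from the numbers in the abstract (3,600 frames, about 256,000 tokens, up to one hour, 8 GPUs). It is only a back-of-the-envelope sketch: the function names are made up for illustration, and the even split of tokens across GPUs is an assumption about how sequence parallelism might divide the sequence, not something the summary states.

```python
# Figures quoted in the abstract above.
FRAMES = 3_600             # frames in the longest supported video
TOKENS = 256_000           # approximate token count for those frames
VIDEO_SECONDS = 60 * 60    # "up to one hour"
GPUS = 8                   # GPUs in the single A100 node


def tokens_per_frame(tokens: int, frames: int) -> float:
    """Average number of tokens spent on each video frame."""
    return tokens / frames


def frames_per_second(frames: int, seconds: int) -> float:
    """Implied frame sampling rate for the video."""
    return frames / seconds


def tokens_per_gpu(tokens: int, gpus: int) -> float:
    """Tokens per GPU if the sequence were sharded evenly (assumption)."""
    return tokens / gpus


per_frame = tokens_per_frame(TOKENS, FRAMES)
fps = frames_per_second(FRAMES, VIDEO_SECONDS)
per_gpu = tokens_per_gpu(TOKENS, GPUS)

print(f"tokens per frame: {per_frame:.1f}")
print(f"sampling rate:    {fps:.1f} frames/s")
print(f"tokens per GPU:   {per_gpu:.0f} (assuming an even split)")
```

Under these figures, an hour of video corresponds to roughly one sampled frame per second at about 71 tokens per frame.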
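Takeaways and Limitations
• Takeaways:
◦ Demonstrates improved reasoning performance of video-language models on long videos compared with earlier work.
◦ Presents MR-SP, a new framework for efficient reinforcement learning on long videos.
◦ Releases LongVideo-Reason, a large-scale long-video QA dataset.
◦ The released training system supports multiple modalities and models, which improves reproducibility and scalability for follow-up research.
• Limitations:
◦ The diversity and balance of the dataset are not described in detail.
◦ Specific details of the reinforcement learning algorithm are missing, which may make the results hard to reproduce.
◦ Performance results are reported for a specific hardware environment (an A100 node), so further study is needed on how well they generalize.
◦ Information about the parameter size of the LongVILA-R1-7B model is lacking.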
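The abstract mentions that MR-SP uses cached video embeddings for rollout and prefilling. The general idea of such a cache can be sketched as follows: encode each video once, then reuse the result for every rollout sampled from it. This is a minimal illustration of the caching pattern only, not the authors' implementation; `EmbeddingCache`, `toy_encoder`, and `sample_rollouts` are hypothetical names, and the encoder is a stand-in that returns a tiny fake vector.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical cache: runs the video encoder at most once per video."""

    def __init__(self, encode: Callable[[str], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the expensive encoder runs

    def get(self, video_id: str) -> List[float]:
        if video_id not in self._store:
            self.encoder_calls += 1
            self._store[video_id] = self._encode(video_id)
        return self._store[video_id]


def toy_encoder(video_id: str) -> List[float]:
    """Stand-in for a real video encoder (assumption): a 2-number 'embedding'."""
    return [float(len(video_id)), float(sum(map(ord, video_id)) % 97)]


def sample_rollouts(cache: EmbeddingCache, video_id: str, num_rollouts: int) -> List[str]:
    """Generate several placeholder rollouts that all reuse one cached embedding."""
    rollouts = []
    for i in range(num_rollouts):
        embedding = cache.get(video_id)  # cache hit after the first iteration
        rollouts.append(f"rollout-{i}:dim={len(embedding)}")
    return rollouts


cache = EmbeddingCache(toy_encoder)
outputs = sample_rollouts(cache, "video_001", 4)
print(outputs)
print("encoder calls:", cache.encoder_calls)
```

In RL training, several responses are typically sampled for the same input, so avoiding repeated encoding of a long video is one plausible source of savings; how much of the reported speedup of up to 2.1x comes from caching is not stated in the summary.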