Daily Arxiv
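This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their affiliated institutions; please cite the source when sharing.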
CTA: Cross-Task Alignment for Better Test Time Training
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
Domain Generalizable Portrait Style Transfer
StreamDiT: Real-Time Streaming Text-to-Video Generation
From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Neural-Network solver of ideal MHD equilibria
RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria
Hita: Holistic Tokenizer for Autoregressive Image Generation
Empirical Analysis Of Heuristic and Approximation Algorithms for the Mutual-Visibility Problem
Horus: A Protocol for Trustless Delegation Under Uncertainty
Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-shot Subsurface Understanding
SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures
WATS: Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling
IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
Enhancing Generalization of Spiking Neural Networks Through Temporal Regularization
Instruction Following by Boosting Attention of Large Language Models
Evaluating Logit-Based GOP Scores for Mispronunciation Detection
LLMs on support of privacy and security of mobile apps: state of the art and research directions
On the Fundamental Impossibility of Hallucination Control in Large Language Models
Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting
CuVSLAM: CUDA accelerated visual odometry and mapping
Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge
An empirical study of task and feature correlations in the reuse of pre-trained models
EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG
Hume: Introducing System-2 Thinking in Visual-Language-Action Model
Towards General Continuous Memory for Vision-Language Models
Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)
Bayesian Hierarchical Invariant Prediction
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling
Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review
The GenAI Generation: Student Views of Awareness, Preparedness, and Concern
Variational OOD State Correction for Offline Reinforcement Learning
Heat Diffusion Models - Interpixel Attention Mechanism
NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
Offline Learning and Forgetting for Reasoning with Large Language Models
Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models
PVChat: Personalized Video Chat with One-Shot Learning
Challenges and Trends in Egocentric Vision: A Survey
Eyes on the Environment: AI-Driven Analysis for Fire and Smoke Classification, Segmentation, and Detection
Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model
A Survey on Transformer Context Extension: Approaches and Evaluation
Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government
A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models
Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records
GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification
Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling
RSPO: Regularized Self-Play Alignment of Large Language Models
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
Efficient Risk-sensitive Planning via Entropic Risk Measures
Bayesian Optimization for Controlled Image Editing via LLMs
Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation
Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment
Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions
A Theory for Conditional Generative Modeling on Multiple Data Sources
Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport
Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics
DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution
Aria-UI: Visual Grounding for GUI Instructions
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG
Random Walks with Tweedie: A Unified View of Score-Based Diffusion Models
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning
Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model
An AI Theory of Mind Will Enhance Our Collective Intelligence
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
Longitudinal Ensemble Integration for sequential classification with multimodal data
Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
The Nexus of AR/VR, AI, UI/UX, and Robotics Technologies in Enhancing Learning and Social Interaction for Children with Autism Spectrum Disorders: A Systematic Review
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI
Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort
Empirical evidence of Large Language Model's influence on human spoken communication
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics
CoDy: Counterfactual Explainers for Dynamic Graphs
Optimal Transport for Domain Adaptation through Gaussian Mixture Models
Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs
Detecting value-expressive text posts in Russian social media
Deep neural networks have an inbuilt Occam's razor
TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
MedGemma Technical Report
Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift
Activation Steering for Chain-of-Thought Compression
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
Created by
Haebom
Authors
Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang
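Abstract
BMMR is a large-scale bilingual, multimodal, multi-discipline reasoning dataset containing 110,000 college-level questions for developing and evaluating large multimodal models (LMMs). It covers 300 UNESCO-defined subjects, spans question formats such as multiple-choice, fill-in-the-blank, and open-ended questions, and draws on sources such as books, exams, and quizzes. The data is curated and filtered through a scalable human-in-the-loop framework, and each instance is paired with a high-quality reasoning path. The dataset is split into BMMR-Eval, 20,458 high-quality instances for comprehensively evaluating the knowledge and reasoning of LMMs, and BMMR-Train, 88,991 instances that support further research and development. The authors also propose a process-based multi-discipline verifier (BMMR-Verifier) for accurate, fine-grained evaluation of reasoning paths. Experiments on 24 models show that even state-of-the-art models leave substantial room for improvement on BMMR-Eval. In-depth studies, including reasoning-chain analysis with BMMR-Verifier, reveal the challenges LMMs currently face in multi-discipline reasoning. The dataset is to be released publicly.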
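As a quick arithmetic check on the figures quoted in the abstract, the following minimal sketch adds up the two split sizes and reports each split's share. Only the two counts come from the abstract; the function name `split_summary` and the constant names are hypothetical, chosen for illustration.

```python
# Split sizes exactly as quoted in the abstract.
BMMR_EVAL_SIZE = 20_458
BMMR_TRAIN_SIZE = 88_991


def split_summary(eval_size: int, train_size: int) -> dict:
    """Return the combined size and the fraction held by each split.

    Hypothetical helper for illustration; not part of any BMMR tooling.
    """
    total = eval_size + train_size
    return {
        "total": total,
        "eval_share": eval_size / total,
        "train_share": train_size / total,
    }


summary = split_summary(BMMR_EVAL_SIZE, BMMR_TRAIN_SIZE)
print(f"total instances: {summary['total']:,}")
print(f"eval share:  {summary['eval_share']:.1%}")
print(f"train share: {summary['train_share']:.1%}")
```

The two splits sum to 109,449 instances, which is consistent with the rounded figure of 110,000 questions given in the abstract.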
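Takeaways, Limitations
• Takeaways:
◦ Provides BMMR, a new benchmark dataset for evaluating the multi-discipline reasoning ability of large multimodal models (LMMs).
◦ Shows the performance limits of state-of-the-art LMMs and points to directions for improvement.
◦ Analyzes model bias across disciplines and the performance gap between open-source and proprietary models.
◦ Confirms the effect of fine-tuning on BMMR-Train.
◦ Offers a detailed understanding of the challenges LMMs face in multi-discipline reasoning.
• Limitations:
◦ Despite the dataset's scale, state-of-the-art models still fall short of perfect performance on multi-discipline reasoning.
◦ Reasoning models remain biased toward specific subjects.
◦ A performance gap persists between open-source and proprietary models.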