Daily Arxiv

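This page collects and summarizes AI papers published around the world.
The summaries are produced with Google Gemini, and the page is run on a non-commercial basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, simply credit the source.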

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training

R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning

DiTraj: training-free trajectory control for video diffusion transformer

Agribot: agriculture-specific question answer system

$\mathbf{Li_2}$: A Framework on Dynamics of Feature Emergence and Delayed Generalization

Dual-Head Reasoning Distillation: Improving Classifier Accuracy with Train-Time-Only Reasoning

Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence

Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

SiNGER: A Clearer Voice Distills Vision Transformers Further

i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents

Experience Deploying Containerized GenAI Services at an HPC Center

EmbeddingGemma: Powerful and Lightweight Text Representations

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Responsible AI Technical Report

Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks

Diversity Boosts AI-Generated Text Detection

SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

Self-Evolving LLMs via Continual Instruction Tuning

Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory

Joint Memory Frequency and Computing Frequency Scaling for Energy-efficient DNN Inference

StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions

Accurate and Efficient Low-Rank Model Merging in Core Space

Patterns in the Transition From Founder-Leadership to Community Governance of Open Source

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning

Evaluating undergraduate mathematics examinations in the era of generative AI: a curriculum-level case study

Learning to Route: Per-Sample Adaptive Routing for Multimodal Multitask Prediction

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication

A Systematic Survey on Large Language Models for Evolutionary Optimization: From Modeling to Solving

DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum

Diffusion Generative Models Meet Compressed Sensing, with Applications to Imaging and Finance

Co-Evolving Complexity: An Adversarial Framework for Automatic MARL Curricula

Grocery to General Merchandise: A Cross-Pollination Recommender using LLMs and Real-Time Cart Context

Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions

GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping

Can General-Purpose Omnimodels Compete with Specialists? A Case Study in Medical Image Segmentation

Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

TReF-6: Inferring Task-Relevant Frames from a Single Demonstration for One-Shot Skill Generalization

Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks

End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Automatic Question & Answer Generation Using Generative Large Language Model (LLM)

CORE-RAG: Lossless Compression for Retrieval-Augmented LLMs via Reinforcement Learning

What Matters in Data for DPO?

Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data

Speculative Safety-Aware Decoding

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports

ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals

Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Contrastive Representations for Temporal Reasoning

Semantic Discrepancy-aware Detector for Image Forgery Identification

G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

PakBBQ: A Culturally Adapted Bias Benchmark for QA

MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

CTTS: Collective Test-Time Scaling

The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Can Language Models Discover Scaling Laws?

When Engineering Outruns Intelligence: Rethinking Instruction-Guided Navigation

A Markov Categorical Framework for Language Modeling

Moving Out: Physically-grounded Human-AI Collaboration

GLANCE: Graph Logic Attention Network with Cluster Enhancement for Heterophilous Graph Representation Learning

The Ever-Evolving Science Exam

Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

GRID: Scalable Task-Agnostic Prompt-Based Continual Learning for Language Models

Learning to summarize user information for personalized reinforcement learning from human feedback

Making Language Model a Hierarchical Classifier

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

BenchRL-QAS: Benchmarking reinforcement learning algorithms for quantum architecture search

Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Mitigating Watermark Forgery in Generative Models via Randomized Key Selection

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

PRIME: Large Language Model Personalization with Cognitive Dual-Memory and Personalized Thought Process

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

Empirical Analysis Of Heuristic and Approximation Algorithms for the Mutual-Visibility Problem

Learning to Segment for Vehicle Routing Problems

Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap

Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Semantic-guided Diverse Decoding for Large Language Model

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Created by

Haebom

Authors

Feilong Chen, Yijiang Liu, Yi Huang, Hao Wang, Miren Tian, Ya-Qi Yu, Minghui Liao, Jihao Wu

MindVL: A Multimodal Large Language Model Trained on Ascend NPUs

Overview

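This paper introduces MindVL, a multimodal large language model (MLLM) trained on Ascend NPUs. MindVL targets two problems in existing MLLM training: dependence on a narrow set of hardware platforms, and data recipes that are not made public. Its training framework, MindSpeed-MLLM, supports stable, high-performance training of large Dense and Mixture-of-Experts (MoE) models on Ascend hardware. The paper also gives a systematic, public account of how the training data is produced and how it is mixed. MindVL is a data-efficient MLLM trained end-to-end on Ascend NPUs. Its performance is further improved by two techniques: averaging the weights of checkpoints trained with different sequence lengths, and a test-time resolution search. MindVL-8B matches Qwen2.5VL-7B while using 10% of Qwen2.5VL-7B's training data, and the MoE model MindVL-671B-A37B performs comparably to Qwen2.5VL-72B with 3% of its data.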
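The overview states that MindVL averages the weights of checkpoints trained with different sequence lengths, but it does not give the exact recipe. The following is a minimal sketch that assumes a plain element-wise (optionally weighted) average of matching parameters, in the style of "model soups". Checkpoints are represented as hypothetical dicts of flat float lists rather than real framework tensors. The names `average_checkpoints`, `ckpt_4k`, and `ckpt_8k` are illustrative only and do not come from the paper.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical checkpoint format: parameter name -> flattened weights.
StateDict = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[StateDict],
                        weights: Optional[Sequence[float]] = None) -> StateDict:
    """Element-wise (optionally weighted) average of parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)  # uniform average by default
    if len(weights) != len(checkpoints):
        raise ValueError("need exactly one weight per checkpoint")
    total = float(sum(weights))

    # Every checkpoint must share the same parameter names.
    keys = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share parameter names")

    averaged: StateDict = {}
    for name in keys:
        length = len(checkpoints[0][name])
        if any(len(c[name]) != length for c in checkpoints):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Weighted mean of the i-th element of this parameter across checkpoints.
        averaged[name] = [
            sum(w * c[name][i] for w, c in zip(weights, checkpoints)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    ckpt_4k = {"proj.weight": [1.0, 2.0]}  # hypothetical: shorter-sequence run
    ckpt_8k = {"proj.weight": [3.0, 4.0]}  # hypothetical: longer-sequence run
    print(average_checkpoints([ckpt_4k, ckpt_8k]))  # {'proj.weight': [2.0, 3.0]}
```

Averaging only makes sense when the checkpoints share an architecture and, in practice, a common training lineage, so the function rejects mismatched parameter names or shapes instead of silently combining them.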

Implications and Limitations

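• Implications:
  ◦ Presents Ascend hardware as a viable alternative platform for MLLM training.
  ◦ Publishes an open data recipe, which improves the reproducibility and openness of research.
  ◦ Demonstrates effective performance-boosting techniques, namely checkpoint weight averaging and test-time resolution search.
  ◦ Reaches competitive performance with much less data through data-efficient training.
• Limitations:
  ◦ The paper may not give enough detail on dataset scale or model architecture.
  ◦ Comparisons with other state-of-the-art models and coverage of broad benchmarks may be insufficient.
  ◦ Because the training framework is built specifically for Ascend NPUs, its generality to other hardware may be limited.
  ◦ Analysis of practical applicability and of use cases on real-world problems may be lacking.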
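The summary names "test-time resolution search" but does not explain how it works. This sketch assumes one plausible reading: several input-resolution budgets (maximum pixels per image) are tried on held-out data at inference time, and the best-scoring budget is kept. Several elements are hypothetical and are not MindVL's actual procedure or API: `fit_to_pixel_budget`, `search_resolution`, the 28-pixel patch multiple, and the `evaluate` callback. The scorer in the usage example is a toy stand-in, not a benchmark.

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def fit_to_pixel_budget(height: int, width: int, max_pixels: int,
                        patch: int = 28) -> Tuple[int, int]:
    """Shrink (height, width) toward max_pixels while keeping the aspect ratio.

    Both sides are rounded down to a multiple of `patch` (a hypothetical
    vision-encoder patch size) and never go below one patch.
    """
    scale = min(1.0, math.sqrt(max_pixels / (height * width)))  # never upscale
    h = max(patch, int(height * scale) // patch * patch)
    w = max(patch, int(width * scale) // patch * patch)
    return h, w


def search_resolution(budgets: Sequence[int],
                      evaluate: Callable[[int], float]) -> Tuple[int, Dict[int, float]]:
    """Score each candidate pixel budget on held-out data and return the best one."""
    scores = {budget: evaluate(budget) for budget in budgets}
    best = max(scores, key=lambda budget: scores[budget])  # ties keep the first budget
    return best, scores


if __name__ == "__main__":
    print(fit_to_pixel_budget(1000, 2000, 500_000))  # (476, 980)

    # Toy stand-in scorer: in practice this would run the model on a validation set.
    def toy_evaluate(budget: int) -> float:
        return -abs(budget - 800_000)

    best, _ = search_resolution([200_704, 802_816, 3_211_264], toy_evaluate)
    print(best)  # 802816
```

The search here is an exhaustive sweep over a small candidate list. Each candidate costs a full validation pass, so the candidate list is kept short.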
