Daily Arxiv

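This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.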

Probing Experts' Perspectives on AI-Assisted Public Speaking Training

Learning Pole Structures of Hadronic States using Predictive Uncertainty Estimation

Neural Concept Verifier: Scaling Prover-Verifier Games via Concept Encodings

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision

KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study

EmissionNet: Air Quality Pollution Forecasting for Agriculture

The role of gain neuromodulation in layer-5 pyramidal neurons

USAD: End-to-End Human Activity Recognition via Diffusion Model with Spatiotemporal Attention

Hita: Holistic Tokenizer for Autoregressive Image Generation

Distributional Soft Actor-Critic with Diffusion Policy

Lighting the Night with Generative Artificial Intelligence

On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

Upgrade or Switch: Do We Need a Next-Gen Trusted Architecture for the Internet of AI Agents?

Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs

Grokking Beyond the Euclidean Norm of Model Parameters

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

Deep Learning-Based Forecasting of Boarding Patient Counts to Address ED Overcrowding

An Exploration of Default Images in Text-to-Image Generation

Automatic Curriculum Learning for Driving Scenarios: Towards Robust and Efficient Reinforcement Learning

TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility

Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving

Balancing Progress and Safety: A Novel Risk-Aware Objective for RL in Autonomous Driving

TS-SNN: Temporal Shift Module for Spiking Neural Networks

Red Teaming Large Language Models for Healthcare

One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning

AI Safety Should Prioritize the Future of Work

MGT: Extending Virtual Try-Off to Multi-Garment Scenarios

MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs

Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval

Using AI to Summarize US Presidential Campaign TV Advertisement Videos, 1952-2012

REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Field Matching: an Electrostatic Paradigm to Generate and Transfer Data

SP$^2$T: Sparse Proxy Attention for Dual-stream Point Transformer

PIAD-SRNN: Physics-Informed Adaptive Decomposition in State-Space RNN

FonTS: Text Rendering with Typography and Style Controls

On the Principles of ReLU Networks with One Hidden Layer

Compositional Risk Minimization

End-to-end multi-channel speaker extraction and binaural speech synthesis

Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation

Downscaling Extreme Precipitation with Wasserstein Regularized Diffusion

Quantifying Context Bias in Domain Adaptation for Object Detection

An Empirical Study of Validating Synthetic Data for Formula Generation

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

GoalNet: Goal Areas Oriented Pedestrian Trajectory Prediction

Large Language Models in Mental Health Care: a Scoping Review

SDR-GAIN: A High Real-Time Occluded Pedestrian Pose Completion Method for Autonomous Driving

Temporal Motifs for Financial Networks: A Study on Mercari, JPMC, and Venmo Platforms

Measuring AI Alignment with Human Flourishing

StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Discovering Algorithms with Computational Language Processing

A Hybrid SMT-NRA Solver: Integrating 2D Cell-Jump-Based Local Search, MCSAT and OpenCAD

A taxonomy of epistemic injustice in the context of AI and the case for generative hermeneutical erasure

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure

Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Interpreting systems as solving POMDPs: a step towards a formal understanding of agency

Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

KV Cache Steering for Inducing Reasoning in Small Language Models

Optimistic Exploration for Risk-Averse Constrained Reinforcement Learning

On Barriers to Archival Audio Processing

A Hybrid Multi-Well Hopfield-CNN with Feature Extraction and K-Means for MNIST Classification

Compress Any Segment Anything Model (SAM)

Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection

Adaptive Nonlinear Vector Autoregression: Robust Forecasting for Noisy Chaotic Time Series

Catastrophic Forgetting Mitigation Through Plateau Phase Activity Profiling

Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning

Monitoring Risks in Test-Time Adaptation

Multilingual Multimodal Software Developer for Code Generation

KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

ONION: A Multi-Layered Framework for Participatory ER Design

A Personalised Formal Verification Framework for Monitoring Activities of Daily Living of Older Adults Living Independently in Their Homes

MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

Safe Deep Reinforcement Learning for Resource Allocation with Peak Age of Information Violation Guarantees

DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images

Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)

Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth Certificates

Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance

A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Towards Collaborative Fairness in Federated Learning Under Imbalanced Covariate Shift

Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy

To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features

White-Basilisk: A Hybrid Model for Code Vulnerability Detection

MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts

Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Created by

Haebom

Authors

Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

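Overview

To overcome the limitations of existing autoregressive video generation models (divergence from standard LLM architectures, reliance on bulky external text encoders, and excessive latency), this paper presents Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. Lumos-1 confirms the effectiveness of 3D RoPE, diagnoses its imbalanced frequency spectrum ranges, and proposes an improved scheme, MM-RoPE. MM-RoPE preserves the original textual RoPE while providing a comprehensive frequency spectrum and scaled 3D positions for modeling multimodal spatiotemporal data. Lumos-1 also uses a token dependency strategy that follows intra-frame bidirectionality and inter-frame temporal causality, and it proposes Autoregressive Discrete Diffusion Forcing (AR-DF) to address the frame-wise loss imbalance caused by spatial information redundancy. AR-DF introduces temporal tube masking during training and uses a compatible inference-time masking policy to avoid quality degradation. With memory-efficient training techniques, Lumos-1 was pre-trained on only 48 GPUs and achieves performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.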
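The token dependency strategy in the overview (bidirectional within a frame, causal across frames) can be pictured as an attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `frame_causal_mask` is hypothetical, text tokens are ignored, and every frame is assumed to contribute the same number of tokens.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a boolean attention mask (hypothetical helper, for illustration).

    mask[q][k] is True when query token q may attend to key token k.
    Tokens in the same frame see each other (intra-frame bidirectionality);
    a token also sees every earlier frame but no later frame
    (inter-frame temporal causality).
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        # A key is visible if its frame is the query's frame or an earlier one.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two frames with two tokens each: frame 0 cannot see frame 1.
    for row in frame_causal_mask(num_frames=2, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
    # prints:
    # 1100
    # 1100
    # 1111
    # 1111
```

The result is a block lower-triangular pattern: a plain causal mask would be triangular at the token level, whereas this one is triangular at the frame level and fully open inside each frame.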

GitHub - alibaba-damo-academy/Lumos: Lumos Project: Frontier generative model research by Alibaba DAMO Academy, including Lumos-1, etc.

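Implications and Limitations

• Implications:
  ◦ Presents an efficient, well-performing autoregressive video generation model built on the LLM architecture.
  ◦ Models spatiotemporal correlations effectively through MM-RoPE, which addresses the limitations of 3D RoPE, and AR-DF.
  ◦ Reaches strong performance with a limited GPU budget (48 GPUs for pre-training), which improves practicality.
  ◦ Released code and models support reproducibility and extension.
• Limitations:
  ◦ The size of the performance gap between Lumos-1 and other state-of-the-art models is not clearly stated.
  ◦ Further evaluation on a wider range of video datasets is needed.
  ◦ A more detailed analysis of the effect of AR-DF is needed.
  ◦ The explanation of how MM-RoPE adjusts the frequency spectrum may lack detail.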
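To make the temporal tube masking mentioned for AR-DF more concrete, here is a minimal sketch of the general idea: sample one spatial mask and repeat it across all frames, so a masked position cannot be trivially recovered from the same position in a neighboring frame. This is an illustration under assumptions, not the procedure from the paper: the function name `temporal_tube_mask`, the `mask_ratio` parameter, and the choice to mask every frame identically are all hypothetical.

```python
import random


def temporal_tube_mask(num_frames, height, width, mask_ratio, seed=None):
    """Return mask[t][r][c], True where a token is masked (hypothetical helper).

    One spatial pattern is sampled and copied to every frame, so each masked
    position forms a "tube" along the time axis.
    """
    rng = random.Random(seed)
    num_positions = height * width
    num_masked = int(mask_ratio * num_positions)  # assumed: fixed masked count
    masked_positions = set(rng.sample(range(num_positions), num_masked))

    # Spatial mask for a single frame.
    frame_mask = [
        [(r * width + c) in masked_positions for c in range(width)]
        for r in range(height)
    ]
    # Repeat the same spatial mask for every frame (independent copies).
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]


if __name__ == "__main__":
    mask = temporal_tube_mask(num_frames=3, height=2, width=4, mask_ratio=0.5, seed=0)
    for t, frame in enumerate(mask):
        print(f"frame {t}:", ["".join("X" if m else "." for m in row) for row in frame])
```

Every frame prints the same pattern, which is the defining property of a tube mask. How the paper chooses the ratio, treats the first frame, or masks at inference time is not covered by this summary.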
