Daily Arxiv

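This page collects artificial intelligence papers published around the world.
Summaries on this page are produced with Google Gemini, and the page is run on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply credit the source.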

CTA: Cross-Task Alignment for Better Test Time Training

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Domain Generalizable Portrait Style Transfer

StreamDiT: Real-Time Streaming Text-to-Video Generation

From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Neural-Network solver of ideal MHD equilibria

RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Hita: Holistic Tokenizer for Autoregressive Image Generation

Empirical Analysis Of Heuristic and Approximation Algorithms for the The Mutual-Visibility Problem

Horus: A Protocol for Trustless Delegation Under Uncertainty

Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding

SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

WATS: Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling

IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Enhancing Generalization of Spiking Neural Networks Through Temporal Regularization

Instruction Following by Boosting Attention of Large Language Models

Evaluating Logit-Based GOP Scores for Mispronunciation Detection

LLMs on support of privacy and security of mobile apps: state of the art and research directions

On the Fundamental Impossibility of Hallucination Control in Large Language Models

Integrating Spatiotemporal Features in LSTM for Spatially Informed COVID-19 Hospitalization Forecasting

cuVSLAM: CUDA accelerated visual odometry and mapping

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge

An empirical study of task and feature correlations in the reuse of pre-trained models

EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Towards General Continuous Memory for Vision-Language Models

Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)

Bayesian Hierarchical Invariant Prediction

Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

The GenAI Generation: Student Views of Awareness, Preparedness, and Concern

Variational OOD State Correction for Offline Reinforcement Learning

Heat Diffusion Models -- Interpixel Attention Mechanism

NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Offline Learning and Forgetting for Reasoning with Large Language Models

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

PVChat: Personalized Video Chat with One-Shot Learning

Challenges and Trends in Egocentric Vision: A Survey

Eyes on the Environment: AI-Driven Analysis for Fire and Smoke Classification, Segmentation, and Detection

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

A Survey on Transformer Context Extension: Approaches and Evaluation

Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government

A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Zero-shot Medical Event Prediction Using a Generative Pre-trained Transformer on Electronic Health Records

GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification

Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

RSPO: Regularized Self-Play Alignment of Large Language Models

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

Efficient Risk-sensitive Planning via Entropic Risk Measures

Bayesian Optimization for Controlled Image Editing via LLMs

Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment

Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions

A Theory for Conditional Generative Modeling on Multiple Data Sources

Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport

Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics

DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution

Aria-UI: Visual Grounding for GUI Instructions

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Random Walks with Tweedie: A Unified View of Score-Based Diffusion Models

Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning

Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model

An AI Theory of Mind Will Enhance Our Collective Intelligence

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Longitudinal Ensemble Integration for sequential classification with multimodal data

Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

The Nexus of AR/VR, AI, UI/UX, and Robotics Technologies in Enhancing Learning and Social Interaction for Children with Autism Spectrum Disorders: A Systematic Review

What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI

Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort

Empirical evidence of Large Language Model's influence on human spoken communication

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics

CoDy: Counterfactual Explainers for Dynamic Graphs

Optimal Transport for Domain Adaptation through Gaussian Mixture Models

Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs

Detecting value-expressive text posts in Russian social media

Deep neural networks have an inbuilt Occam's razor

TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

MedGemma Technical Report

Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift

Activation Steering for Chain-of-Thought Compression

MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

Created by Haebom

Authors

Fathinah Izzati, Xinyue Li, Yuxuan Wu, Gus Xia

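Overview

This paper presents a music scene imagination (MSI) model that mimics the human ability to imagine varied moods and settings while listening to music. Unlike existing music captioning models, which focus only on musical elements, the proposed model, MusiScene, generates captions that describe scenes suited to the music. To this end, the authors build a large-scale dataset of 3,371 video-audio caption pairs and fine-tune MU-LLaMA for the MSI task to obtain MusiScene. Experiments show that MusiScene generates more contextually relevant captions than MU-LLaMA, and that the generated MSI captions can be used to improve text-based video background music generation (VBMG).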
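The data flow described above (caption pairs, a train/validation split for fine-tuning, then captions reused as VBMG text prompts) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the `MsiPair` record, its field names, the file names, the 10% validation fraction, and the prompt wording are all assumptions made for this example; only the count of 3,371 pairs comes from the summary.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class MsiPair:
    """One hypothetical training record: a music clip and its scene caption."""
    audio_path: str     # audio track taken from a video (assumed field name)
    scene_caption: str  # text describing the scene the music evokes


def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle deterministically, then split into (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def to_vbmg_prompt(scene_caption: str) -> str:
    """Wrap a scene caption as a text prompt for a text-to-music generator."""
    return f"Background music for a video showing: {scene_caption.strip()}"


# Placeholder records standing in for the 3,371 pairs mentioned in the summary.
pairs = [MsiPair(f"clip_{i:04d}.wav", f"placeholder scene {i}") for i in range(3371)]
train, val = split_pairs(pairs)
```

In a real pipeline the training split would feed the fine-tuning step and `to_vbmg_prompt` would receive captions produced by the fine-tuned model; both of those steps are outside this sketch.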

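Implications and Limitations

• Implications:
◦ Presents a new music model (MusiScene) that captures the interaction between music and visual information.
◦ Lays a foundation for MSI research by building a large-scale video-audio caption dataset.
◦ MusiScene generates music captions that fit the context better than the existing model.
◦ Suggests that the generated captions can be used to improve VBMG performance.

• Limitations:
◦ The dataset needs to be expanded further (3,371 pairs may be a relatively small amount).
◦ The model's generalization ability requires additional validation.
◦ Whether diverse music genres and styles were all sufficiently covered needs to be examined.
◦ A more in-depth analysis of the qualitative evaluation of MSI captions is needed.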
