Daily Arxiv

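This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.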

MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners

Benchmarking the Pedagogical Knowledge of Large Language Models

ReDit: Reward Dithering for Improved LLM Policy Optimization

Multimodal Fusion SLAM with Fourier Attention

Understanding Reasoning in Thinking Language Models via Steering Vectors

KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation

Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems

AI-based Multimodal Biometrics for Detecting Smartphone Distractions: Application to Online Learning

PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning

Long-Context Generalization with Sparse Attention

SyncMapV2: Robust and Adaptive Unsupervised Segmentation

TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Human-Centered Editable Speech-to-Sign-Language Generation via Streaming Conformer-Transformer and Resampling Hook

AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Understanding Human-AI Trust in Education

cuVSLAM: CUDA accelerated visual odometry and mapping

TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning

SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification

Disentangling Reasoning and Knowledge in Medical Large Language Models

Process Reward Models That Think

AI-Assisted Transport of Radioactive Ion Beams

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Defeating Prompt Injections by Design

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

AI-Facilitated Episodic Future Thinking For Adults with Obesity

AI-Enhanced Deliberative Democracy and the Future of the Collective Will

Robust Optimization with Diffusion Models for Green Security

VesselSAM: Leveraging SAM for Aortic Vessel Segmentation with AtrousLoRA

SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation

Language Model Re-rankers are Fooled by Lexical Similarities

Evaluating link prediction: New perspectives and recommendations

"I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts?

Towards Unsupervised Multi-Agent Reinforcement Learning via Task-Agnostic Exploration

Controllable Video Generation with Provable Disentanglement

Towards Robust Stability Prediction in Smart Grids: GAN-based Approach under Data Constraints and Adversarial Challenges

Exploring the Collaborative Co-Creation Process with AI: A Case Study in Novice Music Production

Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

LAuReL: Learned Augmented Residual Layer

Do Vendi Scores Converge with Finite Samples? Truncated Vendi Score for Finite-Sample Convergence Guarantees

Multi-Continental Healthcare Modelling Using Blockchain-Enabled Federated Learning

Rational Metareasoning for Large Language Models

MOST: MR reconstruction Optimization for multiple downStream Tasks via continual learning

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks

Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges

Rich Interoperable Metadata for Cultural Heritage Projects at Jagiellonian University

Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated

Exclusive Style Removal for Cross Domain Novel Class Discovery

ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis

Are We There Yet? A Brief Survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges

Unified Neural Backdoor Removal with Only Few Clean Samples through Unlearning and Relearning

A Certified Proof Checker for Deep Neural Network Verification in Imandra

ECG-SMART-NET: A Deep Learning Architecture for Precise ECG Diagnosis of Occlusion Myocardial Infarction

ContactDexNet: Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping

The Elements of Differentiable Programming

Align and Distill: Unifying and Improving Domain Adaptive Object Detection

Interrogating AI: Characterizing Emergent Playful Interactions with ChatGPT

Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages

DF2: Distribution-Free Decision-Focused Learning

jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval

ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation

HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges

RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning

MCP-Zero: Active Tool Discovery for Autonomous LLM Agents

What do professional software developers need to know to succeed in an age of Artificial Intelligence?

Emergent Risk Awareness in Rational Agents under Resource Constraints

Smart Traffic Signals: Comparing MARL and Fixed-Time Strategies

TRAIL: Trace Reasoning and Agentic Issue Localization

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Large language models for automated scholarly paper review: A survey

ChatSR: Multimodal Large Language Models for Scientific Formula Discovery

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Orthogonal Finetuning Made Scalable

Improving Progressive Generation with Decomposable Flow Matching

A standard transformer and attention with linear biases for molecular conformer generation

Persona Features Control Emergent Misalignment

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Alleviating User-Sensitive bias with Fair Generative Sequential Recommendation Model

Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Cross-regularization: Adaptive Model Complexity through Validation Gradients

Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis

NeRF-based CBCT Reconstruction needs Normalization and Initialization

Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units

Geometric-Aware Variational Inference: Robust and Adaptive Regularization with Directional Weight Uncertainty

Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

When Can We Reuse a Calibration Set for Multiple Conformal Predictions?

Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance

Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Human-Centered Editable Speech-to-Sign-Language Generation via Streaming Conformer-Transformer and Resampling Hook

Created by

Haebom

Author

Yingchao Li

Overview

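This paper proposes a human-centered, real-time speech-to-sign-language animation framework to overcome the limitations of existing end-to-end sign language animation systems: poor naturalness, limited facial and body expression, and no user control. The framework consists of (1) a streaming Conformer encoder and an autoregressive Transformer-MDN decoder that generate synchronized upper-body and facial motion, (2) a transparent, editable JSON intermediate representation that lets deaf users and experts inspect and modify each sign segment, and (3) a human-in-the-loop optimization loop that refines the model based on user edits and ratings. Deployed in Unity3D, the system achieves an average per-frame inference time of 13 ms and an end-to-end latency of 103 ms on an RTX 4070. The key contributions are the design of a JSON-centric editing mechanism for fine-grained, sign-level personalization and the first application of an MDN-based feedback loop for continuous model adaptation. In a study with 20 deaf signers and 5 professional interpreters, the authors observed a 13-point improvement in SUS score over the baseline, a 6.7-point reduction in cognitive load, and significant gains in naturalness and trust (p<.001). The work establishes a scalable, explainable AI paradigm for accessible sign language technology.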
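To make the editable JSON intermediate representation more concrete, here is a minimal sketch of how a single sign segment might be inspected and modified while keeping a record of the edit for later feedback. The summary does not give the paper's actual JSON schema, so every field name (`gloss`, `start_ms`, `end_ms`, `face`) and the helper `edit_sign` are hypothetical and serve only as illustration.

```python
import copy
import json

# Hypothetical schema: the summary does not describe the real JSON format,
# so every field name below is an assumption for illustration only.
SAMPLE = json.loads("""
{
  "utterance": "nice to meet you",
  "signs": [
    {"gloss": "NICE", "start_ms": 0, "end_ms": 400, "face": "neutral"},
    {"gloss": "MEET", "start_ms": 400, "end_ms": 900, "face": "neutral"},
    {"gloss": "YOU", "start_ms": 900, "end_ms": 1200, "face": "neutral"}
  ]
}
""")


def edit_sign(doc, index, **changes):
    """Return a copy of doc with one sign segment updated, plus an edit record."""
    if not 0 <= index < len(doc["signs"]):
        raise IndexError(f"no sign segment at index {index}")
    updated = copy.deepcopy(doc)  # leave the original document untouched
    segment = updated["signs"][index]
    unknown = set(changes) - set(segment)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    before = {key: segment[key] for key in changes}
    segment.update(changes)
    # The record could feed a human-in-the-loop update step later on.
    record = {"index": index, "before": before, "after": dict(changes)}
    return updated, record


edited, record = edit_sign(SAMPLE, 1, face="smile", end_ms=950)
print(json.dumps(record, indent=2))
```

Returning a copy together with a before/after record is one possible design: it keeps the model's original output available for comparison, which is the kind of signal a feedback loop based on user edits would need.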

Implications and Limitations

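• Implications:
  ◦ Presents an efficient framework for generating natural sign language animation in real time.
  ◦ Realizes a personalized, explainable AI system through a JSON-based editing mechanism.
  ◦ Uses an MDN-based feedback loop for continuous model improvement and user participation.
  ◦ Improves communication accessibility for deaf users and reduces cognitive load.
  ◦ Delivers fast processing (13 ms per-frame inference time, 103 ms end-to-end latency).

• Limitations:
  ◦ The current system focuses on upper-body and facial motion; lower-body motion is not considered.
  ◦ Further research is needed on how well it supports different sign languages and signing styles.
  ◦ Training on larger datasets is needed to improve generalization.
  ◦ The JSON editing mechanism needs better usability and a more intuitive interface.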
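For readers unfamiliar with the MDN (mixture density network) mentioned above: an MDN head predicts the parameters of a mixture distribution instead of a single value, and is typically trained by minimizing the negative log-likelihood of the observed target. The sketch below computes that quantity for a one-dimensional Gaussian mixture. It is a generic illustration, not the paper's loss; the function name `mdn_nll` and the scalar setting are assumptions, and the summary does not say how the feedback loop uses the mixture.

```python
import math


def mdn_nll(weights, means, stds, target):
    """Negative log-likelihood of target under a 1-D Gaussian mixture.

    Assumes strictly positive weights that sum to 1 and positive stds.
    """
    if not math.isclose(sum(weights), 1.0, rel_tol=1e-6):
        raise ValueError("mixture weights must sum to 1")
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        # Log density of a Gaussian with mean mu and std sigma at target.
        log_pdf = (-0.5 * ((target - mu) / sigma) ** 2
                   - math.log(sigma)
                   - 0.5 * math.log(2 * math.pi))
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp keeps the computation stable for very small densities.
    peak = max(log_terms)
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))


# Two candidate values for one joint angle (hypothetical numbers).
loss = mdn_nll([0.7, 0.3], [0.0, 1.5], [0.2, 0.4], target=0.1)
print(loss)
```

Because the output is a distribution rather than a point estimate, a target that falls near any one of the mixture components receives a low loss, which is why MDNs are a common choice when several motions are equally valid.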
