Daily Arxiv
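This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.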
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system
When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges
TempOpt - Unsupervised Alarm Relation Learning for Telecommunication Networks
A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models
Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization
Yan: Foundational Interactive Video Generation
MLLM-CBench:A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis
VGGSounder: Audio-Visual Evaluations for Foundation Models
Capabilities of GPT-5 on Multimodal Medical Reasoning
C-MAG: Cascade Multimodal Attributed Graphs for Supply Chain Link Prediction
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer
FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens
Generalizing Scaling Laws for Dense and Sparse Large Language Models
Memp: Exploring Agent Procedural Memory
InfoCausalQA:Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
Request-Only Optimization for Recommendation Systems
Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport
GTPO: Trajectory-Based Policy Optimization in Large Language Models
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection
DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
A multi-strategy improved snake optimizer for 3-dimensional UAV path planning and engineering problems
Fragment size density estimator for shrinkage-induced fracture based on a physics-informed neural network
GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks
Audio-3DVG:Unified Audio - Point Cloud Fusion for 3D Visual Grounding
Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents
OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
The Importance of Being Lazy: Scaling Limits of Continual Learning
Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models
HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
Open-Set LiDAR Panoptic Segmentation Guided by Uncertainty-Aware Learning
Poison Once, Control Anywhere: Clean-Text Visual Backdoors in VLM-based Mobile Agents
MGDFIS: Multi-scale Global-detail Feature Integration Strategy for Small Object Detection
Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems
ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques
Exploring Scaling Laws for EHR Foundation Models
MapStory: Prototyping Editable Map Animations with LLM Agents
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
Halting Recurrent GNNs and the Graded $\mu$-Calculus
Deep Learning Warm Starts for Trajectory Optimization on the International Space Station
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
FedRecon: Missing Modality Reconstruction in Heterogeneous Distributed Environments
AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes
Mosaic: Composite Projection Pruning for Resource-efficient LLMs
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention
The Illusory Normativity of Rights-Based AI Regulation
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Simulating the Real World: A Unified Survey of Multimodal Generative Models
RIZE: Regularized Imitation Learning via Distributional Reinforcement Learning
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
EvoP: Robust LLM Inference via Evolutionary Pruning
Conformal Prediction of Classifiers with Many Classes based on Noisy Labels
Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
GenAI Confessions: Black-box Membership Inference for Generative Image Models
Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
Evaluation of Bio-Inspired Models under Different Learning Settings For Energy Efficiency in Network Traffic Prediction
SLTNet: Efficient Event-based Semantic Segmentation with Spike-driven Lightweight Transformer-based Networks
Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance
Learning Characteristics of Reverse Quaternion Neural Network
Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation
Retrieval-Augmented Decision Transformer: External Memory for In-context RL
Downscaling Extreme Precipitation with Wasserstein Regularized Diffusion
Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience
Return Prediction for Mean-Variance Portfolio Selection: How Decision-Focused Learning Shapes Forecasting Models
Pediatric brain tumor classification using digital histopathology and deep learning: evaluation of SOTA methods on a multi-center Swedish cohort
CTRQNets & LQNets: Continuous Time Recurrent and Liquid Quantum Neural Networks
Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
SpectralEarth: Training Hyperspectral Foundation Models at Scale
Towards flexible perception with visual memory
Integrating Clinical Knowledge Graphs and Gradient-Based Neural Systems for Enhanced Melanoma Diagnosis via the 7-Point Checklist
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
Towards Black-Box Membership Inference Attack for Diffusion Models
Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs
From Model Performance to Claim: How a Change of Focus in Machine Learning Replicability Can Help Bridge the Responsibility Gap
Learning to Defer in Congested Systems: The AI-Human Interplay
LEAVES: Learning Views for Time-Series Biobehavioral Data in Contrastive Learning
Game-Theoretic Multiagent Reinforcement Learning
SMA:Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling
Aryabhata: An exam-focused language model for JEE Math
Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach
Large Language Models Do Not Simulate Human Psychology
LLM Robustness Leaderboard v1 --Technical report
One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
StepFun-Prover Preview: Let's Think and Verify Step by Step
MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines
EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Created by
Haebom
Authors
Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
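Abstract
This paper proposes EmoVoice, a new TTS model with controllable emotional expression. EmoVoice uses a large language model (LLM) to enable freestyle, fine-grained emotion control through natural language. Inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques, it also introduces a phoneme boost variant design that outputs phoneme tokens and audio tokens in parallel to improve content consistency. The paper also introduces EmoVoice-DB, a high-quality 40-hour English emotion dataset containing expressive speech with fine-grained emotion labels and natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using in-house data. In addition, the paper examines the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and uses the state-of-the-art multimodal LLMs GPT-4o-audio and Gemini to evaluate emotional speech. The dataset, code, checkpoints, and demo samples are available on GitHub.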
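Takeaways, Limitations
• Takeaways:
◦ Proposes EmoVoice, a TTS model that uses an LLM to enable freestyle, fine-grained emotion control through natural language.
◦ Improves content consistency through the phoneme boost variant design.
◦ Releases EmoVoice-DB, a high-quality English emotion dataset.
◦ Achieves state-of-the-art performance on the English EmoVoice-DB test set using synthetic training data alone.
◦ Studies the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences.
◦ Evaluates emotional speech with state-of-the-art multimodal LLMs.
◦ Supports reproducibility by releasing the code, dataset, checkpoints, and demo samples.
• Limitations:
◦ EmoVoice-DB is English-centric, so generalization to other languages may be limited.
◦ Because the model was trained only on synthetic data, a comparison against training on real speech data is needed.
◦ Further research is needed on the limitations of existing emotion evaluation metrics, along with more refined evaluation methodology.
◦ The reliability of evaluation results from LLMs such as GPT-4o-audio and Gemini needs to be verified.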
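The summary above mentions checking how well emotion evaluation metrics align with human perceptual preferences, but it does not say how that agreement was measured. One common way to quantify such agreement is Spearman rank correlation between per-sample metric scores and human ratings. The sketch below is an illustration under that assumption, not the paper's procedure; the function names `ranks` and `spearman` and all numbers are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values invented for illustration; not results from the paper.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # automatic emotion metric
human_ratings = [3.1, 3.8, 2.9, 4.5, 3.0]       # mean listener rating per sample

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 1 would indicate that the metric orders samples much as listeners do, while a value near 0 would suggest the metric is a poor proxy for human preference. The same check could be applied to scores produced by an LLM judge.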