Daily Arxiv

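This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.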

CoRT: Code-integrated Reasoning within Thinking

TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Policy-Based Trajectory Clustering in Offline Reinforcement Learning

Understanding Human-AI Trust in Education

ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Robotic Policy Learning via Human-assisted Action Preference Optimization

LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models

QuantMCP: Grounding Large Language Models in Verifiable Financial Reality

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Sample Complexity and Representation Ability of Test-time Scaling Paradigms

Context Is Not Comprehension

High Performance Space Debris Tracking in Complex Skylight Backgrounds with a Large-Scale Dataset

SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Subgraph Gaussian Embedding Contrast for Self-Supervised Graph Representation Learning

Quantum AIXI: Universal Intelligence via Quantum Information

VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

QuXAI: Explainers for Hybrid Quantum Machine Learning Models

Convert Language Model into a Value-based Strategic Planner

PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

Token-Efficient RL for LLM Reasoning

MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark

Elucidating the Design Space of Multimodal Protein Language Models

A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

On the Geometry of Receiver Operating Characteristic and Precision-Recall Curves

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Content ARCs: Decentralized Content Rights in the Age of Generative AI

PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

Computation Mechanism Behind LLM Position Generalization

CompMarkGS: Robust Watermarking for Compressed 3D Gaussian Splatting

Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations

An energy-efficient learning solution for the Agile Earth Observation Satellite Scheduling Problem

Generative Uncertainty in Diffusion Models

EgoNormia: Benchmarking Physical Social Norm Understanding

Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

From Features to Graphs: Exploring Graph Structures and Pairwise Interactions via GNNs

Object-Centric Latent Action Learning

Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation

TransMLA: Multi-Head Latent Attention Is All You Need

Implicit Language Models are RNNs: Balancing Parallelization and Expressivity

Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty

Prompt-based Depth Pruning of Large Language Models

Great Models Think Alike and this Undermines AI Oversight

Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting

Latent Action Learning Requires Supervision in the Presence of Distractors

SR-Reward: Taking The Path More Traveled

Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs

SoK: Watermarking for AI-Generated Content

Engagement-Driven Content Generation with Large Language Models

PyGen: A Collaborative Human-AI Approach to Python Package Creation

DAWN: Designing Distributed Agents in a Worldwide Network

Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Center-fixing of tropical cyclones using uncertainty-aware deep learning applied to high-temporal-resolution geostationary satellite imagery

LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

Deploying Open-Source Large Language Models: A performance Analysis

Neural Networks Generalize on Low Complexity Data

M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Predictive Embedding Architecture

Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs

The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation

TimeBridge: Better Diffusion Prior Design with Bridge Models for Time Series Generation

Multi-group Uncertainty Quantification for Long-form Text Generation

Hey, That's My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique

Privacy-Aware Spectrum Pricing and Power Control Optimization for LEO Satellite Internet-of-Things

IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Incentivizing Quality Text Generation via Statistical Contracts

Visually Descriptive Language Model for Vector Graphics Reasoning

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices

Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

IoTGeM: Generalizable Models for Behaviour-Based IoT Attack Detection

Improved Algorithm for Deep Active Learning under Imbalance via Optimal Separation

ConvD: Attention Enhanced Dynamic Convolutional Embeddings for Knowledge Graph Completion

Noise Balance and Stationary Distribution of Stochastic Gradient Descent

The Packing Chromatic Number of the Infinite Square Grid is 15

Reinforcing Multimodal Understanding and Generation with Dual Self-rewards

A Proposal to Extend the Common Model of Cognition with Metacognition

The Optimization Paradox in Clinical AI Multi-Agent Systems

CHANCERY: Evaluating Corporate Governance Reasoning Capabilities in Language Models

DeePoly: A High-Order Accuracy Scientific Machine Learning Framework for Function Approximation and Solving PDE

Beamforming and Resource Allocation for Delay Optimization in RIS-Assisted OFDM Systems

Evaluation of LLMs for mathematical problem solving

The Automated but Risky Game: Modeling Agent-to-Agent Negotiations and Transactions in Consumer Markets

A Heuristic Algorithm Based on Beam Search and Iterated Local Search for the Maritime Inventory Routing Problem

A Vision for Auto Research with LLM Agents

AssistanceZero: Scalably Solving Assistance Games

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG

Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

Training-Free Safe Denoisers for Safe Use of Diffusion Models

CollabLLM: From Passive Responders to Active Collaborators

Position: Theory of Mind Benchmarks are Broken for Large Language Models

Phonology-Guided Speech-to-Speech Translation for African Languages

Created by

Haebom

Authors

Peter Ochieng, Dennis Kaburu

Overview

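This paper presents a prosody-guided framework for speech-to-speech translation (S2ST). The framework uses cross-lingual pause synchrony to align and translate speech without transcripts. Analyzing a 6,000-hour East African news corpus spanning five languages, the authors show that language pairs within the same family have 30-40% lower pause variance and more than 3x higher onset/offset correlation than pairs from different families. These findings motivate SPaDA, a dynamic-programming alignment algorithm that integrates silence consistency, rate synchrony, and semantic similarity. SPaDA improves alignment F1 by 3-4 points and eliminates up to 38% of spurious matches relative to greedy VAD baselines. Using SPaDA-aligned segments, the authors train SegUniDiff, a diffusion-based S2ST model guided by external gradients from frozen semantic and speaker encoders. SegUniDiff matches an enhanced cascade model in BLEU (30.3 on CVSS-C versus 28.9 for UnitY), reduces speaker error rate (EER) from 12.5% to 5.3%, and runs at a real-time factor (RTF) of 1.02. To support evaluation in low-resource settings, the authors also release a three-tier, transcript-free BLEU evaluation suite (M1-M3) that correlates strongly with human judgments. Overall, the results show that prosodic cues in multilingual speech provide a reliable foundation for scalable, non-autoregressive S2ST.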
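The overview describes SPaDA only at a high level, as a dynamic-programming aligner that combines silence consistency, rate synchrony, and semantic similarity. The following is a minimal sketch of that general idea, not the authors' algorithm: the `Segment` fields, the ratio-based similarity, the equal weights, and the gap penalty are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    pause_before: float  # seconds of silence before the segment (assumed feature)
    duration: float      # seconds of speech, a rough proxy for speaking rate
    embedding: tuple     # hypothetical semantic vector from some encoder


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ratio_similarity(x, y):
    """1.0 when x == y, approaching 0.0 as the two values diverge."""
    larger = max(x, y)
    return min(x, y) / larger if larger > 0 else 1.0


def pair_score(s, t, w_sil=1.0, w_rate=1.0, w_sem=1.0):
    """Weighted sum of the three cues; the weights are illustrative only."""
    return (w_sil * ratio_similarity(s.pause_before, t.pause_before)
            + w_rate * ratio_similarity(s.duration, t.duration)
            + w_sem * cosine(s.embedding, t.embedding))


def align(src, tgt, gap_penalty=-0.5):
    """Monotonic alignment by dynamic programming; returns matched index pairs."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = i * gap_penalty, "skip_src"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = j * gap_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + pair_score(src[i - 1], tgt[j - 1]), "match"),
                (best[i - 1][j] + gap_penalty, "skip_src"),
                (best[i][j - 1] + gap_penalty, "skip_tgt"),
            ]
            best[i][j], back[i][j] = max(options)
    # Walk the back-pointers from the end to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy data: the short middle source segment has no counterpart in the target.
src = [Segment(0.5, 2.0, (1, 0)), Segment(0.1, 0.4, (0, 1)), Segment(0.6, 3.0, (1, 1))]
tgt = [Segment(0.5, 2.1, (1, 0)), Segment(0.6, 2.9, (1, 1))]
print(align(src, tgt))  # [(0, 0), (2, 1)]
```

The table costs O(n·m) time and memory for n source and m target segments. Unlike a greedy matcher, the gap penalty lets the aligner skip an unmatched segment instead of forcing a poor pairing.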

Implications and Limitations

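• Implications:

◦ Shows that prosodic information, pauses in particular, can improve alignment and translation in transcript-free speech-to-speech translation.

◦ With the SPaDA algorithm and the SegUniDiff model, achieves a higher BLEU score than UnitY on CVSS-C (30.3 versus 28.9) and lowers speaker error rate (EER) from 12.5% to 5.3%.

◦ Provides a transcript-free BLEU evaluation suite (M1-M3) for low-resource settings.

◦ Contributes to the efficiency and performance of non-autoregressive S2ST models.

• Limitations:

◦ Because the study relies on a 6,000-hour East African news corpus, further work is needed on generalization to other languages and domains.

◦ The evaluation suite is newly proposed, and comparison with other established evaluation metrics is lacking.

◦ Analysis of SPaDA's complexity and computational cost is lacking.

◦ More experiments across a wider range of language families are needed.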
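To check the pause-pattern findings on further language pairs, as the limitations above suggest, the two statistics named in the overview could be computed along the following lines. This is a sketch under stated assumptions: the summary does not define "pause variance" or "onset/offset correlation" precisely, so the variance of paired pause-duration differences and a Pearson correlation of pause onset times stand in here, and both function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def pause_difference_variance(pauses_a, pauses_b):
    """Variance of duration differences between paired pauses (assumed definition)."""
    diffs = [a - b for a, b in zip(pauses_a, pauses_b)]
    return pvariance(diffs)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy values in seconds, invented for illustration only.
onsets_lang_a = [0.0, 2.4, 5.1, 9.0]
onsets_lang_b = [0.1, 2.6, 5.0, 9.3]
pauses_lang_a = [0.40, 0.55, 0.30, 0.70]
pauses_lang_b = [0.45, 0.50, 0.35, 0.65]

print(pause_difference_variance(pauses_lang_a, pauses_lang_b))
print(pearson(onsets_lang_a, onsets_lang_b))
```

A lower variance and a higher correlation for one language pair than for another would point in the same direction as the within-family result reported in the overview.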
