Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents

One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study

A Convex Route to Thermomechanics: Learning Internal Energy and Dissipation

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Detection of Adversarial Attacks in Robotic Perception

RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

LG-HCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting

Building evidence-based knowledge graphs from full-text literature for disease-specific biomedical reasoning

JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding

ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control

Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs

Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off

Enes Causal Discovery

Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?

Robust Safety Monitoring of Language Models via Activation Watermarking

KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao

LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface

Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning

LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction

ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients

Inducing Sustained Creativity and Diversity in Large Language Models

Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

How do LLMs Compute Verbal Confidence

InCoder-32B: Code Foundation Model for Industrial Scenarios

100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

Real-Time Driver Safety Scoring Through Inverse Crash Probability Modeling

AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Not All News Is Equal: Topic- and Event-Conditional Sentiment from Finetuned LLMs for Aluminum Price Forecasting

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

Evidential Neural Radiance Fields

Mitigating “Epistemic Debt” in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts

DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation

Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

How to Train Your Long-Context Visual Document Model

When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

Semantic Labeling for Third-Party Cybersecurity Risk Assessment: A Semi-Supervised Approach to Intent-Aware Question Retrieval

$V_0$: A Generalist Value Model for Any Policy at State Zero

PAIR-Former: Budgeted Relational MIL for miRNA Target Prediction

Temporal Sepsis Modeling: a Relational and Explainable-by-Design Framework

Dynamic Cogeneration of Bug Reproduction Test in Agentic Program Repair

The Mouth is Not the Brain: Bridging Energy-Based World Models and Language Generation

Hellinger Multimodal Variational Autoencoders

Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

Provably Extracting the Features from a General Superposition

Stronger Normalization-Free Transformers

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos

EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response

ZeroFlood: Flood Hazard Mapping from Single-Modality SAR Using Geo-Foundation Models

Automated Algorithm Design for Auto-Tuning Optimizers

MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization

Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards

Local Causal Discovery for Statistically Efficient Causal Inference

ShishuLM: Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models

A Semi-amortized Lifted Learning-to-Optimize Masked (SALLO-M) Transformer Model for Scalable and Generalizable Beamforming

ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting

Past, Present, and Future of Bug Tracking in the Generative AI Era

TransFIRA: Transfer Learning for Face Image Recognizability Assessment

REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Learning Inter-Atomic Potentials without Explicit Equivariance

MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation

Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks

SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios

Incorporating LLM Embeddings for Variation Across the Human Genome

Generative AI on Wall Street -- Opportunities and Risk Controls

Improving Liver Disease Diagnosis with SNNDeep: A Custom Spiking Neural Network Using Diverse Learning Algorithms

Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning

TTA-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions

Generative Logic: A New Computer Architecture for Deterministic Reasoning and Knowledge Generation

QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Streaming 4D Visual Geometry Transformer

LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Denoising the Future: Top-p Distributions for Moving Through Time

FA-INR: Adaptive Implicit Neural Representations for Interpretable Exploration of Simulation Ensembles

AI-Generated Compromises for Coalition Formation

Balancing Efficiency and Empathy: Healthcare Providers' Perspectives on AI-Supported Workflows for Serious Illness Conversations in the Emergency Department

LLM-Meta-SR: In-Context Learning for Evolving Selection Operators in Symbolic Regression

ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

Toward Low-Latency End-to-End Voice Agents for Telecommunications Using Streaming ASR, Quantized LLMs, and Real-Time TTS

Created by

Haebom

Authors

Vignesh Ethiraj, Ashwath David, Sidhanth Menon, Divya Vijay

Overview

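This paper presents a low-latency AI voice agent pipeline for telecommunications. Designed for real-time, two-way communication, the pipeline makes advanced voice AI available for call center automation, intelligent IVR, and AI-based customer support. It is built by integrating four specialized models developed by NetoAI: TSLAM, a 4-bit quantized telecom-specific large language model; T-VEC, a telecom-specific embedding model; TTE, a telecom-specific automatic speech recognition model; and T-Synth, a telecom-specific speech synthesis model. Together they enable a highly responsive, domain-adapted voice AI agent that supports knowledge-grounded voice interaction at low latency. By combining streaming ASR (TTE), conversational intelligence (TSLAM), retrieval-augmented generation (RAG) over telecom documents, and real-time TTS (T-Synth), the work sets a new benchmark for telecom voice assistants. The system was evaluated on a dataset of 500 human-recorded telecom questions drawn from RFCs, with analysis of latency, domain relevance, and real-time performance. TSLAM, TTE, and T-Synth each achieved a real-time factor (RTF) below 1.0, which supports low-latency enterprise telecom deployments.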
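The real-time factor mentioned above is conventionally defined as processing time divided by the duration of the audio being processed, so a value below 1.0 means the component keeps up with live speech. The following is a minimal sketch of that arithmetic; the function names and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the usual RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF strictly below 1.0 means processing finishes before the audio would."""
    return rtf < 1.0


# Illustrative numbers only: 1.5 s of compute for a 6.0 s utterance.
example_rtf = real_time_factor(1.5, 6.0)
print(example_rtf, is_faster_than_real_time(example_rtf))
```

An RTF of exactly 1.0 is treated here as not faster than real time, since the component would only just keep pace with incoming audio.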

Implications and Limitations

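• Implications:

◦ Presents a high-performance AI voice agent pipeline for low-latency, real-time communication.

◦ Shows applicability across telecom use cases such as call center automation, intelligent IVR, and AI-based customer support.

◦ Achieves an efficient model implementation and low-latency performance through 4-bit quantization.

◦ Builds a complete system by integrating real-time ASR, conversational intelligence, RAG, and TTS.

◦ Evaluates performance objectively on a dataset of real telecom questions.

◦ Points to the feasibility of automated customer support and diagnostic systems built on next-generation telecom AI.

• Limitations:

◦ The evaluation dataset (500 questions) may be relatively small.

◦ The models are specialized for the telecom domain, so generalization to other domains needs further study.

◦ Long-term stability and scalability in real production environments still need to be verified.

◦ The T-VEC model is not described or evaluated in detail.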
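The integration of streaming ASR, retrieval, a language model, and TTS described in this summary can be pictured as a chain of stages with per-stage timing. The sketch below is only an illustration: the stage order is an assumption, and the four stage functions are hypothetical placeholders that tag their input rather than calling TTE, T-VEC, TSLAM, or T-Synth.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], payload: str) -> Tuple[str, Dict[str, float]]:
    """Run stages in order and record wall-clock seconds spent in each one."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


# Hypothetical placeholder stages (assumptions, not the paper's models).
def asr(audio: str) -> str:
    return f"transcript({audio})"


def retrieve(query: str) -> str:
    return f"context+{query}"


def generate(prompt: str) -> str:
    return f"answer({prompt})"


def synthesize(text: str) -> str:
    return f"audio({text})"


STAGES: List[Stage] = [
    ("asr", asr),
    ("rag", retrieve),
    ("llm", generate),
    ("tts", synthesize),
]

output, timings = run_pipeline(STAGES, "caller_audio")
print(output)
print(sum(timings.values()))  # total seconds across all stages
```

Recording a timing per stage makes it possible to see which component dominates end-to-end latency, which is the kind of breakdown a latency analysis of such a pipeline would need.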

