In this paper, we present OpenS2S, a fully open-source, transparent, end-to-end large speech language model (LSLM) for empathic voice interaction. Built on the empathic speech-to-text model BLSP-Emo, OpenS2S achieves low-latency speech generation through a streaming interleaved decoding architecture. To facilitate end-to-end learning, it incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathic voice conversations at low cost: we leverage large language models to generate empathic content and a controllable text-to-speech system to introduce speaker and emotional variation, yielding a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release OpenS2S in full, including the dataset, model weights, and pretraining and fine-tuning code, to support the broader research community and accelerate innovation in empathic voice systems.