Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Created by

Haebom

Author

Ruiyang Qin, Dancheng Liu, Gelei Xu, Zheyu Yan, Chenhui Xu, Yuting Hu, X. Sharon Hu, Jinjun Xiong, Yiyu Shi

Outline

In this paper, we propose a personal assistant system that enables personalized voice-based interaction by combining a large language model (LLM) with automatic speech recognition (ASR) running on edge devices (edge ASR-LLM). Existing ASR-LLM models are trained in high-performance computing environments and are large, which makes them difficult to deploy on edge devices. Rather than fine-tuning the ASR model or the LLM individually, we present a resource-efficient framework for cross-modal alignment on edge devices. The framework enables ASR-LLM alignment even on resource-constrained devices such as the NVIDIA Jetson Orin (8GB RAM), reducing training time by 50x while improving alignment quality by more than 50%. This is the first study to investigate efficient ASR-LLM alignment on resource-constrained edge devices.
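As a rough illustration of what "alignment without fine-tuning either model" can mean, the toy sketch below trains only a small linear projector that maps fixed ASR-side features into a fixed LLM-side embedding space. This is not the paper's framework: the summary does not describe the architecture, so the linear projector, the synthetic data, the dimensions, and the names `train_projector` and `mse` are all assumptions made for illustration.

```python
import random

def train_projector(asr_feats, llm_targets, lr=0.05, epochs=200):
    """Fit a linear map W (d_llm x d_asr) with plain SGD on squared error.

    Only W is updated; the features and targets stand in for the outputs of
    frozen ASR and LLM components (a hypothetical setup, not the paper's).
    """
    d_asr, d_llm = len(asr_feats[0]), len(llm_targets[0])
    W = [[0.0] * d_asr for _ in range(d_llm)]
    for _ in range(epochs):
        for x, y in zip(asr_feats, llm_targets):
            pred = [sum(W[i][j] * x[j] for j in range(d_asr)) for i in range(d_llm)]
            for i in range(d_llm):
                err = pred[i] - y[i]
                for j in range(d_asr):
                    W[i][j] -= lr * err * x[j]  # gradient of 0.5 * err**2
    return W

def mse(W, asr_feats, llm_targets):
    """Mean squared error between projected ASR features and LLM targets."""
    total, count = 0.0, 0
    for x, y in zip(asr_feats, llm_targets):
        for i, row in enumerate(W):
            pred = sum(w * xj for w, xj in zip(row, x))
            total += (pred - y[i]) ** 2
            count += 1
    return total / count

random.seed(0)
D_ASR, D_LLM, N = 4, 3, 32  # toy sizes chosen for the example

# Synthetic stand-ins: a hidden linear relation between the two spaces.
true_map = [[random.uniform(-1, 1) for _ in range(D_ASR)] for _ in range(D_LLM)]
asr_feats = [[random.uniform(-1, 1) for _ in range(D_ASR)] for _ in range(N)]
llm_targets = [
    [sum(true_map[i][j] * x[j] for j in range(D_ASR)) for i in range(D_LLM)]
    for x in asr_feats
]

zero_W = [[0.0] * D_ASR for _ in range(D_LLM)]
loss_before = mse(zero_W, asr_feats, llm_targets)
W = train_projector(asr_feats, llm_targets)
loss_after = mse(W, asr_feats, llm_targets)
trainable_params = D_ASR * D_LLM  # the only parameters updated in this sketch
print(f"loss before: {loss_before:.4f}, after: {loss_after:.6f}, "
      f"trainable parameters: {trainable_params}")
```

The point of the sketch is only that the trainable part can be very small relative to the models it connects, which is one plausible reason alignment of this kind could fit on a memory-limited device; the reported 50x and 50% figures come from the paper, not from this example.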

Takeaways, Limitations

• Takeaways:
  ◦ We present an ASR-LLM framework for efficient, personalized voice-based interaction on edge devices.
  ◦ Training time is reduced and alignment quality is improved in resource-constrained environments (50x speedup, more than 50% quality improvement).
  ◦ The results suggest that personalized voice input can be processed effectively.
  ◦ The work opens new possibilities for cross-modal alignment research on edge devices.

• Limitations:
  ◦ Results are presented only for specific edge devices, such as the NVIDIA Jetson Orin (8GB RAM); generalizability to other hardware environments needs to be verified.
  ◦ Further research is needed on robustness to different types of voice data and user characteristics.
  ◦ Further evaluation of the framework's performance and stability in real-world usage environments is needed.
  ◦ Energy efficiency is not specifically analyzed.
