Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Created by
  • Haebom

Authors

Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li

Outline

Motivated by the growing use of diverse multimedia in e-commerce, including images, short videos, and live streams, this paper proposes a method for learning vectorized product representations that unifies these domains. The authors point out that visual information alone is insufficient in this broad setting, where intra-product variation is high and inter-product similarity is strong, and propose to additionally exploit automatic speech recognition (ASR) text obtained from short videos and live streams. Specifically, they introduce AMPere (ASR-enhanced Multimodal Product Representation Learning), which uses an LLM-based ASR text summarizer to extract product-relevant information from noisy ASR transcripts and feeds it, together with visual data, into a multi-branch network that produces compact multimodal embeddings. Experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere and show that it improves cross-domain product retrieval.
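The sketch below illustrates the general shape of such a pipeline, not the authors' actual implementation: a placeholder LLM summarization step for noisy ASR text, a two-branch encoder that fuses visual and text features into a normalized embedding, and cosine-similarity retrieval across domains. All module names, dimensions, and the `summarize_asr` stub are illustrative assumptions; AMPere's real architecture and training objective may differ.

```python
# Minimal sketch of an AMPere-style pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def summarize_asr(asr_text: str) -> str:
    """Placeholder for the LLM-based ASR summarizer.
    In AMPere this step prompts an LLM to strip noise from the raw ASR
    transcript and keep only product-related phrases; here we simply
    truncate as a stand-in. The summary would then be encoded by a text
    encoder to produce txt_feat below."""
    return asr_text[:256]

class MultiBranchEncoder(nn.Module):
    """Fuses a visual feature and an ASR-summary text feature into one
    compact product embedding used for cross-domain retrieval."""
    def __init__(self, vis_dim=768, txt_dim=768, embed_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)    # visual branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)    # text branch
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)  # fusion head

    def forward(self, vis_feat, txt_feat):
        v = F.relu(self.vis_proj(vis_feat))
        t = F.relu(self.txt_proj(txt_feat))
        z = self.fuse(torch.cat([v, t], dim=-1))
        return F.normalize(z, dim=-1)  # unit-norm embedding for cosine retrieval

# Usage: embed a query (e.g., a live-stream clip) and gallery items
# (e.g., catalog images), then rank the gallery by cosine similarity.
encoder = MultiBranchEncoder()
query = encoder(torch.randn(1, 768), torch.randn(1, 768))
gallery = encoder(torch.randn(100, 768), torch.randn(100, 768))
scores = query @ gallery.T      # cosine similarity (embeddings are normalized)
topk = scores.topk(5).indices   # indices of the 5 most similar products
```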

Takeaways, Limitations

Takeaways:
  • An LLM-based text summarizer effectively extracts product information from noisy ASR transcripts.
  • AMPere, a multimodal representation learning model, represents products comprehensively across domains.
  • Experiments on a large-scale dataset validate the superiority of AMPere and confirm improved cross-domain product retrieval performance.
Limitations:
  • Performance may depend heavily on the quality of the LLM-based summarizer.
  • Generalization may be limited by the characteristics of the dataset used.
  • Further comparative analysis against other multimodal learning models is needed.