In this paper, we propose an external memory system that efficiently provides multimodal and multilingual real-world knowledge to language models (LMs) and vision-language models (VLMs), addressing their difficulty with complex reasoning tasks. Whereas existing approaches concatenate image and text tokens into long sequences, we represent multimodal and multilingual knowledge with a continuous memory, a compact set of dense embeddings, which is both more effective and more efficient. Our key insight is that the VLM itself can serve as the continuous memory encoder. This design improves performance on complex multimodal reasoning tasks, and we present a data- and parameter-efficient method to fine-tune the VLM into a memory encoder using only 1.2% of the model's parameters and 15.6K self-synthesized samples. The resulting method, CoMEM, encodes arbitrary multimodal and multilingual knowledge into just eight continuous embeddings; because the backbone VLM remains frozen during inference, the memory can be integrated in a flexible, plug-and-play manner. Extensive experiments on eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
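To make the mechanism concrete, the sketch below illustrates the general idea in PyTorch: compress the hidden states of retrieved knowledge into eight continuous embeddings and feed them, together with the query, to a frozen VLM. In the paper the VLM itself, parameter-efficiently fine-tuned, performs this compression; the small cross-attention pooling module, the class and function names, and the toy stand-in for the VLM used here are illustrative assumptions rather than the actual CoMEM implementation.

```python
# Minimal sketch of a continuous-memory pipeline, assuming a generic frozen VLM
# that maps input embeddings to hidden states. All module and function names
# below are hypothetical; CoMEM fine-tunes the VLM itself as the encoder.
import torch
import torch.nn as nn


class ContinuousMemoryEncoder(nn.Module):
    """Compress a long sequence of knowledge-token hidden states into a fixed
    number of continuous memory embeddings (eight in the paper)."""

    def __init__(self, hidden_dim: int, num_memory_tokens: int = 8):
        super().__init__()
        # Learnable query vectors, one per memory slot.
        self.memory_queries = nn.Parameter(
            torch.randn(num_memory_tokens, hidden_dim) * 0.02
        )
        # Lightweight cross-attention pooling; the trainable part stays small,
        # echoing the parameter-efficient fine-tuning described above.
        self.pool = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, knowledge_states: torch.Tensor) -> torch.Tensor:
        # knowledge_states: (batch, seq_len, hidden_dim) hidden states over the
        # retrieved multimodal/multilingual knowledge.
        batch = knowledge_states.size(0)
        queries = self.memory_queries.unsqueeze(0).expand(batch, -1, -1)
        memory, _ = self.pool(queries, knowledge_states, knowledge_states)
        return memory  # (batch, 8, hidden_dim)


def answer_with_memory(vlm, memory: torch.Tensor, question_embeds: torch.Tensor):
    """Plug-and-play use at inference: prepend the eight memory embeddings to
    the question's input embeddings and run the frozen VLM unchanged."""
    inputs = torch.cat([memory, question_embeds], dim=1)
    with torch.no_grad():  # the backbone VLM stays fixed
        return vlm(inputs)


if __name__ == "__main__":
    hidden_dim = 64
    vlm = nn.Identity()  # toy stand-in for a frozen VLM's forward pass
    encoder = ContinuousMemoryEncoder(hidden_dim)
    knowledge = torch.randn(2, 300, hidden_dim)  # long retrieved-knowledge sequence
    question = torch.randn(2, 20, hidden_dim)    # query embeddings
    memory = encoder(knowledge)                  # -> (2, 8, hidden_dim)
    out = answer_with_memory(vlm, memory, question)
    print(memory.shape, out.shape)
```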