Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Learning Unified User Quantized Tokenizers for User Representation

Created by
  • Haebom

Author

Chuan He, Yang Chen, Wuliang Huang, Tianyi Zheng, Jianhu Chen, Bin Dou, Yice Luo, Yun Zhu, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Xin-Wei Yao, Zhongle Xie

Outline

This paper addresses multi-source user representation learning, which plays a crucial role in delivering personalized services on web platforms. Previous studies have combined heterogeneous data sources via late-fusion approaches, which suffer from three major limitations: the lack of a unified representation framework, poor scalability of data compression and storage, and inflexible cross-task generalization. To address these challenges, the paper proposes U2QT (Unified User Quantized Tokenizers), a novel framework that combines early fusion of heterogeneous domains with cross-domain knowledge transfer. U2QT derives compact yet expressive feature representations from the Qwen3 embedding model and discretizes the causal embeddings into compact tokens through a multi-view RQ-VAE with a shared, source-specific codebook, ensuring efficient storage and semantic consistency. Experimental results show that U2QT outperforms task-specific baselines on future action prediction and recommendation tasks while offering storage and computational efficiency across a variety of downstream tasks. The unified tokenization framework also enables seamless integration with language models and supports industrial-scale applications.
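The core mechanism named above, residual quantization (the "RQ" in RQ-VAE), turns a continuous embedding into a short sequence of discrete tokens by repeatedly quantizing the leftover residual against a stack of codebooks. The toy sketch below illustrates only that idea with fixed random codebooks; the actual U2QT model learns its codebooks end-to-end and uses multi-view, source-aware structure not shown here, so all names and parameters in this snippet are hypothetical.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Toy residual quantization: at each level, pick the nearest codeword
    to the current residual, record its index as a token, and subtract it.
    `codebooks` is a list of (codebook_size, dim) arrays."""
    residual = x.astype(np.float64)
    tokens = []
    recon = np.zeros_like(residual)
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to every codeword
        idx = int(np.argmin(dists))                    # nearest codeword index = token
        tokens.append(idx)
        recon += cb[idx]                               # accumulate the reconstruction
        residual = residual - cb[idx]                  # quantize what is left next level
    return tokens, recon

# Hypothetical sizes, purely for illustration.
rng = np.random.default_rng(0)
dim, n_codes, n_levels = 8, 16, 3
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_levels)]
x = rng.normal(size=dim)
tokens, recon = residual_quantize(x, codebooks)
# `tokens` is the compact discrete representation of `x`;
# `recon` is the sum of the selected codewords at each level.
```

Storing only the token indices (here 3 small integers instead of 8 floats) is what gives this family of methods its storage efficiency, and discrete tokens are what make downstream integration with language models natural.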

Takeaways, Limitations

Takeaways:
Effectively fuses heterogeneous data sources, improving the performance of personalized services.
Addresses scalability and storage issues, increasing applicability in large-scale data environments.
Integration with language models opens up applicability to a variety of tasks.
Shows superior performance compared to task-specific baselines on future action prediction and recommendation tasks.
Limitations:
The summary lacks detailed information on specific performance metrics and baselines.
Further experiments and analysis are needed to determine the model's generalization ability.
Further validation of U2QT application cases in real industrial environments is needed.