Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PVChat: Personalized Video Chat with One-Shot Learning

Created by
  • Haebom

Authors

Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yucheng Chen, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo

Outline

This paper proposes PVChat, a personalized video large language model (ViLLM). Existing ViLLMs struggle with identity-aware understanding, for example recognizing that "Wilson is undergoing chemotherapy", whereas PVChat is designed to support question answering (QA) about a specific individual after learning from only a single video of that person. The approach trains a ViLLM enhanced with Mixture-of-Heads (MoH) attention on a synthetically augmented video-QA dataset. To build this dataset, the authors introduce an automatic augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negative samples from existing video data, then generates four types of QA data: existence, appearance, action, and location questions. To strengthen subject-specific feature learning, they also propose a ReLU-routed MoH attention mechanism and two new objective functions, Smooth Proximity Regularization and Head Activation Enhancement. A two-stage training strategy that proceeds from image pre-training to video fine-tuning lets the model learn progressively, from static attributes to dynamic representations. PVChat is reported to outperform existing state-of-the-art ViLLMs on diverse datasets covering medical scenarios, TV series, animation, and real-world videos.
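
The summary above does not give the formula for ReLU routing, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It assumes each attention head has already produced an output vector for a token and that a router has produced one score per head; the function name `relu_routed_mix` and the example numbers are hypothetical. Passing the scores through ReLU turns negative scores into exact zeros, so those heads contribute nothing for that token, while the remaining heads are weighted by their scores.

```python
from typing import List


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs become 0."""
    return x if x > 0.0 else 0.0


def relu_routed_mix(head_outputs: List[List[float]],
                    router_logits: List[float]) -> List[float]:
    """Combine per-head outputs with ReLU gates (hypothetical helper).

    head_outputs:  one output vector per attention head, all the same length.
    router_logits: one router score per head (assumed to be given).
    """
    if len(head_outputs) != len(router_logits):
        raise ValueError("one router logit is required per head")

    gates = [relu(score) for score in router_logits]
    dim = len(head_outputs[0])
    mixed = [0.0] * dim

    for gate, output in zip(gates, head_outputs):
        if gate == 0.0:
            continue  # this head is switched off for the current token
        for i in range(dim):
            mixed[i] += gate * output[i]
    return mixed


# Illustrative values only: three heads, two-dimensional outputs.
example_outputs = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
example_logits = [0.5, -1.0, 2.0]  # the second head gets a gate of 0
print(relu_routed_mix(example_outputs, example_logits))  # [8.5, 8.0]
```

Unlike a softmax gate, which gives every head a nonzero weight, a ReLU gate can switch heads off entirely. That is one plausible reason to pair it with an objective that keeps enough heads active, although how PVChat's Head Activation Enhancement achieves this is not described in this summary.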

Takeaways, Limitations

Takeaways:
Presents PVChat, a ViLLM capable of personalized video understanding after learning from a single video.
Expands potential applications in fields such as healthcare and smart homes.
Improves ViLLM performance through synthetic data augmentation and new training strategies.
Answers a variety of question types about a person while preserving that person's identity.
Limitations:
The generalization performance of a training method that relies on synthetic data still needs to be verified.
Further research is needed on robustness in complex real-world situations.
Further analysis is needed of how accurately specific individuals are identified.
The impact of dataset bias on model performance needs to be considered.