
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VLA-Mark: A cross modal watermark for large vision-language alignment model

Created by
  • Haebom

Author

Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, Xuming Hu

Outline

This paper proposes VLA-Mark, a novel watermarking technique for protecting the intellectual property of vision-language models. Existing text watermarking methods can degrade vision-language alignment and leave semantically important concepts vulnerable because of biased token selection and static injection strategies. VLA-Mark integrates multi-scale vision-language alignment metrics (local patch similarity, global semantic coherence, and contextual attention patterns) to embed watermarks without retraining the model while preserving semantic fidelity. An entropy-sensitive mechanism dynamically balances watermark strength against semantic preservation, prioritizing visual grounding at low-uncertainty generation steps. Experiments show 7.4% lower perplexity (PPL) and 26.6% higher BLEU than existing methods, along with near-perfect detection (98.8% AUC). The method also retains 96.1% resistance to attacks such as paraphrasing and synonym substitution while preserving text-to-visual consistency, setting a new standard for high-quality multimodal watermarking.
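
The entropy-sensitive balancing described above can be illustrated with a minimal sketch. The snippet below shows a hypothetical next-token biasing step, assuming a green-list style vocabulary partition and a precomputed per-token cross-modal alignment score; the function and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch

def watermark_logits(logits, alignment_scores, green_mask, base_delta=2.0):
    """Entropy-adaptive watermark biasing (illustrative sketch, not the paper's code).

    logits:           (vocab_size,) next-token logits from the vision-language model
    alignment_scores: (vocab_size,) hypothetical per-token cross-modal alignment score,
                      e.g. a fusion of patch similarity, global semantic similarity,
                      and attention-based scores
    green_mask:       (vocab_size,) boolean mask marking the watermark ("green") tokens
    base_delta:       maximum logit bias added to green-list tokens
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    uncertainty = (entropy / max_entropy).clamp(0.0, 1.0)  # 0 = confident, 1 = uncertain

    # Low-entropy (visually grounded) steps: weaken the watermark bias and favor
    # tokens that align well with the image; high-entropy steps carry more of the
    # watermark signal.
    delta = base_delta * uncertainty
    biased = logits + delta * green_mask.float()
    biased = biased + (1.0 - uncertainty) * alignment_scores
    return biased
```

In this sketch, detection would then follow the usual green-list approach of counting how many generated tokens fall in the watermark partition; the paper's actual detector and alignment fusion may differ.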

Takeaways, Limitations

Takeaways:
Presents a method for effectively embedding watermarks in vision-language models without retraining.
Achieves better text quality (PPL, BLEU), higher detection rates, and stronger attack resistance than existing methods.
Demonstrates a technique that inserts watermarks while maintaining vision-language consistency.
Sets a new standard for high-quality multimodal watermarking.
Limitations:
Resistance to attacks beyond those evaluated in the paper requires further study.
Generalization to other vision-language models and datasets remains to be evaluated.
Determining optimal parameters for the entropy-sensitive mechanism needs further research.