This page organizes papers on artificial intelligence from around the world. Summaries are generated with Google Gemini, and the page is operated on a non-profit basis. Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.
Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Created by
Haebom
Author
Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
Outline
This paper investigates the limits of vision language models (VLMs) in reading text. Humans readily recognize words even when the characters are fragmented, fused, or partially occluded, yet the performance of state-of-the-art VLMs degrades sharply under the same conditions. To probe this, the study builds a benchmark inspired by psychological experiments, covering both Chinese ideographs and English alphabetic words. The results show that although VLMs inherit general visual invariances, they lack the compositional prior knowledge required for robust literacy, and the paper characterizes this as a structural limitation rather than a superficial one.
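To make the kind of stimuli concrete, the sketch below shows one hypothetical way such a perturbation could be generated and scored. This is an illustration only, not the authors' released benchmark: `render_word`, `occlude`, and `query_vlm` are assumed names, and `query_vlm` stands in for whichever VLM API is under test.

```python
# Hypothetical sketch: render a word, partially occlude it, and ask a VLM to read it.
from PIL import Image, ImageDraw, ImageFont

def render_word(word: str, size=(320, 96)) -> Image.Image:
    """Render a word as a plain black-on-white image."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a real TTF to render Chinese glyphs
    draw.text((10, 30), word, fill="black", font=font)
    return img

def occlude(img: Image.Image, frac: float = 0.4) -> Image.Image:
    """Occlude the text with a horizontal bar covering `frac` of the image height."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    bar_h = int(h * frac)
    top = (h - bar_h) // 2
    draw.rectangle([0, top, w, top + bar_h], fill="gray")
    return out

def evaluate(words, query_vlm) -> float:
    """Fraction of occluded words the model transcribes correctly.
    `query_vlm(image, prompt)` is a placeholder for the model under test."""
    correct = 0
    for word in words:
        stimulus = occlude(render_word(word))
        answer = query_vlm(stimulus, "What word is written in this image?")
        correct += answer.strip().lower() == word.lower()
    return correct / len(words)
```

A human reader shrugs off the gray bar; the benchmark's finding is that VLM accuracy collapses on exactly this sort of degraded but still legible input.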
Takeaways, Limitations
•
Takeaways:
◦
The paper identifies a structural limitation in VLMs' ability to recognize written characters.
◦
It highlights concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
◦
It underscores the need for architectures and training strategies that encode symbol segmentation, composition, and binding across different writing systems.
•
Limitations:
◦
The underlying cause of the degradation in VLMs' character recognition is not isolated; it is only broadly characterized as a structural limitation.
◦
The benchmark and evaluation protocol cover only Chinese and English, so generality to other languages and writing systems is not guaranteed.
◦
It raises the need for new architectures and training strategies without proposing a concrete one.