Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding

Created by
  • Haebom

Authors

Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu

Outline

PP-DocBee2 is an advanced version of PP-DocBee designed to improve multimodal document understanding. Built on a large multimodal model architecture, it addresses the limitations of its predecessor through three main technical improvements: higher-quality synthetic data, an improved visual feature fusion strategy, and an optimized inference methodology. Together these yield an 11.4% performance improvement on the authors' internal benchmark of Chinese business documents and a 73.0% reduction in inference latency relative to the baseline version. The key innovation is a data quality optimization strategy for multimodal document tasks: a large-scale multimodal pre-trained model evaluates the data, and a novel statistical criterion filters out outliers so that only high-quality training data remains. In addition, motivated by the observation that intermediate features in multimodal models are underutilized, the authors decompose the ViT's representational capacity by layer and apply a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleMIX .
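
To make the data filtering idea concrete, here is a minimal sketch of score-based outlier removal. It assumes each training sample already has a numeric quality score (for example, a loss assigned by a pre-trained multimodal model), and it uses the modified z-score based on the median absolute deviation as a stand-in criterion. The summary does not specify the paper's actual statistical criterion, so the function name `filter_outliers`, the threshold of 3.5, and the example scores are all illustrative assumptions.

```python
from statistics import median


def filter_outliers(scores, threshold=3.5):
    """Return the indices of samples whose score is not an outlier.

    Assumption: the modified z-score (median absolute deviation) is used
    here for illustration only; it is not the criterion from the paper.
    """
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    if mad == 0:
        # At least half of the scores equal the median, so this criterion
        # cannot rank deviations; keep every sample.
        return list(range(len(scores)))
    keep = []
    for i, s in enumerate(scores):
        modified_z = 0.6745 * (s - med) / mad
        if abs(modified_z) <= threshold:
            keep.append(i)
    return keep


# Hypothetical per-sample scores; the last one is far from the rest.
scores = [1.0, 1.1, 0.9, 1.2, 1.0, 9.0]
kept = filter_outliers(scores)
print(kept)
```

A median-based criterion is chosen here because a few extreme scores do not distort it the way they would distort a mean and standard deviation; any other robust rule could be substituted.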

Takeaways, Limitations

Takeaways:
Improves multimodal document understanding (11.4% higher performance on an internal benchmark and 73.0% lower inference latency) through higher-quality synthetic data, an improved visual feature fusion strategy, and an optimized inference methodology.
Presents a data quality optimization strategy that uses a large-scale multimodal pre-trained model to evaluate training data and filter out outliers.
Proposes a layer-wise decomposition of ViT representations and a novel feature fusion strategy to strengthen the ViT's representational capacity.
Improves accessibility by releasing the source code and pre-trained models as open source.
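
As a rough illustration of the layer-wise feature fusion mentioned in the takeaways, the sketch below combines token features taken from several ViT layers with a weighted average. The summary does not describe the paper's actual fusion strategy, so the weighted average, the function name `fuse_layers`, the choice of layers, and the toy data are all assumptions made for illustration.

```python
def fuse_layers(hidden_states, layer_ids, weights=None):
    """Fuse token features from several layers by a weighted average.

    hidden_states: list indexed by layer; each entry is a list of token
                   vectors (lists of floats) with identical shapes.
    layer_ids:     which layers to combine (a hypothetical choice).
    weights:       one weight per selected layer; defaults to uniform.
    """
    if weights is None:
        weights = [1.0 / len(layer_ids)] * len(layer_ids)
    if len(weights) != len(layer_ids):
        raise ValueError("need one weight per selected layer")
    selected = [hidden_states[i] for i in layer_ids]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, layer in zip(weights, selected):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * layer[t][d]
    return fused


# Toy example: 3 layers, 2 tokens, 2-dimensional features.
hidden_states = [
    [[0.0, 0.0], [2.0, 2.0]],  # layer 0
    [[2.0, 4.0], [4.0, 0.0]],  # layer 1
    [[4.0, 8.0], [0.0, 4.0]],  # layer 2
]
fused = fuse_layers(hidden_states, layer_ids=[1, 2])
print(fused)
```

The point of the sketch is only that intermediate layers contribute to the final visual representation instead of the last layer alone; a real model would do this on tensors and might learn the weights.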
Limitations:
The evaluation relies on an internal benchmark, so generalization to external datasets has not been validated.
Further research is needed on how well the data quality optimization strategy generalizes and whether it applies to other languages and document types.
The reported results are specific to Chinese business documents; whether they carry over to other languages or document types remains to be shown.