Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Created by
  • Haebom

Authors

Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Liu, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen

Outline

This paper proposes PIN (Paired and Interleaved multimodal documents), a novel data format designed to integrate visual and textual information more deeply. It does so by pairing semantically rich Markdown files with images that capture the entire document layout. Based on this format, the authors release two large-scale open-source datasets, PIN-200M (approximately 200 million documents) and PIN-14M (approximately 14 million documents), collected from a variety of web and scientific sources in English and Chinese. The datasets come with detailed statistical analyses and quality signals, so researchers can filter and select data suited to specific tasks. Together, these provide a foundation for new research on pretraining strategies and for developing knowledge-intensive large multimodal models (LMMs).
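As a rough illustration of how a paired document and its quality signals might be used for filtering, here is a minimal Python sketch. It is not the dataset's actual schema or tooling: the `PinDocument` class, its field names (`markdown`, `overall_image`, `language`, `quality_signals`), the `text_length` signal, and the file paths are all hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class PinDocument:
    """One paired sample: Markdown text plus an image of the whole page.

    All field names are hypothetical and used for illustration only.
    """
    markdown: str            # semantically rich text content
    overall_image: str       # path to an image of the full document layout
    language: str            # e.g. "en" or "zh"
    quality_signals: dict = field(default_factory=dict)


def filter_documents(docs, language=None, min_signals=None):
    """Keep documents that match a language and meet per-signal minimums."""
    min_signals = min_signals or {}
    kept = []
    for doc in docs:
        if language is not None and doc.language != language:
            continue
        # A signal that is missing from a document counts as failing.
        if all(
            doc.quality_signals.get(name, float("-inf")) >= threshold
            for name, threshold in min_signals.items()
        ):
            kept.append(doc)
    return kept


# Toy records; the values are made up for the example.
docs = [
    PinDocument("# Title\n\nBody text.", "pages/0001.png", "en", {"text_length": 1200}),
    PinDocument("# 标题\n\n正文。", "pages/0002.png", "zh", {"text_length": 300}),
    PinDocument("# Short", "pages/0003.png", "en", {"text_length": 40}),
]

selected = filter_documents(docs, language="en", min_signals={"text_length": 100})
```

With these toy records, `selected` holds only the first document: the second is excluded by language and the third by the `text_length` threshold. Treating a missing signal as a failure is a conservative design choice for this sketch; a real pipeline might prefer to keep such documents or handle them separately.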

Takeaways, Limitations

Takeaways:
Proposes PIN, a new multimodal data format that enables deep integration of visual and textual information.
Contributes to LMM research by releasing the large-scale open-source multimodal datasets PIN-200M and PIN-14M.
Improves data usability by providing detailed statistical analyses and quality signals.
Points toward improved LMM pretraining strategies and the development of knowledge-intensive LMMs.
Limitations:
Further evaluation of the quality and diversity of the dataset may be necessary.
Further research may be needed on the general adoption and compatibility of PIN formats.
The datasets cover only English and Chinese, so the resulting language bias may need to be taken into account.