Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Created by
  • Haebom

Authors

Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Lin, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen

Outline

To address the limitations of large multimodal models (LMMs), which struggle to integrate visual and textual information, this paper proposes a novel data format, PIN (Paired and Interleaved multimodal documents). The PIN format facilitates deep integration of visual and textual information by pairing semantically rich Markdown files with images that capture the entire document layout. Building on this format, the authors release two large-scale open-source datasets, PIN-200M (200 million documents) and PIN-14M (14 million documents), collected from various web and scientific sources in English and Chinese. The datasets are accompanied by detailed statistical analysis and quality signals, so researchers can filter and select data for specific tasks.
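As a rough illustration of the paired format and of filtering on quality signals, the sketch below models one sample as a Markdown string plus a path to a whole-document image. The class name `PinDocument`, its field names, the `text_quality` signal, and the `filter_documents` helper are all hypothetical and chosen for illustration; the released datasets may use a different schema.

```python
from dataclasses import dataclass, field


@dataclass
class PinDocument:
    """Hypothetical in-memory view of one PIN sample (field names are assumptions)."""
    markdown: str        # semantically rich Markdown text
    overall_image: str   # path to an image capturing the whole document layout
    language: str = "en"
    quality_signals: dict = field(default_factory=dict)


def filter_documents(docs, min_signals, language=None):
    """Keep documents whose quality signals meet every threshold in min_signals."""
    kept = []
    for doc in docs:
        if language is not None and doc.language != language:
            continue
        # A missing signal is treated as failing its threshold.
        if all(doc.quality_signals.get(name, float("-inf")) >= threshold
               for name, threshold in min_signals.items()):
            kept.append(doc)
    return kept


# Toy records; the signal name and values are invented for this example.
docs = [
    PinDocument("# Title\n\nBody text.", "a.png", "en", {"text_quality": 0.9}),
    PinDocument("# 标题", "b.png", "zh", {"text_quality": 0.4}),
    PinDocument("# Other", "c.png", "en", {}),
]
selected = filter_documents(docs, {"text_quality": 0.5}, language="en")
print([d.overall_image for d in selected])
```

Treating a missing signal as a failure is a conservative design choice for this sketch; a real pipeline might instead keep such documents or handle them separately.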

Takeaways, Limitations

Takeaways:
Proposes a new multimodal data format, PIN, that enables deep integration of visual and textual information.
Contributes to LMM research by releasing the large-scale open-source multimodal datasets PIN-200M and PIN-14M.
Improves the usability of the datasets by providing detailed statistical analysis and quality signals.
Provides a basis for research on improved knowledge-intensive LMMs and pre-training strategies.
Limitations:
Additional analysis of the quality and bias of the datasets may be required.
Further study is needed to assess the generality of the PIN format and to compare it with other multimodal data formats.
Although the datasets are large, certain domains or types of data may be overrepresented.