
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization

Created by
  • Haebom

Author

Martin Kišš, Michal Hradiš, Martina Dvořáková, Václav Jiroušek, Filip Kersch

Outline

The AnnoPage Dataset is a new dataset of 7,550 pages from historical documents, mainly in Czech and German, dating from 1485 to the present, with a focus on the late 19th and early 20th centuries. It is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABBs) covering 25 categories of non-textual elements, such as images, maps, decorative elements, and charts, following the Czech Image Document Processing Methodology. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from several existing, mainly historical, document datasets to increase variability and maintain continuity. It is divided into development and test subsets, with the test set carefully selected to preserve the category distribution. The authors provide baseline results using YOLO and DETR object detectors as a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419) along with ground-truth annotations in YOLO format.
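As an illustration of what axis-aligned bounding boxes in YOLO format look like in practice, here is a minimal sketch that converts label lines into pixel-space boxes. It assumes the conventional YOLO text layout, `class_id x_center y_center width height`, with coordinates normalized to the range 0 to 1; whether the files on Zenodo follow exactly this layout should be checked against the dataset itself. The names `Box`, `parse_yolo_line`, and `parse_yolo_labels` are hypothetical and not part of any dataset tooling.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_line(line: str, page_width: int, page_height: int) -> Box:
    """Convert one label line to a pixel-space AABB.

    Assumes the conventional layout: class_id x_center y_center width height,
    with all four coordinates normalized by the page size.
    """
    class_id, x_center, y_center, width, height = line.split()
    xc = float(x_center) * page_width
    yc = float(y_center) * page_height
    w = float(width) * page_width
    h = float(height) * page_height
    return Box(int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def parse_yolo_labels(text: str, page_width: int, page_height: int) -> List[Box]:
    """Parse every non-empty line of a label file's contents."""
    return [
        parse_yolo_line(line, page_width, page_height)
        for line in text.splitlines()
        if line.strip()
    ]
```

The page width and height are not stored in the label file under this convention, so they have to be read from the corresponding page image before calling `parse_yolo_labels`.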

Takeaways, Limitations

Takeaways:
Provides a new large-scale dataset for layout analysis and object detection research on historical documents.
Provides accurate and consistent annotations based on the Czech image document processing methodology.
Includes pages from diverse historical document sources, which should help detection models generalize.
Provides YOLO and DETR baseline results as a reference point for future studies.
Is publicly available, making it accessible to the research community.
Limitations:
Focuses mainly on Czech and German documents, so it may be difficult to apply to documents in other languages.
May lack historical diversity because it is skewed toward documents from the late 19th and early 20th centuries.
Includes only 25 non-text element categories, so a finer-grained classification may be needed.
May be relatively small compared with other large-scale datasets.