Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Layout-Aware OCR for Black Digital Archives with Unsupervised Evaluation

Created by
  • Haebom

Authors

Fitsum Sileshi Beyene, Christopher L. Dancy

Outline

This paper addresses the digitization of Black digital archives, particularly historical Black newspapers, which remain structurally underrepresented in AI research and infrastructure. Inconsistent typography, visual degradation, and limited annotated layout data make accurate transcription difficult. In response, the authors present a layout-aware OCR pipeline tailored to Black newspaper archives, together with an unsupervised evaluation framework suited to low-resource archival settings. The pipeline combines synthetic layout generation, model pretraining on augmented data, and a fusion of state-of-the-art YOLO detectors. It is evaluated on a 400-page dataset drawn from 10 Black newspaper titles using three annotation-free metrics: semantic consistency score, region entropy, and text redundancy score. The results show that layout-aware OCR improves structural diversity and reduces redundancy, at the cost of a slight drop in consistency.
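
The summary names region entropy and text redundancy score as metrics that need no annotations, but it does not give their formulas. The sketch below is one plausible reading, not the paper's definition: it assumes region entropy is the Shannon entropy of the word distribution pooled over the OCR regions of a page, and that redundancy is the share of word n-grams that repeat an n-gram already seen on the same page. The function names `region_entropy` and `text_redundancy` and the parameter `n` are hypothetical.

```python
import math
from collections import Counter


def region_entropy(region_texts):
    """Shannon entropy (in bits) of the word distribution pooled over regions.

    Assumed definition for illustration; the paper may define it differently.
    """
    tokens = [tok for text in region_texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def text_redundancy(region_texts, n=3):
    """Fraction of word n-grams that repeat an n-gram seen earlier on the page.

    Assumed definition for illustration; n-grams never span two regions.
    """
    seen = set()
    repeated = 0
    total = 0
    for text in region_texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            total += 1
            if gram in seen:
                repeated += 1
            else:
                seen.add(gram)
    return repeated / total if total else 0.0
```

Both functions take a list of strings, one per detected region, so the same code can score a full-page baseline (a single string) and a layout-aware run (one string per region) without any ground-truth transcription. Under these assumed definitions, higher entropy and lower redundancy would correspond to the improvements in structural diversity and redundancy described above.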

Takeaways, Limitations

Takeaways:
The paper presents a layout-aware OCR pipeline and an unsupervised evaluation framework for digitizing Black newspaper archives.
It emphasizes the importance of AI-based document understanding that respects culturally specific layout logic.
It proposes an OCR evaluation method suitable for low-resource archival environments.
It lays a foundation for future community-driven, ethical archival AI systems.
Limitations:
The evaluation uses a relatively small dataset of 400 pages.
Layout-aware OCR incurs a slight trade-off in consistency.
Further research is needed to determine generalizability across different types of Black newspaper archives.