Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices

Created by
  • Haebom

Author

Parshva Dhilankumar Patel

Outline

This paper presents the design and implementation of a pipeline that efficiently extracts tabular data from scanned invoices using Optical Character Recognition (OCR). Tesseract OCR recognizes the text, and custom postprocessing logic then detects, aligns, and extracts the structured table data. The method combines dynamic preprocessing, table boundary detection, and row-to-column mapping optimized for noisy and non-standard invoice formats. The authors report that the resulting pipeline significantly improves data extraction accuracy and consistency, which supports real-world use cases such as automated financial workflows and digital archiving.
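The row-to-column mapping step mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Word` structure, the function names, and the pixel tolerances `y_tol` and `x_gap` are all assumptions made for illustration. The word boxes are assumed to resemble what an OCR engine reports (for example the `left`, `top`, `width`, `height`, and `text` fields returned by `pytesseract.image_to_data`), and the first detected row is assumed to be the table header.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR word box (hypothetical structure, not taken from the paper)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(words: List[Word], y_tol: int = 8) -> List[List[Word]]:
    """Group words into rows when their vertical centers are within y_tol pixels."""
    rows: List[List[Word]] = []
    centers: List[float] = []  # vertical center of the first word in each row
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        cy = w.top + w.height / 2
        if rows and abs(cy - centers[-1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
            centers.append(cy)
    # Within each row, read words left to right.
    return [sorted(r, key=lambda w: w.left) for r in rows]


def merge_cells(row: List[Word], x_gap: int = 25) -> List[list]:
    """Merge neighbouring words into one cell when the horizontal gap is small."""
    cells: List[list] = []  # each cell is [left, right, text]
    for w in row:
        if cells and w.left - cells[-1][1] <= x_gap:
            cells[-1][1] = w.left + w.width
            cells[-1][2] += " " + w.text
        else:
            cells.append([w.left, w.left + w.width, w.text])
    return cells


def to_table(words: List[Word], y_tol: int = 8, x_gap: int = 25) -> List[List[str]]:
    """Map cells to columns using the header row's left edges (an assumption)."""
    rows = [merge_cells(r, x_gap) for r in group_rows(words, y_tol)]
    starts = [cell[0] for cell in rows[0]]  # header cells define the columns
    table = []
    for cells in rows:
        line = [""] * len(starts)
        for left, _right, text in cells:
            # Assign the cell to the column whose start is horizontally closest.
            col = min(range(len(starts)), key=lambda i: abs(starts[i] - left))
            line[col] = (line[col] + " " + text).strip()
        table.append(line)
    return table


# Hand-made sample boxes standing in for OCR output of a slightly skewed scan.
sample = [
    Word("Item", 10, 10, 40, 12), Word("Qty", 200, 11, 30, 12), Word("Price", 300, 9, 45, 12),
    Word("Blue", 10, 40, 35, 12), Word("Widget", 50, 41, 55, 12),
    Word("2", 205, 40, 8, 12), Word("9.50", 302, 42, 35, 12),
    Word("Cable", 12, 70, 45, 12), Word("10", 203, 71, 16, 12), Word("1.20", 301, 69, 35, 12),
]

for line in to_table(sample):
    print(line)
```

Anchoring columns to the header row keeps the sketch short, but it would misplace cells on invoices without a clean header line; a real pipeline would likely need a more tolerant column model.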

Takeaways, Limitations

Takeaways:
Contributes to improving the efficiency and accuracy of OCR-based pipelines for extracting table data from invoices.
Builds a robust system that can handle noisy and non-standard invoice formats.
Demonstrates applicability to real-world use cases such as automated financial workflows and digital archiving.
Limitations:
Further research is needed on how far the pipeline can be optimized for specific invoice formats and how well it generalizes across them.
Lacks performance evaluation and comparative analysis across different invoice formats.
Depends on the performance of Tesseract OCR, so OCR errors may carry over into the final results.
Lacks validation of processing performance on complex table structures and corrupted invoices.
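One common way to limit the effect of OCR errors is to flag low-confidence words for manual review instead of trusting them. The sketch below is an assumption-laden illustration, not something the paper describes: it assumes each word comes with a confidence score on a 0 to 100 scale (as in the `conf` field of Tesseract's word-level data output), and both the function name `flag_low_confidence` and the threshold of 60 are hypothetical.

```python
from typing import List, Tuple


def flag_low_confidence(
    words: List[Tuple[str, float]], min_conf: float = 60.0
) -> List[Tuple[str, bool]]:
    """Return (text, needs_review) pairs; needs_review is True below min_conf.

    min_conf is an arbitrary illustrative threshold, not a recommended value.
    """
    return [(text, conf < min_conf) for text, conf in words]


# Hand-made sample: "1O" (letter O) is a typical misread of the quantity "10".
sample_words = [("Cable", 96.0), ("1O", 41.0), ("1.20", 88.0)]

for text, needs_review in flag_low_confidence(sample_words):
    print(text, "REVIEW" if needs_review else "ok")
```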