MinerU - High-Quality PDF Conversion and Data Extraction Tool
Taylor
| Task Type | Description | Models |
| --- | --- | --- |
| Layout Detection | Locate the elements of a document, including images, tables, text, titles, and formulas | DocLayout-YOLO_ft, YOLO-v10_ft, LayoutLMv3_ft |
| Formula Detection | Locate formulas in a document, both inline and block | YOLOv8_ft |
| Formula Recognition | Convert formula images into LaTeX source code | UniMERNet |
| OCR | Extract text from images (both text location and recognition) | PaddleOCR |
| Table Recognition | Convert table images into the corresponding source code (LaTeX/HTML/Markdown) | PaddleOCR+TableMaster, StructEqTable |
| Reading Order | Sort and concatenate discrete text paragraphs | Coming Soon! |
# Create and activate a virtual environment
conda create -n mineru python=3.10
conda activate mineru
# Install MinerU
pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com

# Convert a PDF to Markdown
magic-pdf convert input.pdf -o output_folder
# Convert with detailed settings applied
magic-pdf convert input.pdf -o output_folder --table.enable true --formula.enable true

from magic_pdf import MagicPDF
# Create a MagicPDF object
magic_pdf = MagicPDF()
# Load the PDF file
doc = magic_pdf.load_pdf("input.pdf")
# Convert to Markdown
magic_pdf.convert(doc, output_path="output_folder")
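
The command-line conversion shown earlier handles one file at a time. For a folder of PDFs, one option is to generate a command per file and run them in sequence. The following is a minimal sketch under stated assumptions: `build_convert_command` and `build_batch_commands` are hypothetical helper names, not part of MinerU, and the subcommand and flags are copied as-is from the commands above without being checked against any particular MinerU release, so compare them with `magic-pdf --help` on your installation before relying on them.

```python
from pathlib import Path


def build_convert_command(pdf_path, output_dir, enable_table=False, enable_formula=False):
    """Return the argument list for one conversion.

    Assumption: mirrors the CLI form used earlier in this post
    (`magic-pdf convert <pdf> -o <dir> [--table.enable true] [--formula.enable true]`).
    """
    cmd = ["magic-pdf", "convert", str(pdf_path), "-o", str(output_dir)]
    if enable_table:
        cmd += ["--table.enable", "true"]
    if enable_formula:
        cmd += ["--formula.enable", "true"]
    return cmd


def build_batch_commands(input_dir, output_root, **options):
    """Build one command per *.pdf in input_dir, each with its own output subfolder."""
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Use the file name without extension as the per-document output folder.
        commands.append(build_convert_command(pdf, Path(output_root) / pdf.stem, **options))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the argument list separately from running it makes it easy to print and inspect the commands first.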