UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Created by: 최윤진
Category: Tech
- Title: UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
  - A universal OCR-free, visually-situated language understanding model based on an MLLM
  - Fine-tunes only 1.2% of the parameters, keeping training cost low
  - Unified Instruction Format
    - Converts diverse datasets into a unified instruction-tuning format
      - Text Reading
      - Key Points Generation
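The conversion above can be sketched as follows. This is a minimal illustration of the unified instruction format idea, not the authors' code; the field names, prompt wording, and `<image>` placeholder are assumptions.

```python
# Sketch: wrap raw samples into a unified instruction-tuning format.
# Field names and prompt templates are illustrative assumptions.

def to_instruction_sample(image_path: str, question: str, answer: str) -> dict:
    """Wrap a raw (image, question, answer) triple as an instruction-tuning sample."""
    return {
        "image": image_path,
        "instruction": f"Human: <image> {question}\nAI:",
        "response": answer,
    }

def to_text_reading_sample(image_path: str, full_text: str) -> dict:
    """Auxiliary Text Reading task: ask the model to read out all text in the image."""
    return {
        "image": image_path,
        "instruction": "Human: <image> Recognize all the text in the image.\nAI:",
        "response": full_text,
    }

sample = to_instruction_sample("doc_001.png", "What is the invoice total?", "$42.00")
print(sample["instruction"])
```

With every dataset expressed this way, documents, tables, charts, and screenshots can all be trained with a single generation objective.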
  - Applies a shape-adaptive cropping module
  - The base MLLM is mPLUG-Owl
  - Data domains:
    - Documents
    - Tables
    - Charts
    - Natural images
    - Webpage screenshots
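The shape-adaptive cropping idea can be sketched as below: choose a grid of sub-images whose overall aspect ratio best matches the input image, so a frozen low-resolution encoder can still cover a high-resolution input. The candidate grids and the scoring rule here are simplified assumptions, not the paper's exact resolution- and shape-matching procedure.

```python
# Sketch of shape-adaptive cropping: pick a (cols, rows) grid matching the
# image's aspect ratio, then crop the image into that grid of sub-images.

def choose_grid(img_w: int, img_h: int, max_cells: int = 20) -> tuple[int, int]:
    """Return (cols, rows) whose aspect ratio is closest to the image's."""
    target = img_w / img_h
    candidates = [(c, r) for c in range(1, max_cells + 1)
                  for r in range(1, max_cells + 1) if c * r <= max_cells]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

def crop_boxes(img_w: int, img_h: int, cols: int, rows: int) -> list:
    """Pixel boxes (left, top, right, bottom) for each sub-image."""
    return [(img_w * c // cols, img_h * r // rows,
             img_w * (c + 1) // cols, img_h * (r + 1) // rows)
            for r in range(rows) for c in range(cols)]

cols, rows = choose_grid(1920, 1080)   # a wide image yields a wide grid
boxes = crop_boxes(1920, 1080, cols, rows)
```

Each box would then be cropped, resized to the frozen encoder's input resolution, and encoded as an independent sub-image.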
- Our contributions in this work are four-fold:
  - We first propose instruction tuning with Multimodal Large Language Models for OCR-free Visually-situated Language Understanding.
  - We build an instruction-tuning dataset covering 5 domains of visually-situated language understanding: document, table, chart, natural image, and webpage screenshot.
  - We design a shape-adaptive cropping module to utilize the frozen low-resolution vision encoder for processing high-resolution images.
  - UReader achieves state-of-the-art OCR-free performance in 8 out of 10 tasks, across 5 domains.
1. Introduction
- Existing MLLMs perform poorly on text-rich images
- Approaches to visually-situated language understanding fall broadly into two categories:
  - Two-stage models
    - Reuse an existing OCR model or API
  - End-to-end models
    - Incur high training costs
2. Related Work
- Two-stage models (using OCR)
  - An OCR model/API recognizes the text in the image, which a language model then processes
    - UDOP (Tang et al., 2023): designs a Joint Text-Layout Reconstruction task
    - LayoutLMv3 (Huang et al., 2022): reconstructs image tokens via Masked Image Modeling
- End-to-end models (trained directly, without OCR)
  - Learn to recognize text with a high-resolution image encoder
    - Pix2Struct (Lee et al., 2022): generates the HTML DOM tree from a webpage screenshot alone
    - Donut (Kim et al., 2022): pre-training designed to generate all the text in a document image (training cost: 192 A100-days)
3. UReader Model Overview
- The input image is first pre-processed by the shape-adaptive cropping module.
- The resulting sub-images are then passed in parallel through the visual encoder and visual abstractor.
- A crop position encoding module introduces spatial information across the sub-images.