UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Created by

최윤진

Created at

2025/03/20 7:23 AM

Abstract

•

UReader

◦

An MLLM for universal, OCR-free visually-situated language understanding

◦

Fine-tunes only 1.2% of the parameters, at low cost

◦

Unified Instruction Format

▪

Converts diverse datasets into an instruction-tuning format

•

Text Reading

•

Key Points Generation

◦

Applies a shape-adaptive cropping module

◦

The base MLLM is mPLUG-Owl

◦

Data types

▪

Document

▪

Table

▪

Chart

▪

Natural image

▪

Webpage screenshot

•

Our contributions in this work are four-fold:

◦

We first propose instruction tuning with Multimodal Large Language Models for OCR-free Visually-situated Language Understanding.

◦

We build an instruction-tuning dataset covering 5 domains of visually-situated language understanding: document, table, chart, natural image, and webpage screenshot.

◦

We design a shape-adaptive cropping module to utilize the frozen low-resolution vision encoder for processing high-resolution images.

◦

UReader achieves state-of-the-art OCR-free performance in 8 out of 10 tasks, across 5 domains.

1. Introduction

•

Existing MLLMs perform poorly on text-rich images

•

There are two main approaches to visually-situated language understanding

◦

Two-stage models

▪

Rely on an existing OCR model or API

◦

End-to-end models

▪

High training costs

2. Related Work

•

Two-stage models (with OCR)

◦

Recognize the text in the image with an OCR model or API, then process it with a language model

▪

UDOP (Tang et al., 2023): designs a joint text-layout reconstruction task

▪

LayoutLMv3 (Huang et al., 2022): reconstructs image tokens with masked image modeling

•

End-to-end models (trained directly, without OCR)

◦

Learn text recognition with a high-resolution image encoder

▪

Pix2Struct (Lee et al., 2022): generates the HTML DOM tree from webpage screenshots alone

▪

Donut (Kim et al., 2022): designs a pretraining task that generates all the text in a document image (training cost: 192 A100-days)

3. UReader Model Overview

•

The input image is firstly pre-processed by a shape-adaptive cropping module

•

The resulting sub-images are then simultaneously passed through the visual encoder and visual abstractor

•

We apply a crop position encoding module to introduce spatial information across sub-images.

3.1 Shape Adaptive Cropping Module

•

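Grid pool: $\{g = (n_h \times n_w) \mid n_h \cdot n_w \le N_c,\ n_h \in \mathbb{N},\ n_w \in \mathbb{N}\}$

◦

Build a pool of grids, as in the Pre-defined Grids of the figure above, and select the best grid from it.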

•

two rules should be followed:

◦

(1) The grid should preserve the resolution of the image as much as possible

◦

(2) the grid should fit the aspect ratio of the input image

•

Define a score that is maximized when the two rules above are best satisfied

◦

To measure the resolution coherence and shape similarity between the image and each grid,

•

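Scored by these rules, the example image in the figure above is best split into a grid of 2 rows × 3 columns, which preserves the image's information well.

•

Visual Encoder

◦

The cropped images produced this way are passed through the visual encoder

▪

The global image is also fed in at this step as one extra (+1) input.

•

Visual Abstractor

◦

Abstracts the image information with 65 learnable queries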
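
The grid selection described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the post does not reproduce the scoring formula, so the sketch assumes an IoU-style score between top-left-aligned boxes for both rules. The function names, the 224 × 224 encoder resolution, and the limit of 20 crops are illustrative assumptions. The token count assumes the 65 queries apply to each encoded image.

```python
from itertools import product


def build_grid_pool(max_crops):
    """All (rows, cols) grids with rows * cols <= max_crops."""
    return [
        (nh, nw)
        for nh, nw in product(range(1, max_crops + 1), repeat=2)
        if nh * nw <= max_crops
    ]


def iou_top_left(box_a, box_b):
    """IoU of two (height, width) boxes that share the same top-left corner."""
    (ha, wa), (hb, wb) = box_a, box_b
    inter = min(ha, hb) * min(wa, wb)
    union = ha * wa + hb * wb - inter
    return inter / union


def select_grid(image_hw, encoder_hw, max_crops):
    """Pick the grid that best satisfies both rules (assumed IoU-based score)."""
    H, W = image_hw
    Hv, Wv = encoder_hw

    def score(grid):
        nh, nw = grid
        # Rule 1: the grid should preserve the image resolution.
        resolution = iou_top_left((H, W), (nh * Hv, nw * Wv))
        # Rule 2: the grid should fit the aspect ratio (compared at equal width).
        shape = iou_top_left((nw * H / W, nw), (nh, nw))
        return resolution + shape

    return max(build_grid_pool(max_crops), key=score)


def num_visual_tokens(grid, queries_per_image=65):
    """Local crops plus one global image, each abstracted into a fixed number of tokens."""
    nh, nw = grid
    return (nh * nw + 1) * queries_per_image


grid = select_grid(image_hw=(448, 672), encoder_hw=(224, 224), max_crops=20)
print(grid, num_visual_tokens(grid))
```

For a 448 × 672 image and a 224 × 224 encoder, only the 2 × 3 grid matches both the resolution and the aspect ratio exactly, so it receives the highest score under this assumed scoring.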

3.2 Cropped Images Modeling with LLM

•

Adds a positional encoding for the row and column of each cropped image.

•

The LLM is frozen and trained with LoRA (16 A100 GPU-days)
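
The row and column encoding mentioned above can be sketched in plain Python. This is a hedged illustration only: the class name, table sizes, and random initialization are assumptions standing in for learnable embedding tables, and the sketch covers local crops only, since the post does not say how the global image is positioned.

```python
import random


def crop_positions(grid):
    """(row, col) index of every local crop, in raster order."""
    nh, nw = grid
    return [(r, c) for r in range(nh) for c in range(nw)]


class CropPositionEncoding:
    """Adds a row embedding and a column embedding to every token of a crop."""

    def __init__(self, max_rows, max_cols, dim, seed=0):
        rng = random.Random(seed)
        # Random values stand in for parameters that would be learned in training.
        self.row_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_rows)]
        self.col_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_cols)]

    def __call__(self, tokens, row, col):
        """tokens: list of feature vectors (lists of floats) for one crop."""
        offset = [r + c for r, c in zip(self.row_table[row], self.col_table[col])]
        return [[x + o for x, o in zip(token, offset)] for token in tokens]


encoder = CropPositionEncoding(max_rows=4, max_cols=4, dim=3)
for row, col in crop_positions((2, 3)):
    crop_tokens = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # dummy features for one crop
    encoded = encoder(crop_tokens, row, col)
```

Every token of the same crop receives the same offset, so the language model can tell which cell of the grid a token came from.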

4. Instruction-Tuning

Unified Downstream Tasks

•

Convert every task into an instruction-tuning format

◦

Visual Question Answering (VQA)

▪

"Human: {question} AI: {answer}"

◦

Information Extraction

▪

"Human: What is the value for the {category}? AI: {value}"

▪

Returns "None" when the category is absent

◦

Natural Language Inference (NLI)

▪

Original labels: 1 (Entailed), 0 (Refuted)

▪

Converted example: "Human: {statement}, Yes or No? AI: {answer}"

◦

Image Captioning

▪

Uses 11 prompts from LLaVA (Liu et al., 2023a)

▪

"Human: Provide a brief description of the given image. AI: {caption}"
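
The templates above can be expressed as small formatting functions. This is a minimal sketch: the function names and the dictionary-based record are hypothetical, and only one of the captioning prompts is shown.

```python
def format_vqa(question, answer):
    return f"Human: {question} AI: {answer}"


def format_info_extraction(category, record):
    """record: dict mapping category names to values (hypothetical structure)."""
    value = record.get(category)
    if value is None:
        value = "None"  # the category does not exist in this document
    return f"Human: What is the value for the {category}? AI: {value}"


def format_nli(statement, label):
    """label: 1 (Entailed) or 0 (Refuted), mapped to Yes / No."""
    answer = "Yes" if label == 1 else "No"
    return f"Human: {statement}, Yes or No? AI: {answer}"


def format_caption(caption, prompt="Provide a brief description of the given image."):
    return f"Human: {prompt} AI: {caption}"


print(format_info_extraction("charity name", {"charity name": "Example Trust"}))
print(format_nli("The table lists three teams", 0))
```

Because every task ends up as a single "Human: ... AI: ..." string, datasets from different domains can be mixed in one training set without task-specific heads.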

Auxiliary Tasks

•

Text Reading Task

◦

Aims to strengthen text recognition across diverse domains

◦

Follows the usual reading order (top to bottom, left to right)

◦

Introduces random splitting to keep the model from learning only certain parts of the text

▪

Split the text at some position → the part before it becomes the input, the part after it the target

◦

Two groups of prompts are used:

▪

Read from the beginning: "Human: Recognize text in the image. AI: {all texts}"

▪

Continue reading: "Human: The words on this picture are {left texts}. Continue reading the text. AI: {right texts}"

•

Key Points Generation Task

◦

"Human: Identify some key points in this picture. AI: {key points}"
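
The random split used by the text reading task can be sketched as follows. The word-level split, the uniform choice of split position, and the `continue_prob` parameter are all assumptions made for illustration. The post only states that the text is split at some position and that two prompt groups are used.

```python
import random


def make_text_reading_sample(words, rng, continue_prob=0.5):
    """Build one text reading sample from words already sorted in reading order."""
    text = " ".join(words)
    if len(words) < 2 or rng.random() >= continue_prob:
        # Read from the beginning: the whole text is the target.
        return f"Human: Recognize text in the image. AI: {text}"
    # Continue reading: split at a random position, left is input, right is target.
    cut = rng.randint(1, len(words) - 1)
    left = " ".join(words[:cut])
    right = " ".join(words[cut:])
    return (
        f"Human: The words on this picture are {left}. "
        f"Continue reading the text. AI: {right}"
    )


rng = random.Random(0)
print(make_text_reading_sample(["Total", "amount", "due", "today"], rng))
```

Varying the split position means the target can start anywhere in the text, so later parts of long documents are also seen as prediction targets.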

Instruction Data Resources

•

Documents

◦

DocVQA

◦

InfographicsVQA (InfoVQA)

◦

DeepForm

◦

Kleister Charity (KLC)

•

Tables

◦

WikiTableQuestions (WTQ)

◦

TabFact

•

Charts

◦

ChartQA

•

Natural Images

◦

TextVQA

◦

TextCaps

•

Webpage Screenshots

◦

VisualMRC

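Key Takeaways

•

Uses instruction tuning to build a universal, OCR-free visually-situated language understanding model

•

Trains at low cost by combining datasets from diverse domains, without the large-scale pretraining that earlier approaches required

•

Adds the auxiliary Text Reading and Key Points Generation tasks to strengthen text recognition and semantic understanding

•

Trains with a unified instruction format across diverse domains (documents, tables, charts, natural images, webpages)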

5. Experiments

6. Conclusion

•

A universal, OCR-free visually-situated language understanding model built on MLLMs

•

A dataset in a unified instruction-tuning format

•

Text Reading Task & Key Points Generation Task

•

Shape-Adaptive Cropping Module

•

Achieves state-of-the-art OCR-free performance on 8 of 10 datasets

•

Limitation

◦

multi-page

◦

More efficient ways to encode crops need further research

◦

All local images are decoded in the same way

◦

Research on chain-of-thought (CoT) generation
