DocVLM: Make Your VLM an Efficient Reader

Created by

최윤진

Created at

2025/03/06 10:57 AM

Abstract

•

We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving original weights.

•

Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM.

•

DocVLM is model-agnostic and can be applied to different VLMs (InternVL2, Qwen2-VL, LLaVA-OneVision).

•

Contribution

◦

Proposes a model-agnostic method for integrating OCR information.

◦

Compresses OCR information into 64 queries, reducing the computational load.

◦

Shows performance gains across a range of VLMs (especially in the 448×448 input setting).

◦

Strong performance on multipage documents as well (zero-shot on DUDE, state of the art on MP-DocVQA).

1. Introduction

•

Document understanding with VLMs faces a tension between resolution requirements and computational efficiency.

•

Feeding OCR text directly into the language model prompt loses visual context and layout information, and the resulting long sequences increase latency and cost.

•

Recent VLMs have introduced techniques to reduce the number of image tokens, but these come with a drop in performance.

2. Related Work

•

Document Representation Compression

◦

Methods for compact document representation, such as Q-Former, Resampler, TokenPacker, and DocCompressor.

3. Our Method

3.1 Architecture

•

DocVLM supplements an existing VLM architecture with:

◦

an OCR encoder

▪

DocFormerV2 (the weights do not appear to be publicly released) / T5-based encoder-decoder

▪

The visual branch is not used.

◦

a query compression mechanism that distills this information into a compact representation.

•

Query Compression Mechanism

◦

DocVLM introduces an instruction-aware query compression mechanism to integrate OCR information effectively.

◦

The OCR encoder output is compressed into 64 learned queries, which greatly shortens the LLM input sequence.

◦

The learned queries are randomly initialized based on the distribution of the OCR encoder's embeddings.

◦

Encoder input: OCR embeddings (text + coordinates) + instruction embeddings + learnable queries.

◦

Only the learned-query portion of the encoder output is kept; it is projected to the VLM hidden dimension, combined with the visual tokens, and passed to the LLM.

◦

Shorter input sequences → more efficient processing, and more tokens can be allocated to visual information within a fixed token budget.

•

Full Image Processing with Image Resizing (e.g., Qwen2-VL)

•

Patch-Based Processing with Controlled Tile Count (e.g., InternVL2 [18])

•

Full-Scale Processing with Feature Downsampling (e.g., LLaVA-OneVision [35])
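As a rough illustration of the query compression mechanism described in this section, the NumPy sketch below concatenates OCR embeddings, instruction embeddings, and learnable queries, runs them through an encoder, keeps only the query outputs, projects them to the VLM hidden size, and joins them with the visual tokens. Everything except the count of 64 queries is an assumption made for illustration. The single self-attention layer only stands in for the real OCR encoder. The dimensions, sequence lengths, and random weights are hypothetical, as are the use of per-sample embedding statistics for query initialization and the order in which OCR and visual tokens are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

D_ENC, D_VLM = 32, 48   # hypothetical encoder / VLM hidden sizes
N_QUERIES = 64          # from the notes: OCR info is compressed into 64 queries


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention, a toy stand-in for the OCR encoder."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def compress_ocr(ocr_emb, instr_emb, queries, enc_weights, w_proj):
    """Encode [OCR; instruction; queries] and keep only the query outputs."""
    seq = np.concatenate([ocr_emb, instr_emb, queries], axis=0)
    encoded = self_attention(seq, *enc_weights)
    query_out = encoded[-queries.shape[0]:]   # discard OCR and instruction outputs
    return query_out @ w_proj                 # project to the VLM hidden size


# Hypothetical inputs: 500 OCR tokens (text + layout), 12 instruction tokens.
ocr_emb = rng.normal(size=(500, D_ENC))
instr_emb = rng.normal(size=(12, D_ENC))

# Queries drawn at random from the statistics of the OCR embeddings
# (an assumed reading of "initialized based on the embedding distribution").
queries = rng.normal(ocr_emb.mean(axis=0), ocr_emb.std(axis=0),
                     size=(N_QUERIES, D_ENC))

enc_weights = tuple(rng.normal(scale=D_ENC ** -0.5, size=(D_ENC, D_ENC))
                    for _ in range(3))
w_proj = rng.normal(scale=D_ENC ** -0.5, size=(D_ENC, D_VLM))

ocr_tokens = compress_ocr(ocr_emb, instr_emb, queries, enc_weights, w_proj)

# Hypothetical 256 visual tokens from the (frozen) vision encoder.
visual_tokens = rng.normal(size=(256, D_VLM))
llm_input = np.concatenate([ocr_tokens, visual_tokens], axis=0)

print(ocr_tokens.shape, llm_input.shape)
```

However many OCR tokens the page contains, the LLM sees a fixed 64 extra tokens, which is what frees the rest of the budget for visual tokens.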

3.2 Training Strategy

•

The VLM is kept frozen.

•

Only the newly introduced OCR components are trained:

◦

the learnable queries

◦

the OCR encoder

◦

the projection layer

•

Stage I: OCR-LLM Alignment

◦

Trained on OCR data only, without image input.

▪

A separate OCR engine is required.

◦

Aligns the OCR components with the LLM input space; the shorter sequence length also improves training efficiency.

◦

Uses text-centric datasets.

◦

Initially, only the learnable queries and the projection layer are trained; the OCR encoder stays frozen.

◦

The OCR encoder is then unfrozen and aligned with the rest of the model.

•

Stage II: Vision Alignment

◦

Uses the visual features extracted by the visual encoder together with the OCR information.

◦

The fewer the learnable queries, the more effectively the OCR components complement the visual modality.

◦

Vision-focused datasets are added for training.

◦

The original VLM weights are preserved, but prompt tuning can still introduce bias → training on diverse datasets is needed.
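The freeze/unfreeze schedule described above can be written down as a small table of phases. This is a minimal sketch in plain Python, not the authors' training code. The phase and component names are hypothetical labels, and the assumption that all three OCR components stay trainable in Stage II is inferred from "train only the newly introduced OCR components" rather than stated in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str              # hypothetical label for the phase
    uses_image: bool       # whether visual features are fed to the LLM
    trainable: frozenset   # components that receive gradient updates


OCR_COMPONENTS = frozenset({"learnable_queries", "projection", "ocr_encoder"})
VLM_COMPONENTS = frozenset({"vision_encoder", "llm"})  # always frozen
ALL_COMPONENTS = OCR_COMPONENTS | VLM_COMPONENTS

SCHEDULE = [
    # Stage I, first part: OCR data only, OCR encoder still frozen.
    Phase("stage1_warmup", False, frozenset({"learnable_queries", "projection"})),
    # Stage I, second part: the OCR encoder is unfrozen as well.
    Phase("stage1_full", False, OCR_COMPONENTS),
    # Stage II: visual features are added (trainable set is an assumption).
    Phase("stage2_vision", True, OCR_COMPONENTS),
]


def frozen_components(phase: Phase) -> frozenset:
    """Everything that must not be updated during the given phase."""
    return ALL_COMPONENTS - phase.trainable


for phase in SCHEDULE:
    print(phase.name, "| image input:", phase.uses_image,
          "| frozen:", sorted(frozen_components(phase)))
```

In a real training loop, each phase would translate into enabling or disabling gradients for the corresponding parameter groups.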

3.3 Multipage Document Extension

•

Global encoding / page-wise encoding → learnable queries

•

Benchmarks: DUDE, MP-DocVQA

4. Experiments

4.1 Experimental Setting

•

LLaVA-OneVision / InternVL2 / Qwen2-VL

•

datasets

◦

document understanding (DocVQA, InfoVQA)

▪

https://huggingface.co/datasets/lmms-lab/DocVQA?library=datasets

◦

scene text analysis (ST-VQA, TextVQA, OCR-VQA)

▪

https://huggingface.co/datasets/vikhyatk/st-vqa

▪

https://huggingface.co/datasets/lmms-lab/textvqa

▪

https://huggingface.co/datasets/howard-hou/OCR-VQA

◦

specialized tasks (ChartQA, TextCaps, TAT-DQA)

▪

https://huggingface.co/datasets/lmms-lab/ChartQA

▪

https://huggingface.co/datasets/lmms-lab/TextCaps

▪

https://huggingface.co/datasets/vidore/tatdqa_train

◦

visual-centric datasets (COCO Caption, VQA-V2)

▪

https://huggingface.co/datasets/lmms-lab/COCO-Caption2017

▪

https://huggingface.co/datasets/lmms-lab/VQAv2

•

metric

◦

ANLS - Average Normalized Levenshtein Similarity

◦

https://github.com/shunk031/ANLS

◦

https://arxiv.org/abs/2402.03848
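For reference, ANLS can be computed with a few lines of plain Python. The sketch below follows the usual definition: for each question, take the best score over the accepted answers, where the score is 1 minus the normalized Levenshtein distance, set to 0 when that distance reaches the threshold (0.5 is the commonly used value). The function names and the lowercase/strip normalization are assumptions for illustration; benchmark results should come from the official evaluation scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of str; references: list of lists of accepted answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for answer in answers:
            ans_norm = answer.strip().lower()
            longest = max(len(pred_norm), len(ans_norm))
            nl = levenshtein(pred_norm, ans_norm) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


print(anls(["invoice 2024", "hello"], [["Invoice 2025"], ["world"]]))
```

A near miss such as a single wrong character still earns most of the credit, while an answer that differs in half or more of its characters scores zero.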

4.2 State-of-the-art Comparisons

4.4 Scaling to Multipage Documents

•

Multipage comprehension

5. Ablation Study

•

Impact of OCR Encoding Strategies

•

(1) inserting raw OCR words in the original VLM

•

(2) using DocVLM uncompressed OCR encodings

•

(3) DocVLM compressed OCR encodings with 64 learned queries
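To make the three strategies concrete, the sketch below compares the LLM input length each one implies. Only the 64 learned queries come from the notes; the visual, instruction, and OCR token counts are hypothetical numbers for a single dense page, chosen purely for illustration and not taken from the paper.

```python
N_QUERIES = 64  # from the notes: compressed OCR encodings use 64 learned queries


def llm_input_length(n_visual: int, n_ocr: int, n_instruction: int) -> int:
    """Total number of tokens the LLM processes for one sample."""
    return n_visual + n_ocr + n_instruction


# Hypothetical sizes (illustrative only).
n_visual = 256              # visual tokens from the vision encoder
n_instruction = 20          # tokens in the question / instruction
n_ocr_prompt_tokens = 1200  # (1) raw OCR words tokenized into the prompt
n_ocr_encodings = 1000      # (2) uncompressed OCR encoder outputs

lengths = {
    "raw_ocr_words": llm_input_length(n_visual, n_ocr_prompt_tokens, n_instruction),
    "uncompressed_encodings": llm_input_length(n_visual, n_ocr_encodings, n_instruction),
    "compressed_64_queries": llm_input_length(n_visual, N_QUERIES, n_instruction),
}

for name, length in lengths.items():
    print(f"{name}: {length} tokens")
```

With the first two strategies the sequence grows with the amount of text on the page, whereas the compressed variant adds a constant 64 tokens regardless of page density.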

6. Conclusions

•

DocVLM integrates effectively with a variety of VLMs and strengthens their document reading performance.

•

Performance can be improved without using a large number of vision tokens.

•

Under a limited token budget, allocating some tokens to OCR information outperforms spending them all on visual processing.

•

DocVLM's compression mechanism is effective not only for single-page documents but also for multipage documents.

•

Achieves the best performance on the MP-DocVQA benchmark with 64 tokens.

•

Positions itself as a practical document understanding solution that improves computational efficiency in real-world settings.
