# LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding

- _Paper title: LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding_

- **_Link_**_ :_ [https://arxiv.org/abs/2202.13669](https://arxiv.org/abs/2202.13669)

- **_Publication date_**_ : 2022.02_

- **_Venue_**_ : ACL_

- **_Authors_**_ : Jiapeng Wang, Lianwen Jin, and Kai Ding_

- **_Affiliations_**_ :_

    - South China University of Technology, Guangzhou, China

    - IntSig Information Co., Ltd, Shanghai, China

    - INTSIG-SCUT Joint Laboratory of Document Recognition and Understanding, China

    - Peng Cheng Laboratory, Shenzhen, China

- **_Citations_**_ : 117_

- **_Code_**_ :_

    - [https://github.com/jpWang/LiLT](https://github.com/jpWang/LiLT)

    - [https://huggingface.co/docs/transformers/main/model_doc/lilt](https://huggingface.co/docs/transformers/main/model_doc/lilt)

---

# Abstract

- Motivation: existing Structured Document Understanding (SDU) models are largely English-only.
- → Contribution: an SDU model that works across languages (multilingual).

    - The paper does not explicitly mention the Document Layout Analysis (DLA) task.

    - It discusses only Semantic Entity Recognition (SER) and Relation Extraction (RE).

    - _Paragraph-level SER appears to be equivalent to the DLA task._

    - [https://huggingface.co/pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384](https://huggingface.co/pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384)

**LiLT**

- LiLT takes OCR results as input and performs SDU (SER, RE) in a language-independent way.

- It is pre-trained on documents in a single language.

- It is then fine-tuned on other languages.

- Architecture

    - A novel bi-directional attention complementation mechanism (BiACM)

    - Uses a monolingual or multilingual pre-trained textual model (e.g., RoBERTa, InfoXLM)

- Pre-training tasks

    - Masked visual-language modeling (MVLM)

    - Key point location (KPL)

    - Cross-modal alignment identification (CAI)
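
The BiACM item above can be written out as a score-fusion rule. The block below is a sketch based on the paper's description, not on these notes, so the notation is an assumption: $\alpha^{T}_{ij}$ and $\alpha^{L}_{ij}$ stand for the pre-softmax attention scores of the text flow and the layout flow between tokens $i$ and $j$, and $\mathrm{DETACH}$ means the gradient is not propagated through that term.

$$
\tilde{\alpha}^{T}_{ij} = \alpha^{L}_{ij} + \alpha^{T}_{ij},
\qquad
\tilde{\alpha}^{L}_{ij} =
\begin{cases}
\alpha^{L}_{ij} + \mathrm{DETACH}\left(\alpha^{T}_{ij}\right) & \text{during pre-training} \\
\alpha^{L}_{ij} + \alpha^{T}_{ij} & \text{during fine-tuning}
\end{cases}
$$

Each flow then applies its usual softmax to its own fused scores. Detaching the text scores during pre-training is intended to keep the layout objective from disturbing the textual model, which helps the layout flow stay reusable with a different textual model at fine-tuning time.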

---

# HuggingFace LiLT Implementation

```
from transformers import LiltModel

```

```
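# Excerpt from the Hugging Face transformers source (modeling_lilt.py),
# shown for reference; it is not runnable on its own.
@add_start_docstrings(
    "The bare LiLT Model transformer outputting raw hidden-states without any specific head on top.",
    LILT_START_DOCSTRING,
)
class LiltModel(LiltPreTrainedModel):
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)
        self.config = config

        # Text flow: token embeddings
        self.embeddings = LiltTextEmbeddings(config)
        # Layout flow: embeddings of each token's bounding box
        self.layout_embeddings = LiltLayoutEmbeddings(config)
        # Encoder layers that process both flows
        self.encoder = LiltEncoder(config)

        # Optional pooler on top of the encoder output
        self.pooler = LiltPooler(config) if add_pooling_layer else None

        # Initialize weights and apply final processing
        self.post_init()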

```
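
The layout embeddings in the excerpt above consume one bounding box per token. As with LayoutLM, the Hugging Face documentation expects boxes in `(x0, y0, x1, y1)` format normalized to a 0-1000 scale. The following is a minimal sketch of that preprocessing step, assuming pixel-space OCR boxes and a known page size. The helper `normalize_bbox` and the sample `ocr_words` are hypothetical and are not part of transformers.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def normalize_bbox(box: Box, page_width: int, page_height: int, scale: int = 1000) -> List[int]:
    """Rescale a pixel-space box to integers in [0, scale] (hypothetical helper)."""
    x0, y0, x1, y1 = box

    def clamp(value: int) -> int:
        # Keep coordinates inside the valid range even if OCR overshoots the page.
        return max(0, min(scale, value))

    return [
        clamp(scale * x0 // page_width),
        clamp(scale * y0 // page_height),
        clamp(scale * x1 // page_width),
        clamp(scale * y1 // page_height),
    ]


# Made-up OCR output: (word, pixel box) pairs on an 850 x 1100 pixel page.
ocr_words = [
    ("Invoice", (85, 110, 255, 154)),
    ("Total:", (425, 550, 510, 583)),
]
boxes = [normalize_bbox(box, 850, 1100) for _, box in ocr_words]
print(boxes)
```

The resulting lists would be passed as the `bbox` input alongside the tokenized words, with each word's box repeated for every sub-word token it is split into.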

- LiLT fine-tuning code

    - [https://github.com/karndeepsingh/Extract_key_information_Document_understanding/blob/main/Finetuning%20LiLT%20Model%20for%20Information%20Extraction%20from%20Document%20Images%20and%20PDF.ipynb](https://github.com/karndeepsingh/Extract_key_information_Document_understanding/blob/main/Finetuning%20LiLT%20Model%20for%20Information%20Extraction%20from%20Document%20Images%20and%20PDF.ipynb)

---

# Additional Dataset Notes

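- FUNSD

    - Purpose: an English dataset for form understanding in noisy scanned documents

    - Composition:

        - `199` real scanned forms in total

        - `9,707 semantic entities annotated` over 31,485 words

        - Split into 149 training forms and 50 test forms

    - Task: Semantic Entity Recognition (SER), assigning each word one of four predefined category labels

        - Categories: question, answer, header, other

    - Note: the official OCR annotations are used directly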

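- CORD (an English dataset for key information extraction from receipts)

    - Composition: 800 training, 100 validation, and 100 test receipts

    - Each receipt comes with a photo and a list of OCR annotations

    - `30 fields` defined under 4 categories

    - Task: assign the correct field label to each word

    - The official OCR annotations are used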

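- EPHOIE (Chinese examination papers)

    - Consists of real examination papers with diverse text types and layout distributions

    - Composition: 1,183 training images and 311 test images (1,494 in total)

    - 10 entity categories are defined

    - Evaluation metric: entity-level F1 score (for the RoBERTa, LayoutXLM, and LiLT models)

    - The official OCR annotations are used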

- RVL-CDIP

    - An English document classification dataset

    - Consists of 400,000 grayscale English document images

    - Text and layout information are extracted with the TextIn API

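- XFUND

    - Overview: XFUND is a dataset for multilingual form understanding.

    - Composition:

        - 1,393 fully annotated forms in total

        - 7 languages: Chinese (ZH), Japanese (JA), Spanish (ES), French (FR), Italian (IT), German (DE), Portuguese (PT)

        - 199 forms per language (149 for training, 50 for testing)

    - Main tasks:

        - Semantic Entity Recognition (SER)

        - Relation Extraction (RE): predict the relation between two given semantic entities (mainly key-value relation extraction)

    - Evaluation:

        - The official OCR results are used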
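
The entity-level F1 score mentioned for the SER benchmarks above can be sketched in a few lines. This is an illustrative sketch, not the evaluation script used in the paper. It assumes word labels are encoded with a BIO scheme (for example `B-QUESTION`, `I-QUESTION`, and `O` for the "other" category), which the notes do not specify, and it counts a predicted entity as correct only when its type and both boundaries match the gold entity exactly. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed label scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith(("B-", "I-")):
            # A stray "I-" tag is treated leniently as the start of a new entity.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching entity spans."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the predicted ANSWER entity is one word too short.
gold = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER", "O", "B-HEADER"]
pred = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "O", "O", "B-HEADER"]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example two of the three predicted entities match the gold spans exactly, so a partially correct entity earns no credit, which is what separates entity-level scoring from word-level accuracy.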
