NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Created by

Eunyoung Lee

Created at

2024/12/12 5:03 PM

Article Overview

•

Paper title: NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

•

Link: https://arxiv.org/pdf/2405.17428

•

Published: 2024.05

•

Venue: -

•

Authors: Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

•

Affiliation: NVIDIA

•

Citations: 31

•

Model: https://huggingface.co/nvidia/NV-Embed-v2

Abstract

•

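NV-Embed combines several architectural designs and training procedures to substantially improve the performance of an LLM as a general-purpose embedding model, while keeping the simplicity and reproducibility of the LLM

•

Model architecture: latent attention layer

◦

Produces pooled embeddings

◦

Improves retrieval and downstream task accuracy compared with mean/EOS pooling

◦

To improve representation learning, the LLM's causal attention mask is removed during contrastive learning

•

Model training: two-stage contrastive instruction-tuning

•

Ranked No. 1 on the MTEB leaderboard

◦

Uses only publicly available data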

Introduction

•

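Prior work showed that decoder-only LLMs can outperform bidirectional embedding models, but it fine-tuned the LLM on a large volume of synthetic data generated by GPT-4, so the approach is not available to the general public

•

NV-Embed contributions

◦

latent attention layer

▪

Simpler and more effective than LLM2Vec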

▪

Does not need the additional training phase with a mixed training objective that related work requires

◦

two-stage contrastive instruction-tuning

▪

Base model: Mistral-7B

Stage 1: contrastive learning with instructions on retrieval datasets, using in-batch negatives and curated hard negative examples

Stage 2: various non-retrieval datasets are blended into instruction tuning, which improves accuracy on non-retrieval tasks and also improves retrieval performance

•

Because in-batch negative samples can be misleading for non-retrieval tasks, in-batch negative training is turned off in stage 2

Related Work

•

Decoder-only LLMs were long believed to underperform bidirectional models on general-purpose embedding tasks

◦

Unidirectional attention limits representation learning

◦

Scaling up LLMs produces very high-dimensional embeddings, which may suffer from the curse of dimensionality (distance and similarity computations can become inefficient or inaccurate)

•

Examples of turning a decoder into an embedding model

◦

text-embedding-3-large: continued contrastive training on a pretrained GPT-3 model

◦

E5-Mistral: contrastive learning with task-specific instructions, using synthetic data from the proprietary GPT-4

◦

LLM2Vec: trains the LLM with publicly available data

◦

Gecko: distills a decoder-only LLM into a smaller bidirectional embedding model

◦

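GritLM: unifies text embedding and text generation in a single model

•

SFR-Embedding-Mistral: fine-tunes on a blend of non-retrieval and retrieval datasets, similar to NV-Embed

◦

Difference 1: NV-Embed uses only public data and no synthetic data

◦

Difference 2: the batching method differs. SFR-Embedding-Mistral uses task-homogeneous batching, whereas NV-Embed mixes samples from different tasks in each batch to avoid potential “zigzag” gradient updates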
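
The two batching strategies in Difference 2 can be contrasted with a small sketch. This is only an illustration under assumptions: the function names, the toy task labels, and the use of Python's `random` module are hypothetical, and the post does not describe the actual samplers used by either model.

```python
import random
from collections import defaultdict


def task_homogeneous_batches(samples, batch_size, seed=0):
    """Every batch contains samples from exactly one task."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, text in samples:
        by_task[task].append((task, text))
    batches = []
    for items in by_task.values():
        rng.shuffle(items)
        batches += [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    rng.shuffle(batches)  # shuffle the order of batches, not their contents
    return batches


def mixed_task_batches(samples, batch_size, seed=0):
    """Samples from all tasks are shuffled together before batching."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# Toy data: three hypothetical tasks with eight samples each.
samples = [(task, f"{task}-{i}")
           for task in ("retrieval", "classification", "clustering")
           for i in range(8)]

homogeneous = task_homogeneous_batches(samples, batch_size=4)
mixed = mixed_task_batches(samples, batch_size=4)

for batch in mixed[:2]:
    print(sorted({task for task, _ in batch}))
```

With task-homogeneous batches, consecutive gradient steps each see a single task, which is the situation the post associates with "zigzag" updates. Mixed batches expose each step to several tasks at once.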

Method

Bidirectional Attention

•

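The causal mask in decoder blocks lets the decoder attend only to previous positions during auto-regressive text generation, which prevents information leakage

•

Unidirectional attention limits the model's representation power (GPT models underperform similarly sized BERT or T5 models on natural language understanding)

•

LLM2Vec: learns bidirectional attention through masked token prediction

•

NV-Embed: removes the causal attention mask during contrastive learning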
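
To make the difference between the two masks concrete, here is a minimal NumPy sketch. The helper name `attention_weights` and the toy 4-token scores are assumptions for illustration; it is not the NV-Embed implementation.

```python
import numpy as np


def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores, optionally with a causal mask."""
    seq_len = scores.shape[-1]
    if causal:
        # Entries above the diagonal correspond to future tokens; hide them.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # toy scores for a 4-token sequence

causal_w = attention_weights(scores, causal=True)   # lower-triangular weights
bidir_w = attention_weights(scores, causal=False)   # every token sees every token

print(np.round(causal_w, 2))
print(np.round(bidir_w, 2))
```

Under the causal mask the first token can only attend to itself, while without the mask every row spreads weight over the whole sequence, which is what lets each token representation use both left and right context.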

Latent Attention Layer

•

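Bidirectional models typically use mean pooling, while embedding models based on decoder-only LLMs typically use the <EOS> token embedding

◦

Limitation of mean pooling: important information from key phrases gets diluted

◦

Limitation of <EOS> pooling: recency bias (heavy reliance on the embedding of the last token)

•

latent attention layer

•

The decoder's last-layer hidden states are defined as the query (Q)

•

$Q \in \mathbb{R}^{l \times d}$, where $l$ is the sequence length and $d$ is the hidden dimension

•

Q attends to a latent array $K = V \in \mathbb{R}^{r \times d}$

•

K and V form a trainable dictionary used to obtain better representations, and $r$ is the number of latents in the dictionary

•

The cross-attention output is $O \in \mathbb{R}^{l \times d}$, with $O = \mathrm{softmax}(QK^T)V$

•

O then passes through a regular MLP made of two linear transformations with a GELU activation in between

•

The latent attention layer uses 512 latents ($r$) and multi-head attention with 8 heads

•

Mean pooling is applied after the MLP layer to obtain the embedding of the whole sequence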
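
The steps above can be sketched in NumPy as follows. This is a simplified, single-head version under assumptions: the function and weight names are hypothetical, the weights are random rather than trained, and the sizes are toy values (the post reports 512 latents, 8 heads, and a hidden dimension of 4096).

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def latent_attention_pool(Q, latents, W1, b1, W2, b2):
    """Q: (l, d) last-layer hidden states; latents: (r, d) used as both K and V."""
    attn = softmax(Q @ latents.T, axis=-1)  # (l, r): each token attends to the latents
    O = attn @ latents                      # (l, d): O = softmax(Q K^T) V
    hidden = gelu(O @ W1 + b1)              # first linear transformation + GELU
    out = hidden @ W2 + b2                  # second linear transformation, (l, d)
    return out.mean(axis=0)                 # mean pooling over the sequence, (d,)


rng = np.random.default_rng(0)
l, d, r = 6, 16, 8  # toy sizes for illustration only
Q = rng.normal(size=(l, d))
latents = rng.normal(size=(r, d))
W1, b1 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.normal(size=(d, d)) * 0.1, np.zeros(d)

embedding = latent_attention_pool(Q, latents, W1, b1, W2, b2)
print(embedding.shape)  # (16,)
```

The output has one vector per sequence regardless of the sequence length `l`, because the final mean is taken over the token axis.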

Two-stage Instruction-Tuning

•

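Instruction tuning: used to train LLMs to follow instructions and to perform retrieval-augmented generation (RAG)

•

in-batch negatives: an issue when building a general embedding that does well on both retrieval and non-retrieval tasks

◦

Effective for training dense-embedding-based retrievers

◦

Can be misleading for classification or clustering, because passages in the same mini-batch may come from the same class and are then not true negatives

•

two-stage instruction tuning

Stage 1: contrastive training with instructions on a variety of retrieval datasets, using in-batch negatives and curated hard-negative examples

Stage 2: contrastive instruction-tuning on a blend of retrieval and non-retrieval datasets, without applying the in-batch negative trick

◦

Because retrieval is harder than the other tasks, training focuses on retrieval first and the other tasks are integrated in stage 2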
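
The difference between the two stages can be illustrated with an InfoNCE-style contrastive loss in which in-batch negatives are switched on or off. This is a sketch under assumptions, not the paper's training code: the function name, the temperature value, and the choice to use other queries' positives as in-batch negatives are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_loss(q, pos, hard_neg, use_in_batch_negatives, temperature=0.05):
    """q: (B, d) queries, pos: (B, d) positives, hard_neg: (B, N, d) hard negatives."""
    q, pos, hard_neg = l2_normalize(q), l2_normalize(pos), l2_normalize(hard_neg)
    pos_score = np.sum(q * pos, axis=-1, keepdims=True)  # (B, 1)
    hard_score = np.einsum("bd,bnd->bn", q, hard_neg)    # (B, N)
    logits = [pos_score, hard_score]
    if use_in_batch_negatives:
        # Positives of the other queries in the batch act as extra negatives.
        in_batch = q @ pos.T                              # (B, B)
        own = np.eye(q.shape[0], dtype=bool)
        in_batch = np.where(own, -np.inf, in_batch)       # skip each query's own positive
        logits.append(in_batch)
    logits = np.concatenate(logits, axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())


rng = np.random.default_rng(0)
B, N, d = 4, 7, 8  # toy batch: 4 queries, 7 hard negatives each
q = rng.normal(size=(B, d))
pos = q + 0.1 * rng.normal(size=(B, d))
hard_neg = rng.normal(size=(B, N, d))

stage1_loss = contrastive_loss(q, pos, hard_neg, use_in_batch_negatives=True)
stage2_loss = contrastive_loss(q, pos, hard_neg, use_in_batch_negatives=False)
print(stage1_loss, stage2_loss)
```

With in-batch negatives enabled, every other passage in the batch is pushed away from the query. That is harmless for retrieval but can penalize passages of the same class in classification or clustering data, which is the motivation the post gives for turning the option off in stage 2.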

Training Data

•

$q^{+}_{\text{inst}} = \text{Instruct: task\_definition Query: } q^{+}$

•

Public retrieval datasets

◦

These datasets do not come with hard negatives, so an encoder-based embedding model is fine-tuned and used to select hard negatives

•

Public non-retrieval datasets

◦

Three sub-tasks from the MTEB benchmark: classification, clustering, and semantic similarity

◦

Preprocessed into query, positive document, and hard negative documents for contrastive training

◦

For STS datasets, hard negatives are mined with BM25
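
The instruction template and the (query, positive document, hard negative documents) format described above could be represented as in the sketch below. The names `build_instructed_query` and `TrainingExample`, the single-space separator, and the example task definition and texts are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


def build_instructed_query(task_definition: str, query: str) -> str:
    """Prefix a query with its task instruction, following the template above."""
    # The separator between the instruction and the query is assumed to be a space.
    return f"Instruct: {task_definition} Query: {query}"


@dataclass
class TrainingExample:
    """One contrastive training example: query, positive, and hard negatives."""
    query: str
    positive: str
    hard_negatives: List[str] = field(default_factory=list)


example = TrainingExample(
    query=build_instructed_query(
        "Given a question, retrieve passages that answer the question",
        "what does a latent attention layer do?",
    ),
    positive="A latent attention layer pools token states through a trainable dictionary.",
    hard_negatives=["Mean pooling averages all token embeddings."],
)

print(example.query)
```

Non-retrieval data such as classification or clustering can be cast into the same structure, which is what allows it to be mixed with retrieval data in stage 2.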

Experiments

•

Experiment Details

◦

Fine-tuned with LoRA, a parameter-efficient fine-tuning (PEFT) method

◦

Base model: Mistral 7B

◦

The attention mask is switched from causal to bidirectional

◦

Latent attention with 512 latents, a hidden dimension size of 4096, and 8 attention heads

◦

LoRA: rank 16, alpha 32, dropout rate 0.1

◦

Adam optimizer, 500 warm-up steps, 2e-5 learning rate

◦

Batch size 128, where each example in a batch is a query paired with 1 positive and 7 hard negative documents

◦

bfloat16 precision, maximum sequence length of 512

•

MTEB Results

•

Ablation Study

◦

causal attention vs. bidirectional attention

▪

The bidirectional mask scores higher than the causal mask

◦

pooling methods

▪

Compares last-token, mean, latent-attention, and self-attention pooling

▪

Mean pooling scores higher than last-token pooling

•

Last-token pooling is affected by recency bias and depends heavily on the last token

▪

Adding a self-attention layer does not meaningfully change the LLM's embedding quality

•

The LLM already has many self-attention layers, so adding one more does not improve performance

▪

The latent-attention layer improves performance

•

Dictionary learning enables richer representations

•

It keeps important information from being diluted or lost when the output embeddings are averaged
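
For reference, the settings listed under Experiment Details can be gathered into a single configuration object. This is a hypothetical container for illustration, not a configuration file from the paper or from any library. The field names are assumptions, and the derived values follow from simple arithmetic on the listed numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    base_model: str = "Mistral 7B"
    attention_mask: str = "bidirectional"
    num_latents: int = 512
    hidden_dim: int = 4096
    num_heads: int = 8
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    optimizer: str = "Adam"
    warmup_steps: int = 500
    learning_rate: float = 2e-5
    batch_size: int = 128
    positives_per_query: int = 1
    hard_negatives_per_query: int = 7
    dtype: str = "bfloat16"
    max_seq_len: int = 512

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scaling factor alpha / rank: 32 / 16 = 2.0
        return self.lora_alpha / self.lora_rank

    @property
    def documents_per_batch(self) -> int:
        # 128 queries x (1 positive + 7 hard negatives) = 1024 documents
        return self.batch_size * (self.positives_per_query + self.hard_negatives_per_query)

    @property
    def head_dim(self) -> int:
        # 4096 hidden dimensions split over 8 heads = 512 per head
        return self.hidden_dim // self.num_heads


config = FineTuneConfig()
print(config.lora_scaling, config.documents_per_batch, config.head_dim)
```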

Conclusion

•

latent attention layer

•

Removal of the causal attention mask

•

two-stage contrastive learning

•

Top scores on MTEB and BEIR
