
Autoregressive Direct Preference Optimization

Created by
  • Haebom

Authors

Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

💡 Overview

๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ์กด ์ง์ ‘ ์„ ํ˜ธ๋„ ์ตœ์ ํ™”(DPO) ๋ฐฉ๋ฒ•๋ก ์˜ ํ•œ๊ณ„๋ฅผ ์ง€์ ํ•˜๋ฉฐ, ์‘๋‹ต ์ˆ˜์ค€์˜ Bradley-Terry(BT) ๋ชจ๋ธ์ด ์•”๋ฌต์ ์œผ๋กœ๋งŒ ์ž๊ธฐํšŒ๊ท€์ ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•œ ์ ์„ ๊ฐœ์„ ํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ œ์•ˆ๋œ Autoregressive DPO(ADPO)๋Š” BT ๋ชจ๋ธ ์ ์šฉ ์ „์— ๋ช…์‹œ์ ์œผ๋กœ ์ž๊ธฐํšŒ๊ท€ ๊ฐ€์ •์„ ํ†ตํ•ฉํ•˜์—ฌ, DPO ๋ชฉํ‘œ ํ•จ์ˆ˜์˜ ๋กœ๊ทธ-์‹œ๊ทธ๋ชจ์ด๋“œ ์™ธ๋ถ€๋กœ ํ•ฉ์‚ฐ ์—ฐ์‚ฐ์„ ์ด๋™์‹œํ‚ค๋Š” ์ƒˆ๋กœ์šด ์†์‹ค ํ•จ์ˆ˜ ํ˜•ํƒœ๋ฅผ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค. ADPO๋Š” ์ด๋ก ์  ๋ถ„์„์„ ํ†ตํ•ด ํ† ํฐ ๊ธธ์ด $\mu$์™€ ํ”ผ๋“œ๋ฐฑ ๊ธธ์ด $\mu'$๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๊ธธ์ด ์ฒ™๋„๋ฅผ ๋ช…ํ™•ํžˆ ๊ตฌ๋ถ„ํ•˜๊ณ  LLM ์„ ํ˜ธ๋„ ์ตœ์ ํ™”์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.

🔑 Implications and Limitations

• Extends the theoretical foundation of DPO by modeling the autoregressive property explicitly, suggesting a path to better preference alignment for LLMs.
• Shows theoretically that distinguishing token length from feedback length has a significant effect on preference optimization when designing DPO algorithms.
• Further experiments and analysis are needed to verify how much the proposed ADPO improves performance on real LLMs.