
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Written by
  • Haebom

Authors

Hongbo Jin, Rongpeng Zhu, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding

💡 Overview

This paper proposes Distribution Guided Policy Optimization (DGPO), a new method for improving the complex reasoning ability of large language models (LLMs) trained with reinforcement learning. It targets the limitations of existing sequence-level credit assignment, which gives every token in a response the same credit. DGPO replaces the KL divergence penalty with the Hellinger distance to guide token-level exploration safely, and uses an entropy gating mechanism that accounts for uncertainty to identify genuine reasoning steps. This yields fine-grained credit redistribution with no increase in computational cost, and state-of-the-art performance without an additional value network.
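To make the two ingredients named above concrete, here is a minimal sketch using only textbook definitions of the Hellinger distance and Shannon entropy. The function `token_weights`, its threshold parameter, and the `1 + gate * distance` weighting rule are hypothetical illustrations, not the formulation from the paper, which this summary does not reproduce.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions; always in [0, 1]."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

def kl(p, q):
    """KL(p || q); infinite when q puts zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in p if a > 0.0)

def token_weights(policy_dists, ref_dists, entropy_threshold):
    """Hypothetical per-token credit weights (an assumption, not the paper's rule).

    A token's weight grows with the Hellinger distance between the policy and
    reference distributions, but only when the policy entropy passes the gate.
    Weights are rescaled to mean 1, so credit is redistributed, not added.
    """
    raw = []
    for p, q in zip(policy_dists, ref_dists):
        gate = 1.0 if entropy(p) >= entropy_threshold else 0.0
        raw.append(1.0 + gate * hellinger(p, q))
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Disjoint supports: KL diverges while the Hellinger distance stays bounded.
print(kl([1.0, 0.0], [0.0, 1.0]), hellinger([1.0, 0.0], [0.0, 1.0]))

# One confident token and one uncertain token (made-up distributions).
policy = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]
reference = [[0.9, 0.05, 0.05], [0.1, 0.45, 0.45]]
print(token_weights(policy, reference, entropy_threshold=0.5))
```

In this sketch, a sequence-level advantage would be multiplied by each token's weight. Because the weights average to 1, the total credit over the sequence is unchanged while uncertain, high-divergence tokens receive a larger share. The boundedness of the Hellinger distance is one plausible reason it behaves more stably than a KL penalty.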

🔑 Implications and Limitations

• By identifying and rewarding the important steps in an LLM's long reasoning process, the method can substantially improve reasoning quality.
• It addresses the instability and overly conservative exploration caused by the conventional KL divergence penalty, raising the chance of discovering more diverse and effective reasoning paths.
• The work achieves state-of-the-art performance within a critic-free reinforcement learning framework, but practical use still requires further validation of how well it generalizes across different LLM architectures and complex reasoning tasks.