
Difficulty-Estimated Policy Optimization

Created by
  • Haebom

Authors

Yu Zhao, Fan Jiang, Tianle Liu, Bo Zeng, Yu Liu, Longyue Wang, Weihua Luo

💡 Overview

This paper proposes Difficulty-Estimated Policy Optimization (DEPO), a new framework that addresses the decay of the gradient signal during training of large reasoning models (LRMs). DEPO uses an online difficulty estimator to concentrate compute on samples with high learning potential, cutting rollout cost by up to a factor of two while maintaining model performance. This lowers the computational burden of training high-performance reasoning models and points to a sustainable path for scaling reasoning.
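The filtering idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the estimator is a simple moving average of each prompt's observed pass rate (the paper's estimator is not described in this summary), the names `OnlineDifficultyEstimator`, `select_prompts`, and `group_advantages` are hypothetical, and the advantage is assumed to be group-relative (reward minus the group mean). Under that assumption, a prompt whose rollouts all succeed or all fail yields zero advantages, so spending rollouts on it produces no gradient signal.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineDifficultyEstimator:
    """Hypothetical stand-in: moving average of each prompt's pass rate."""
    momentum: float = 0.5   # weight kept on the previous estimate (assumed value)
    prior: float = 0.5      # estimate for prompts never seen before (assumed value)
    pass_rate: dict = field(default_factory=dict)

    def predict(self, prompt_id):
        # Unseen prompts fall back to the prior so they are still explored.
        return self.pass_rate.get(prompt_id, self.prior)

    def update(self, prompt_id, observed_pass_rate):
        old = self.predict(prompt_id)
        self.pass_rate[prompt_id] = (
            self.momentum * old + (1.0 - self.momentum) * observed_pass_rate
        )


def group_advantages(rewards):
    """Group-relative advantage (assumption): reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def select_prompts(estimator, prompt_ids, low=0.1, high=0.9):
    """Keep prompts predicted to be neither trivially easy nor hopeless.

    The thresholds `low` and `high` are illustrative, not values from the paper.
    """
    return [p for p in prompt_ids if low <= estimator.predict(p) <= high]
```

In a training loop of this shape, only the prompts returned by `select_prompts` would be sent for rollouts, and `update` would be called afterwards with the pass rate observed in each group.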

🔑 Implications and Limitations

• Dynamically assessing the learning potential of training data and filtering on it can substantially improve computational efficiency.
• Lower compute cost for training reasoning models makes high-performance model development more accessible.
• The accuracy and generalization of the online difficulty estimator may affect DEPO's overall effectiveness.