Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Posted by
  • Haebom

Authors

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Changqian Yu, Kun Gai, Xueqian Wang

💡 Overview

This paper proposes GCPO (Group Chunking Policy Optimization), a new approach for text-to-image generation that addresses the velocity estimation error problem of the existing flow matching methodology (GRPO). GCPO groups consecutive steps into chunks and shifts the unit of policy optimization to the chunk level. By optimizing the flow matching policy chunk by chunk, GCPO effectively mitigates the negative impact of velocity estimation errors and shows up to a 43% improvement over GRPO in text-to-image generation performance and preference alignment.
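To make the idea of moving the optimization unit from single steps to chunks more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The fixed `chunk_size`, the PPO-style clipping with `eps`, the group-standardized advantage, and all function names are assumptions chosen for illustration; the summary above does not specify how chunks are formed or how the objective is written.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style advantage (assumed form): standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def chunk_log_ratios(new_logps, old_logps, chunk_size):
    """Sum per-step log-probability differences over consecutive steps in each chunk.

    Fixed-size chunks are an assumption; the last chunk may be shorter.
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return [sum(diffs[i:i + chunk_size]) for i in range(0, len(diffs), chunk_size)]


def chunk_clipped_objective(new_logps, old_logps, advantage, chunk_size, eps=0.2):
    """Clipped surrogate with one importance ratio per chunk instead of one per step."""
    terms = []
    for log_ratio in chunk_log_ratios(new_logps, old_logps, chunk_size):
        ratio = math.exp(log_ratio)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * advantage, clipped * advantage))
    return mean(terms)


# Hypothetical numbers: four sampled images in a group, four sampling steps each.
rewards = [0.2, 0.8, 0.5, 0.9]
advantages = group_advantages(rewards)
old_logps = [-1.0, -1.2, -0.9, -1.1]
new_logps = [-0.9, -1.2, -1.0, -1.0]
objective = chunk_clipped_objective(new_logps, old_logps, advantages[1], chunk_size=2)
```

Setting `chunk_size=1` in this sketch recovers a step-wise objective, which makes it easy to compare the two granularities on the same trajectories.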

🔑 Implications and Limitations

• Demonstrates the validity of the chunk-level approach as an effective reinforcement learning policy optimization methodology for improving the performance of flow-matching-based text-to-image generation models.
• By mitigating the velocity estimation error problem that arises in existing step-wise optimization, it can improve the quality of generated outputs.
• Further research is needed on the generalization performance of the proposed GCPO and its applicability to a variety of generative models.