Segment-Aligned Policy Optimization for Multi-Modal Reasoning

Posted by
  • Haebom

Authors

Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li

💡 Overview

This paper proposes Segment-Aligned Policy Optimization (SAPO), a new approach to the problems that arise when policy optimization for multi-modal reasoning with large language models (LLMs) is carried out at the level of individual tokens or of the entire response sequence. SAPO treats the reasoning process as a naturally step-by-step structure and uses these "reasoning segments" as the basic unit of policy updates. According to the authors, this yields more stable and effective training than existing approaches, with substantial accuracy gains on representative reasoning benchmarks.

🔑 Implications and Limitations

  • The results suggest that matching the unit of reinforcement learning updates to the inherent step-by-step structure of the reasoning process is very important for improving multi-modal reasoning performance.
  • Through segment-level value estimation and advantage computation, SAPO provides better training stability and consistency than existing token-level or sequence-level optimization.
  • The work points to a new direction for efficient, semantically grounded policy optimization in complex reasoning tasks.
  • Further research is needed to apply SAPO to a wider range of reasoning tasks and model architectures, and to develop methods for detecting segment boundaries automatically.
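
To make the idea of segment-level advantage computation concrete, here is a minimal Python sketch. It is not the paper's algorithm, since this summary does not give SAPO's formulas. It assumes a simple one-step temporal-difference advantage per segment, with the value after the last segment taken as zero. The function names, the `gamma` parameter, and the example numbers are all hypothetical.

```python
from typing import List


def segment_advantages(segment_rewards: List[float],
                       segment_values: List[float],
                       gamma: float = 1.0) -> List[float]:
    """One-step TD advantage per segment: A_k = r_k + gamma * V_{k+1} - V_k.

    Assumption for illustration only: the value after the final
    segment is 0 because the response ends there.
    """
    advantages = []
    for k, (reward, value) in enumerate(zip(segment_rewards, segment_values)):
        has_next = k + 1 < len(segment_values)
        next_value = segment_values[k + 1] if has_next else 0.0
        advantages.append(reward + gamma * next_value - value)
    return advantages


def broadcast_to_tokens(advantages: List[float],
                        segment_lengths: List[int]) -> List[float]:
    """Give every token the advantage of the segment it belongs to."""
    per_token: List[float] = []
    for adv, length in zip(advantages, segment_lengths):
        per_token.extend([adv] * length)
    return per_token


# Hypothetical response with three reasoning segments of 2, 1 and 3 tokens;
# only the last segment receives a reward (e.g. a correct final answer).
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.25, 0.75]
seg_adv = segment_advantages(rewards, values)
tok_adv = broadcast_to_tokens(seg_adv, [2, 1, 3])
print(seg_adv)  # [-0.25, 0.5, 0.25]
print(tok_adv)  # [-0.25, -0.25, 0.5, 0.25, 0.25, 0.25]
```

In this sketch every token in a segment shares a single advantage value, so the update granularity sits between per-token credit (one value per token) and per-sequence credit (one value for the whole response).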