Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Posted by
  • Haebom
Authors

Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An

💡 Overview

Existing group-based reinforcement learning (RL) methods rely on coarse, trajectory-level credit assignment driven by the final outcome, which makes it hard to identify how much each individual step contributed. To overcome this limitation, the paper proposes GraphGPO, which aggregates all rollout trajectories into a unified state-transition graph and uses the global information encoded in that graph to estimate the distance from each state to the goal. GraphGPO then estimates a graph-based advantage and assigns credit to each transition (edge), which improves training efficiency and achieves state-of-the-art performance on a range of benchmarks.
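The pipeline described above (merge rollouts into one graph, estimate distance to the goal, score each edge) can be illustrated with a small sketch. This is not the paper's formulation: the summary gives no formulas, so the sketch assumes that states are hashable, that the final state of a successful rollout counts as a goal, that distance is the shortest hop count found by breadth-first search, and that an edge's score is simply the reduction in distance it achieves. The function names and the `unreachable` cap are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(trajectories):
    """Merge all rollouts into one graph.

    Returns the reversed adjacency map, the set of goal states, and the set
    of all states. Assumption: the last state of a successful rollout is a goal.
    """
    reverse_adj = defaultdict(set)
    goals, states = set(), set()
    for traj_states, success in trajectories:
        states.update(traj_states)
        for s, s_next in zip(traj_states, traj_states[1:]):
            reverse_adj[s_next].add(s)
        if success and traj_states:
            goals.add(traj_states[-1])
    return reverse_adj, goals, states


def distance_to_goal(reverse_adj, goals):
    """Multi-source BFS backwards from the goals: hop count to the nearest goal."""
    dist = {g: 0 for g in goals}
    queue = deque(goals)
    while queue:
        s = queue.popleft()
        for prev in reverse_adj.get(s, ()):
            if prev not in dist:
                dist[prev] = dist[s] + 1
                queue.append(prev)
    return dist


def edge_advantages(trajectories, unreachable=None):
    """Score each observed edge by how much closer to a goal it moves the agent.

    Assumption (illustrative only): score = dist(s) - dist(s_next). States that
    cannot reach any goal get a finite cap (default: number of states).
    """
    reverse_adj, goals, states = build_graph(trajectories)
    dist = distance_to_goal(reverse_adj, goals)
    cap = len(states) if unreachable is None else unreachable
    adv = {}
    for traj_states, _ in trajectories:
        for s, s_next in zip(traj_states, traj_states[1:]):
            adv[(s, s_next)] = dist.get(s, cap) - dist.get(s_next, cap)
    return adv


# Two rollouts that share the prefix A -> B; only the first reaches the goal G.
trajectories = [
    (["A", "B", "C", "G"], True),
    (["A", "B", "D"], False),
]
adv = edge_advantages(trajectories)
for edge, value in sorted(adv.items()):
    print(edge, value)
```

In this toy case the shared edge A -> B receives the same positive score in both rollouts, while B -> D is penalized, so the failed rollout still contributes a useful step-level signal instead of being scored uniformly by its final outcome.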

🔑 Implications and Limitations

• Moving beyond simple trajectory-level credit assignment, the state-transition graph allows the value of each individual step to be assessed more precisely.
• Useful step-level contributions can be recovered even from failed trajectories, which can substantially improve learning efficiency.
• Building the state-transition graph and exploiting its information may increase computational complexity.