
MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

Created by
  • Haebom

Authors

Ruoran Li, Xinghua Zhang, Haiyang Yu, Shitong Duan, Xiang Li, Wenxin Xiang, Chonghua Liao, Xudong Guo, Yongbin Li, Jinli Suo

💡 Overview

To address the context growth that arises during long-horizon interaction with an environment, this paper proposes MemPO, an algorithm that overcomes the limitations of existing approaches that rely on external memory modules. MemPO has the agent (the policy model) summarize and manage its memory on its own, which improves memory efficiency while preserving task performance. As a result, it substantially reduces token usage while also improving performance.
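The contrast between an ever-growing context and a self-managed memory can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Policy` signature, the `toy_policy` stand-in, and the word-count measure are hypothetical and are not taken from the paper. In particular, the crude truncation below only stands in for a memory that the policy model would write itself, and the sketch does not cover MemPO's training procedure.

```python
from typing import Callable, List, Tuple

# Hypothetical policy signature: given the task, the current memory text and the
# newest observation, return (action, updated_memory).
Policy = Callable[[str, str, str], Tuple[str, str]]


def toy_policy(task: str, memory: str, observation: str, max_words: int = 8) -> Tuple[str, str]:
    """Stand-in for the policy model: keeps only the most recent words as memory."""
    words = (memory + " " + observation).split()
    # Crude truncation standing in for a learned, self-written summary.
    new_memory = " ".join(words[-max_words:])
    action = f"act_on:{observation.split()[0]}"
    return action, new_memory


def context_sizes(task: str, observations: List[str], policy: Policy) -> Tuple[List[int], List[int]]:
    """Return per-step context sizes (in words) for full-history vs. self-memory prompting."""
    full_history: List[str] = []
    memory = ""
    full_sizes: List[int] = []
    memory_sizes: List[int] = []
    for obs in observations:
        # Baseline: the prompt carries every past observation and action.
        full_prompt = " ".join([task, *full_history, obs])
        full_sizes.append(len(full_prompt.split()))

        # Self-memory: the prompt carries only the task, the memory and the new observation.
        memory_prompt = " ".join([task, memory, obs])
        memory_sizes.append(len(memory_prompt.split()))

        action, memory = policy(task, memory, obs)
        full_history.extend([obs, action])
    return full_sizes, memory_sizes


if __name__ == "__main__":
    obs = [f"page{i} contains fact number {i}" for i in range(6)]
    full, mem = context_sizes("find the answer", obs, toy_policy)
    print("full history:", full)
    print("self-memory: ", mem)
```

In this toy setting the full-history prompt grows with every step, while the self-memory prompt stays bounded by the task length, the memory budget and one observation. How well the task can still be solved then depends entirely on what the memory retains.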

🔑 Implications and Limitations

• The paper demonstrates that letting the agent manage and optimize its own memory can substantially improve both performance and efficiency in long-horizon interactions.
• Improving credit assignment based on memory effectiveness helps selectively preserve important information and reduce the processing of unnecessary information.
• The proposed MemPO achieves notable performance gains together with a sharp reduction in token usage.
• The generalizability of the proposed method still needs to be examined, and further validation across diverse environments is required.