Beyond Reward: A Bounded Measure of Agent Environment Coupling

Created by
  • Haebom

Authors

Wael Hafez, Cameron Reid, Amit Nazeri

💡 Overview

To address the distribution shift that arises when reinforcement learning (RL) agents are deployed in the real world, this paper proposes bipredictability (P), a new metric that measures in real time how effectively an agent and its environment interact. Bipredictability measures the fraction of information shared within the observation-action-outcome loop, and the authors develop an Information Digital Twin (IDT) monitor to compute it. In their experiments, bipredictability detects degradation of the agent-environment interaction considerably faster and more accurately than conventional reward-based monitoring.
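To make "fraction of shared information" concrete, here is a minimal sketch under stated assumptions. This summary does not give the paper's formula for P, so the normalization used below, I(X;Y) / (H(X) + H(Y)), is an illustrative assumption, and the function name `shared_information_ratio` is hypothetical. X stands for a discretized (observation, action) pair and Y for the discretized outcome; continuous environments such as HalfCheetah would need binning first. With this normalization the value is bounded in [0, 0.5], because I(X;Y) can never exceed the smaller of the two entropies.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy in bits, from symbol counts over n samples."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def shared_information_ratio(xs, ys):
    """Plug-in estimate of I(X;Y) / (H(X) + H(Y)) for discrete samples.

    ASSUMPTION: this normalization is illustrative only; it is not
    taken from the paper and may differ from its definition of P.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and equally long")
    n = len(xs)
    h_x = entropy(Counter(xs), n)
    h_y = entropy(Counter(ys), n)
    h_xy = entropy(Counter(zip(xs, ys)), n)
    denom = h_x + h_y
    if denom == 0.0:
        return 0.0  # both variables are constant: nothing to share
    mutual_info = h_x + h_y - h_xy
    # Clamp tiny negative values caused by floating-point rounding.
    return max(mutual_info, 0.0) / denom


# Tightly coupled loop: the outcome is fully determined by the input.
inputs = [0, 1, 2, 3] * 50
coupled = shared_information_ratio(inputs, inputs)

# Decoupled loop: the outcome is statistically independent of the input.
xs = [0, 0, 1, 1] * 50
ys = [0, 1, 0, 1] * 50
decoupled = shared_information_ratio(xs, ys)

print(coupled, decoupled)
```

A monitor in the spirit of the IDT could evaluate such a ratio over a sliding window of recent transitions and flag a sustained drop. The window size and alarm threshold would be further assumptions that this summary does not specify.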

🔑 Implications and Limitations

• The proposed bipredictability (P) metric measures the quality of the agent-environment interaction in real time, so it can catch early interaction failures that are hard to see in reward or task metrics.
• The Information Digital Twin (IDT) computes bipredictability efficiently and, under real distribution shifts, can anticipate agent performance degradation with a higher detection rate and lower latency than reward-based monitoring.
• The work presents a new, theoretically grounded real-time monitoring method that could serve as a foundational signal for self-regulation in future deployed RL systems.
• Experiments were conducted only with SAC and PPO agents in the MuJoCo HalfCheetah environment. Further research is needed on how well bipredictability generalizes across different kinds of distribution shift, and on whether it applies to other RL algorithms and environments.