How Log-Barrier Helps Exploration in Policy Optimization

Posted by
  • Haebom

Authors

Leonardo Cesani, Matteo Papini, Marcello Restelli

💡 Overview

This paper points out that existing guarantees for the Stochastic Gradient Bandit (SGB) algorithm, which show convergence to a globally optimal policy, rest on the unrealistic assumption that the probability of the optimal action stays bounded away from zero. It identifies the lack of an explicit exploration mechanism as the underlying problem. To address this, the proposed log-barrier (LB) regularization augments the SGB objective by structurally enforcing a minimal amount of exploration in the parameterized policy. The results show that the proposed LB-SGB algorithm matches the sample complexity of SGB while converging without that unrealistic assumption.
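To make the idea concrete, here is a minimal sketch of a log-barrier-regularized gradient bandit update. It is not the authors' implementation: it assumes a softmax policy over K arms, a regularizer of the form (eta / K) * sum_a log pi(a), Gaussian reward noise, and the hypothetical names `softmax`, `log_barrier_grad`, and `lb_sgb`. Under the softmax assumption, the gradient of the barrier term with respect to the logits is eta * (1/K - pi), which pulls the policy toward uniform and keeps every action probability strictly positive.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def log_barrier_grad(pi):
    """Gradient w.r.t. softmax logits of (1/K) * sum_a log pi(a).

    For a softmax policy this equals 1/K - pi(a) for each arm a
    (assumed form of the regularizer, not taken from the paper).
    """
    k = pi.shape[0]
    return 1.0 / k - pi


def lb_sgb(means, steps=5000, lr=0.1, eta=0.05, seed=0):
    """Sketch of SGB with an added log-barrier term (illustrative only)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.shape[0]
    theta = np.zeros(k)  # start from the uniform policy
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)                     # sample an arm
        reward = means[a] + rng.normal(scale=0.1)   # assumed noise model
        onehot = np.zeros(k)
        onehot[a] = 1.0
        grad_reward = reward * (onehot - pi)        # standard SGB estimate
        theta += lr * (grad_reward + eta * log_barrier_grad(pi))
    return softmax(theta)


if __name__ == "__main__":
    print(lb_sgb([0.2, 0.5, 0.9]))
```

Setting `eta=0` recovers the plain SGB update in this sketch, which makes it easy to compare the two on the same seed.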

🔑 Implications and Limitations

• Log-barrier regularization addresses the lack of exploration in the standard SGB algorithm, guaranteeing stable convergence even in realistic learning settings.
• The log barrier and Natural Policy Gradient are theoretically connected: both exploit the geometric structure of the policy space and control the Fisher information.
• LB-SGB may converge more slowly than standard SGB; improving this rate is left for future work.