
Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

Created by
  • Haebom

Authors

Pierriccardo Olivieri, Fausto Lasca, Alessandro Gianola, Matteo Papini

💡 Overview

This study proposes a new framework for logically specifying non-Markovian rewards in Markov decision processes (MDPs) with large state spaces. The method builds on LTLfMT (Linear Temporal Logic Modulo Theories over finite traces), in which predicates are not limited to true/false variables but can be first-order formulas over arbitrary first-order theories. This makes it possible to specify complex tasks over unstructured and heterogeneous data domains in a unified and reusable way.
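To make the idea of theory-valued predicates concrete, here is a minimal sketch that evaluates a few finite-trace temporal operators where each atomic predicate is an arithmetic constraint over state variables rather than a Boolean flag. It is an illustration only, not the paper's implementation: the function names, the predicates `in_goal` and `safe`, and the example trace are all assumptions, and only quantifier-free predicates are covered.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
# Assumption: a predicate is any quantifier-free constraint over state variables.
Predicate = Callable[[State], bool]


def eventually(pred: Predicate, trace: List[State]) -> bool:
    """F pred: pred holds at some step of the finite trace."""
    return any(pred(s) for s in trace)


def always(pred: Predicate, trace: List[State]) -> bool:
    """G pred: pred holds at every step of the finite trace."""
    return all(pred(s) for s in trace)


def until(left: Predicate, right: Predicate, trace: List[State]) -> bool:
    """left U right: right holds at some step and left holds at every earlier step."""
    for s in trace:
        if right(s):
            return True
        if not left(s):
            return False
    return False  # right never held within the finite trace


# Hypothetical predicates over a 2-D position, written with nonlinear arithmetic.
def in_goal(s: State) -> bool:
    return (s["x"] - 1.0) ** 2 + (s["y"] - 1.0) ** 2 <= 0.25


def safe(s: State) -> bool:
    return s["x"] ** 2 + s["y"] ** 2 <= 4.0


trace = [{"x": 0.0, "y": 0.0}, {"x": 0.5, "y": 0.6}, {"x": 0.9, "y": 1.1}]
reaches_goal_safely = until(safe, in_goal, trace)  # True for this trace
```

The same `until` function works unchanged if `in_goal` is swapped for a constraint over a different data type, which is the kind of reuse the summary attributes to the logic.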

🔑 Implications and Limitations

• LTLfMT is more expressive than standard LTLf, so it can effectively specify tasks defined over complex and heterogeneous data.
• First-order logic specifications are combined with reward machines and HER (Hindsight Experience Replay) to address reward sparsity and enable efficient learning.
• The added expressiveness of LTLfMT increases both theoretical and computational complexity; to address this, the authors identify a tractable fragment of LTLfMT.
• An evaluation in a continuous-control environment using a nonlinear arithmetic theory shows that a tailored implementation of HER is important for solving tasks with complex goals.
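
The combination of reward machines and hindsight relabeling mentioned above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the paper's tailored HER variant: it uses a hypothetical two-state reward machine whose only guard is a goal-parametrized distance predicate, and the standard HER "final" strategy, which substitutes the last visited position for the goal. All names (`reached`, `run_reward_machine`, `relabel_with_final`) are invented for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def reached(pos: Point, goal: Point, eps: float = 0.1) -> bool:
    """Guard predicate: squared distance to the goal is at most eps**2."""
    return (pos[0] - goal[0]) ** 2 + (pos[1] - goal[1]) ** 2 <= eps ** 2


def run_reward_machine(positions: List[Point], goal: Point) -> List[float]:
    """Hypothetical two-state machine: u0 (pending) -> u1 (done).

    The transition fires on the first step where the guard holds and pays
    reward 1.0; every other step pays 0.0.
    """
    state = "u0"
    rewards: List[float] = []
    for pos in positions:
        if state == "u0" and reached(pos, goal):
            state = "u1"
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards


def relabel_with_final(positions: List[Point]) -> Tuple[Point, List[float]]:
    """HER 'final' strategy: treat the last visited position as the goal
    and recompute rewards by re-running the machine on the same trajectory."""
    hindsight_goal = positions[-1]
    return hindsight_goal, run_reward_machine(positions, hindsight_goal)


positions = [(0.0, 0.0), (0.3, 0.2), (0.6, 0.5)]
original = run_reward_machine(positions, goal=(2.0, 2.0))  # goal never reached
hindsight_goal, relabeled = relabel_with_final(positions)  # last step now pays 1.0
```

A failed trajectory that carries no learning signal under the original goal becomes a successful one under the relabeled goal, which is how relabeling eases reward sparsity. How to relabel when the guards are richer first-order formulas is the part the summary says requires a tailored HER implementation, and this sketch does not attempt it.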