
When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

Posted by
  • Haebom

Authors

Yann Berthelot, Philippe Preux, Riad Akrour

💡 Overview

This paper systematically compares reinforcement learning (RL) methods that exploit a competent but suboptimal expert controller in continuous control problems. It exposes the potential failure modes of existing methods under a range of expert-instability conditions and, building on those findings, proposes a decision rule for determining when to rely on the expert.

🔑 Implications and Limitations

• Pitfalls of relying on the expert: The paper identifies three major failure modes overlooked in prior work (critic blind spots, residual saturation, and buffer contamination), raising awareness of how reliable expert-guided RL actually is.
• Systematic comparison and a decision tool: Using a shared backbone, a strict evaluation protocol, and multiple seeds, the paper compares the methods on equal footing and offers a practical rule for deciding whether to use the expert, based on indicators observable in advance such as expert quality, task termination, and perturbation type.
• Fundamental limit or budget constraint: Even when the expert is close to optimal, no query-time expert method outperformed the expert within the training budget considered. It remains unclear whether this is a fundamental limitation of expert-guided RL or simply the result of an insufficient training budget.