
Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

Created by
  • Haebom

Authors

Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li

💡 Overview

This study aims to overcome the limitations of the RLRR framework, which replaces the single scalar reward of conventional reinforcement learning with a multi-dimensional, structured rubric-based evaluation. Existing approaches linearly compress the vector reward with fixed weights. The proposed ARL-RR instead optimizes one semantic rubric meta-class at a time, in sequence, which removes the fixed scalarization. This improves both model performance and training efficiency, with particularly strong results in experiments on the HealthBench dataset.
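The contrast between the two reward schemes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the meta-class names, the weights, the round-robin phase order, and the function names `scalarized_reward` and `alternating_reward` are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical meta-class names, chosen only for illustration.
META_CLASSES: List[str] = ["accuracy", "completeness", "communication"]


def scalarized_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Fixed-weight scalarization: collapse the reward vector into one number."""
    return sum(weights[name] * scores[name] for name in weights)


def alternating_reward(scores: Dict[str, float], phase: int) -> float:
    """Alternating scheme: each training phase uses a single meta-class score.

    Assumption: phases cycle through META_CLASSES in round-robin order.
    """
    active = META_CLASSES[phase % len(META_CLASSES)]
    return scores[active]


# Rubric scores for one response (made-up numbers).
scores = {"accuracy": 0.5, "completeness": 0.25, "communication": 1.0}
weights = {"accuracy": 0.5, "completeness": 0.25, "communication": 0.25}

print(scalarized_reward(scores, weights))                  # one blended value
print([alternating_reward(scores, p) for p in range(4)])   # one value per phase
```

In the scalarized case the policy only ever sees the blended number, so the result depends on the hand-picked weights. In the alternating case no weights are needed, because each phase optimizes a single meta-class score directly.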

🔑 Implications and Limitations

• Addresses two problems with the fixed-weight scalarization used in existing RLRR: it fails to capture correlations between reward dimensions, and it is sensitive to artificial score design.
• By optimizing each meta-class in sequence, ARL-RR induces a variance contraction effect in reward aggregation, which contributes to the performance gains. A procedure that dynamically selects the next meta-class keeps training focused on the important objectives.
• Uniformly outperformed scalarized methods across a range of model scales, and also improved training efficiency.
• Further validation is needed on whether the approach generalizes to domains beyond the HealthBench dataset.
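The dynamic selection step mentioned above could look roughly like the sketch below. The summary does not state the actual selection rule, so the rule used here (pick the meta-class with the lowest mean score on a recent batch) is an assumption, as are the function name `select_next_meta_class` and the sample scores.

```python
from statistics import mean
from typing import Dict, List


def select_next_meta_class(batch_scores: Dict[str, List[float]]) -> str:
    """Pick the meta-class with the lowest mean score on a recent batch.

    Assumption: "focusing on important objectives" is approximated here by
    focusing on the objective the policy currently scores worst on.
    """
    return min(batch_scores, key=lambda name: mean(batch_scores[name]))


# Per-meta-class rubric scores from a recent batch (made-up numbers).
recent = {
    "accuracy": [0.8, 0.6],
    "completeness": [0.3, 0.5],
    "communication": [0.9, 0.7],
}

print(select_next_meta_class(recent))
```

With these numbers the weakest meta-class is `completeness`, so it would be the target of the next training phase under this assumed rule.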