
Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Created by Haebom

Authors

Yankai Yang, Yancheng Long, Hongyang Wei, Wei Chen, Tianke Zhang, Kaiyu Jiang, Haonan Fan, Changyi Liu, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

💡 Overview

Existing reward models have struggled to capture global semantic consistency and implicit logical constraints in complex visual editing tasks. To address this, the paper proposes Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. JRM internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation.
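This summary does not state the paper's actual objective, so the following is only a minimal sketch of what jointly optimizing preference learning and language modeling could look like. It assumes a Bradley-Terry pairwise preference loss on scalar reward scores plus a weighted next-token negative log-likelihood on a reasoning text, both of which would be computed from the same backbone. The function names and the `lm_weight` parameter are hypothetical and do not come from the paper.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected)).

    Assumed form (Bradley-Terry); the source does not specify the loss.
    """
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def lm_loss(token_log_probs):
    """Mean negative log-likelihood over the target text tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(score_chosen, score_rejected, token_log_probs, lm_weight=0.5):
    """Hypothetical joint objective: preference loss + lm_weight * LM loss.

    In a real model both terms would backpropagate into one shared
    vision-language backbone; here they are plain floats for illustration.
    """
    preference = bradley_terry_loss(score_chosen, score_rejected)
    language = lm_loss(token_log_probs)
    return preference + lm_weight * language


if __name__ == "__main__":
    # Toy values: the preferred edit scores higher than the rejected one,
    # and each reasoning token was predicted with probability 0.5.
    log_probs = [math.log(0.5), math.log(0.5)]
    print(joint_loss(2.0, 0.0, log_probs, lm_weight=0.5))
```

Under these assumptions, only the scalar score would be needed at inference time, so no reasoning text has to be generated, which is one way to read the efficiency claim above.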

🔑 Implications and Limitations

• JRM improves both efficiency and semantic understanding, marking a significant advance in visual reward modeling.
• The joint training approach effectively overcomes the limitations of prior approaches, particularly in tasks where reasoning ability matters, such as complex visual editing.
• The work substantially improves the stability and performance of downstream online reinforcement learning, demonstrating its practical applicability.
• Further research may be needed on the interpretability and generalization of the internal reasoning process that JRM learns.