OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Created by
  • Haebom

Authors

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang

💡 Overview

This paper addresses a limitation of existing omni-video models: unlike humans, they struggle to understand the world by integrating information from multiple senses. To that end, it proposes OmniVideo-R1, a new reinforcement framework that improves multi-sensory reasoning through query-based focused learning and modality attention fusion. The proposed method outperforms existing models on several benchmarks, demonstrating its effectiveness and generalization ability.

🔑 Implications and Limitations

• Presents a new approach that strengthens omni-video understanding by integrating multi-sensory information (visual, auditory, and so on) more effectively.
• Proposes a framework that "reinforces" the model's reasoning ability by drawing on self-supervised and contrastive learning paradigms.
• The proposed OmniVideo-R1 shows strong performance across a range of benchmarks and has the potential to advance the field of omni-video understanding.
• (Limitations or future work) The paper's abstract alone does not make the specific limitations or future research directions clear. Possible examples include computational complexity in real-world deployment, overfitting to particular datasets, and extensibility to other modalities.