CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

Written by
  • Haebom
Authors

Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

💡 Overview

๋ณธ ๋…ผ๋ฌธ์€ ๋‹ค์–‘ํ•œ ๋น„๋””์˜ค ์•„์นด์ด๋ธŒ์—์„œ ์งˆ๋ฌธ ๊ด€๋ จ ์ฆ๊ฑฐ๋ฅผ ์ฐพ๊ณ  ๊ฐ ์ฃผ์žฅ์„ ์ถœ์ฒ˜์™€ ์—ฐ๊ฒฐํ•ด์•ผ ํ•˜๋Š” ์‹ค์ œ ๋‰ด์Šค ์‚ฌ๊ฑด์— ๋Œ€ํ•œ ๋ฉ€ํ‹ฐ๋น„๋””์˜ค ์งˆ์˜์‘๋‹ต(VQA) ์‹œ์Šคํ…œ์˜ ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด CRAFT(Critic-Refined Adaptive Key-Frame Targeting)๋ผ๋Š” ๋™์  ํ‚คํ”„๋ ˆ์ž„ ์„ ํƒ, ๋‹ค๊ตญ์–ด ๋Œ€์ฒด ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ˜ ๋น„๋””์˜ค๋ณ„ ASR, ๊ทธ๋ฆฌ๊ณ  ์ฃผ์žฅ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ฒ€์ฆํ•˜๊ณ  ์ˆ˜์ •ํ•˜๋Š” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋น„ํ‰ ๋ฃจํ”„๋ฅผ ๊ฒฐํ•ฉํ•œ ์ฟผ๋ฆฌ ์กฐ๊ฑด๋ถ€ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

🔑 Implications and Limitations

• On the MAGMaR 2026 dataset, CRAFT achieves a strong average score (0.739), reference recall (0.810), and citation F1 (0.635), substantially improving multi-video VQA performance.
• It also performs strongly on a MAGMaR-style WikiVideo dataset (0.823 Avg), showing that the proposed claim-focused evidence aggregation generalizes to datasets beyond MAGMaR.
• Atomic claims, ASR, and the critic loop each play a key role in improving performance over the basic query-conditioned baseline.
• The method's ability to generalize to real-world news events, and its robustness across languages and video formats, require further study.