
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Written by
  • Haebom

Authors

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

💡 Overview

This study introduces Modality-Mutual Attention (MMA), proposed to address the vision-language misalignment problem in multimodal large language models (MLLMs). It points out a limitation of the causal attention mechanism in existing MLLMs: earlier information such as images cannot sufficiently learn from later information such as text. MMA addresses this by allowing image tokens to attend to text tokens. As a result, it achieves an average performance improvement of 6.2% across 12 multimodal understanding benchmarks without additional parameters.
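To make the masking idea concrete, here is a minimal sketch of how such an attention mask could be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `modality_mutual_mask` and the `is_image` flag list are hypothetical, and the sketch assumes that attention stays causal within each modality while only the image-to-text direction is unlocked.

```python
def modality_mutual_mask(is_image):
    """Build a boolean attention mask for a mixed image/text token sequence.

    mask[q][k] is True when the query at position q may attend to the key
    at position k. `is_image` is a hypothetical per-token flag list
    (True = image token, False = text token).
    """
    n = len(is_image)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            # Standard causal rule: a token sees itself and earlier tokens.
            causal = k <= q
            # Assumed unlocking rule: an image query may also see any text
            # key, including text that appears later in the sequence.
            image_to_text = is_image[q] and not is_image[k]
            row.append(causal or image_to_text)
        mask.append(row)
    return mask


# Toy sequence: three image tokens followed by two text tokens.
is_image = [True, True, True, False, False]
mask = modality_mutual_mask(is_image)
for row in mask:
    print([int(allowed) for allowed in row])
```

In this toy case the text rows keep the usual lower-triangular pattern, while the image rows gain access to the two text columns. Only the mask changes, so no weights are added, which is consistent with the summary's claim that MMA introduces no additional parameters.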

🔑 Implications and Limitations

• Presents a new approach that effectively addresses vision-language misalignment from the standpoint of the fundamental architectural design of MLLMs.
• The proposed Modality-Mutual Attention (MMA) can be applied easily to existing LLM backbones without additional parameters, and is shown to be extensible to a variety of multimodal scenarios.
• The study focuses mainly on vision-language datasets; further research is needed on a generalized MMA design that covers more modalities such as audio and video.