
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Posted by
  • Haebom

Authors

Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang

💡 Overview

This paper proposes a framework called 'Separate First, Fuse Later (SFFL)' to address the cross-modal interference that arises in audio-visual question answering (AVQA) models. SFFL guides the model to reason over each modality independently and to integrate the information only at the final stage, which reduces hallucinations caused by one modality interfering with the other. In experiments, the proposed method improved overall accuracy and robustness on AVQA benchmarks.
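The separate-then-fuse idea described above can be illustrated with a small prompt-and-parse sketch. This is only a minimal illustration under stated assumptions, not the paper's implementation: the tag names (`audio_reasoning`, `visual_reasoning`, `fusion`, `answer`) and the helper functions `build_prompt` and `parse_response` are hypothetical, since this summary does not specify the actual output format.

```python
import re

# Hypothetical section tags; the summary does not state the paper's real format.
SECTIONS = ("audio_reasoning", "visual_reasoning", "fusion", "answer")


def build_prompt(question: str) -> str:
    """Build an instruction asking for per-modality reasoning before fusion."""
    return (
        f"Question: {question}\n"
        "1. Reason using only the audio inside "
        "<audio_reasoning></audio_reasoning>.\n"
        "2. Reason using only the video frames inside "
        "<visual_reasoning></visual_reasoning>.\n"
        "3. Combine both lines of evidence inside <fusion></fusion>.\n"
        "4. Give the final answer inside <answer></answer>."
    )


def parse_response(text: str) -> dict:
    """Extract each tagged section; a missing section maps to None."""
    parsed = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return parsed


def is_well_formed(parsed: dict) -> bool:
    """True only if every section is present and non-empty."""
    return all(parsed.get(name) for name in SECTIONS)
```

Keeping the two reasoning traces in separate sections makes it possible to inspect which modality a claim came from before the fusion step combines them.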

🔑 Implications and Limitations

• Presents a new reasoning approach that maximizes the complementarity of audio and visual information while still using each modality's unique information without interference.
• Shows that reinforcement learning combined with modality preference labels is an effective way to teach the model which modality's information to weigh more heavily for a given input.
• Preserving information during the modality-specific reasoning stage and designing an effective integration mechanism for the fusion stage are both important, and further optimization of these steps remains to be explored.