
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Created by
  • Haebom

Authors

Lei Jiang, Chunzhao Xie, Tongxuan Liu, Yuting Zeng, Jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu

💡 Overview

Large vision-language models (LVLMs) show remarkable capabilities, but hallucination makes them hard to deploy in practice. This paper identifies the decay of visual attention during generation as a major cause of hallucination and proposes TARAC (Temporal Attention Real-time Accumulative Connection), a new training-free framework that addresses it. TARAC dynamically accumulates past visual attention and re-injects it, which maintains visual grounding and effectively reduces hallucination.
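The accumulate-and-re-inject idea can be illustrated with a toy sketch. The summary gives no formulas, so everything here is an assumption made for illustration: the decay factor `alpha`, the injection strength `beta`, the additive update, the renormalization, and the helper names `accumulate`, `reinject`, and `image_mass` are all hypothetical and should not be read as the paper's actual update rule.

```python
from typing import List, Optional


def accumulate(prev: Optional[List[float]], current: List[float], alpha: float) -> List[float]:
    """Decayed running sum of attention over image tokens (hypothetical rule)."""
    if prev is None:
        return list(current)
    return [alpha * p + c for p, c in zip(prev, current)]


def reinject(attn: List[float], image_idx: List[int], acc: List[float], beta: float) -> List[float]:
    """Add scaled accumulated attention at image-token positions, then renormalize."""
    boosted = list(attn)
    for pos, a in zip(image_idx, acc):
        boosted[pos] += beta * a
    total = sum(boosted)
    return [w / total for w in boosted]


def image_mass(attn: List[float], image_idx: List[int]) -> float:
    """Total attention weight placed on image tokens."""
    return sum(attn[i] for i in image_idx)


# Toy data: two image tokens (positions 0 and 1) followed by text tokens.
# Each row is one decoding step; attention on the image decays over time.
image_idx = [0, 1]
steps = [
    [0.40, 0.30, 0.30],
    [0.20, 0.15, 0.30, 0.35],
    [0.08, 0.07, 0.25, 0.30, 0.30],
]

acc: Optional[List[float]] = None
before: List[float] = []
after: List[float] = []
for attn in steps:
    # No history exists at the first step, so the row is left unchanged.
    adjusted = attn if acc is None else reinject(attn, image_idx, acc, beta=0.5)
    before.append(image_mass(attn, image_idx))
    after.append(image_mass(adjusted, image_idx))
    acc = accumulate(acc, [adjusted[i] for i in image_idx], alpha=0.5)

for step, (b, a) in enumerate(zip(before, after), start=1):
    print(f"step {step}: image attention {b:.3f} -> {a:.3f}")
```

In this toy setup the attention mass on image tokens stays higher in later steps than it would without re-injection, which is the qualitative behavior the summary describes. A real integration would operate on per-layer, per-head attention tensors inside the model.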

🔑 Implications and Limitations

• Effectively mitigates hallucination in existing LVLMs without any training, specifically by addressing the decay of visual attention during generation.
• Works as a lightweight plug-and-play module that is easy to apply to existing models and adds almost no computational overhead at inference time.
• Outperforms state-of-the-art methods across a range of models and benchmarks, with concrete gains such as fewer hallucinated sentences and higher perception scores.
• The paper notes that TARAC is inspired by cognitive reinforcement mechanisms, but a deeper connection to actual principles of cognitive science and further theoretical analysis of exactly how TARAC reduces hallucination may still be needed.