Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Posted by
  • Haebom

Authors

Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng

💡 Overview

๋ณธ ๋…ผ๋ฌธ์€ ๋Œ€๊ทœ๋ชจ ์‹œ๊ฐ-์–ธ์–ด ๋ชจ๋ธ(LVLMs)์—์„œ ํ…์ŠคํŠธ ์ƒ์„ฑ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์ง์— ๋”ฐ๋ผ ์‹œ๊ฐ ์ •๋ณด์— ๋Œ€ํ•œ ์ฃผ์˜๊ฐ€ ํฌ์„๋˜๋Š” "์‹œ๊ฐ ์‹ ํ˜ธ ํฌ์„(Visual Signal Dilution)" ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ "์˜๊ตฌ ์‹œ๊ฐ ๋ฉ”๋ชจ๋ฆฌ(Persistent Visual Memory, PVM)" ๋ชจ๋“ˆ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. PVM์€ LVLMs์˜ ํ”ผ๋“œํฌ์›Œ๋“œ ๋„คํŠธ์›Œํฌ(FFN)์™€ ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ž‘๋™ํ•˜์—ฌ ์‹œ๊ฐ ์ž„๋ฒ ๋”ฉ์— ๋Œ€ํ•œ ๊ฑฐ๋ฆฌ ๋ถˆ๋ณ€์˜ ๊ฒ€์ƒ‰ ๊ฒฝ๋กœ๋ฅผ ์ œ๊ณตํ•จ์œผ๋กœ์จ, ๊นŠ์€ ์ƒ์„ฑ ๊ณผ์ •์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์‹œ๊ฐ ์‹ ํ˜ธ ์–ต์ œ๋ฅผ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, PVM์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์ฆ๊ฐ€ ์—†์ด Qwen3-VL ๋ชจ๋ธ์—์„œ ํ‰๊ท  ์ •ํ™•๋„๋ฅผ ๊พธ์ค€ํžˆ ํ–ฅ์ƒ์‹œ์ผฐ์œผ๋ฉฐ, ํŠนํžˆ ๋ณต์žกํ•œ ์ถ”๋ก  ์ž‘์—…์—์„œ ์ง€์†์ ์ธ ์‹œ๊ฐ ์ธ์‹์„ ์š”๊ตฌํ•˜๋Š” ๊ฒฝ์šฐ์— ํšจ๊ณผ์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

🔑 Implications and Limitations

• Presents a new approach to improving how well LVLMs retain visual information over long generations.
• Demonstrates that efficient performance gains are possible by adding a lightweight module to an existing model architecture.
• Further validation is needed to determine whether the proposed PVM module is mainly effective on specific complex reasoning tasks or is generally applicable to all kinds of vision-language tasks.