
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

Posted by
  • Haebom

Authors

Sangin Lee, Yukyung Choi

💡 Overview

In large vision-language models (LVLMs), visual tokens account for a substantial share of the computational cost. Existing token pruning methods struggle on pixel grounding tasks, where the importance of each visual token depends on the text query. Starting from a CLIP analysis showing that visual tokens inside the referred region exhibit low similarity to the text representation, this paper proposes LiteLVLM, a training-free method that prunes visual tokens guided by the text. LiteLVLM inverts the text-visual similarity ranking so that the visual tokens covering the referred region are retained, and it recovers context tokens to give a clear foreground-background separation, achieving efficient pixel grounding.
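The selection rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `prune_tokens_inverted`, its parameters `keep` and `context`, and the use of cosine similarity are all hypothetical choices made for this example. In particular, the summary does not say how context tokens are recovered, so the sketch substitutes a simple stand-in that re-adds evenly spaced tokens from the pruned set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens_inverted(visual_tokens, text_embedding, keep, context=0):
    """Return the sorted indices of the visual tokens to keep.

    Hypothetical sketch: `keep` tokens are chosen by the INVERTED ranking
    (lowest text-visual similarity first), then `context` extra tokens are
    re-added from the pruned set.
    """
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]

    # Inverted ranking: sort ascending, so the least text-similar tokens come first.
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    kept = order[:keep]

    # ASSUMPTION: the source does not specify the context-recovery rule.
    # As a stand-in, re-add evenly spaced tokens (in positional order) from the pruned set.
    pruned = sorted(order[keep:])
    recovered = []
    n = min(context, len(pruned))
    if n > 0:
        step = len(pruned) / n
        recovered = [pruned[int(i * step)] for i in range(n)]

    return sorted(set(kept) | set(recovered))


if __name__ == "__main__":
    # Toy 2-D embeddings: tokens 2 and 3 are the least similar to the text vector.
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
    text = [1.0, 0.0]
    print(prune_tokens_inverted(tokens, text, keep=2))             # [2, 3]
    print(prune_tokens_inverted(tokens, text, keep=2, context=1))  # [0, 2, 3]
```

A conventional text-guided pruner would sort in descending order and keep the most similar tokens. The only change needed to express the inversion is the sort direction, which is why the approach requires no training.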

🔑 Implications and Limitations

  • Demonstrates that inverting text-visual similarity enables effective token pruning for pixel grounding tasks.
  • Shows that substantial performance and efficiency gains can be achieved without any training or fine-tuning.
  • Outperforms existing methods by more than 5% across a range of token budgets, and retains 90% of the original performance while delivering a 22% speedup and a 2.3x reduction in memory.
  • LiteLVLM focuses specifically on pixel grounding, so its generalization to other vision-language tasks may require further validation.