Language-Guided Invariance Probing of Vision-Language Models

Created by
  • Haebom

Author

Jae Joong Lee

💡 Overview

This study proposes Language-Guided Invariance Probing (LGIP), a new benchmark that measures how sensitive existing vision-language models (VLMs) are to linguistic changes. LGIP evaluates two properties: invariance to meaning-preserving paraphrases and sensitivity to meaning-changing edits. Together these diagnose a form of linguistic robustness that conventional zero-shot performance metrics struggle to capture. In the results, some recent VLMs performed well, while other models were vulnerable to semantic changes, a weakness that standard evaluation metrics had difficulty detecting.
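The two properties described above can be illustrated with a minimal sketch. It assumes that image and caption embeddings are already available as plain lists of floats and that image-text agreement is scored by cosine similarity. The function name `probe` and the metric names `invariance_error`, `sensitivity_gap`, and `flip_rate` are hypothetical labels chosen for this illustration; they are not taken from the paper, and its exact definitions may differ.

```python
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def probe(image_emb, caption_emb, paraphrase_embs, edit_embs):
    """Illustrative invariance/sensitivity probe (assumed definitions).

    paraphrase_embs: embeddings of meaning-preserving rewrites of the caption.
    edit_embs: embeddings of meaning-changing edits (e.g. object, color, count).
    """
    s_orig = cosine(image_emb, caption_emb)
    s_para = [cosine(image_emb, e) for e in paraphrase_embs]
    s_edit = [cosine(image_emb, e) for e in edit_embs]
    return {
        # Paraphrases should leave the score unchanged: lower is better.
        "invariance_error": mean([abs(s_orig - s) for s in s_para]),
        # Edits should lower the score: a larger positive gap is better.
        "sensitivity_gap": mean([s_orig - s for s in s_edit]),
        # Fraction of edits scoring at least as high as the original caption.
        "flip_rate": mean([1.0 if s >= s_orig else 0.0 for s in s_edit]),
    }


# Toy 2-D embeddings, made up purely for illustration.
result = probe(
    image_emb=[1.0, 0.0],
    caption_emb=[3.0, 4.0],
    paraphrase_embs=[[6.0, 8.0]],
    edit_embs=[[4.0, 3.0], [0.0, 1.0]],
)
print(result)
```

Under these assumptions, a model can score well on retrieval and still show a high `flip_rate`, which is the kind of failure a probe like this is meant to expose.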

🔑 Implications and Limitations

• Highlights the need for a standardized benchmark for evaluating the linguistic robustness of VLMs.
• Quantitatively compares differences in linguistic sensitivity among major VLMs such as CLIP and OpenCLIP, offering insight for model selection and development.
• Suggests that conventional retrieval metrics alone cannot fully capture VLM reliability, underscoring the importance of new diagnostic tools.
• The current benchmark focuses mainly on specific types of semantic edits, such as object, color, and count, so VLM responses to more complex and diverse linguistic changes still need to be explored.