RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Posted by
  • Haebom

Authors

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

💡 Overview

This paper proposes RAR (Retrieving And Ranking augmented MLLMs), a method that combines CLIP's broad recognition ability with the fine-grained classification ability of MLLMs to improve few-shot and zero-shot recognition on datasets with large vocabularies. RAR uses a CLIP-based multimodal retriever to build an explicit, per-category memory. At inference time, it retrieves candidates from that memory and has the MLLM rank them to produce the final prediction. The approach addresses the limits of fine-grained recognition while preserving the model's broad knowledge base, and it substantially improves accuracy across a range of vision-language recognition tasks.
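The retrieve-then-rank flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for CLIP features, the memory stores one mean embedding per category (an assumed layout), and `build_memory`, `retrieve_top_k`, and `rank_with_mllm` are hypothetical names. The ranking step accepts any callable in place of a real MLLM.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_memory(examples: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """Store one mean embedding per category (assumed memory layout)."""
    memory: Dict[str, List[float]] = {}
    for label, vecs in examples.items():
        dim = len(vecs[0])
        memory[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return memory


def retrieve_top_k(query: Vector, memory: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k category names whose embeddings are most similar to the query."""
    ordered = sorted(memory, key=lambda label: cosine(query, memory[label]), reverse=True)
    return ordered[:k]


def rank_with_mllm(candidates: List[str], ranker: Callable[[List[str]], List[str]]) -> str:
    """Let a ranker (a stand-in for the MLLM) reorder candidates; return its top choice."""
    ranked = ranker(candidates)
    return ranked[0]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; a real system would use CLIP features.
    memory = build_memory({
        "sparrow": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
        "finch": [[0.8, 0.5, 0.0]],
        "truck": [[0.0, 0.1, 1.0]],
    })
    candidates = retrieve_top_k([0.85, 0.45, 0.05], memory, k=2)
    # The identity ranker keeps the retrieval order; an MLLM could reorder it.
    print(rank_with_mllm(candidates, ranker=lambda names: names))
```

The split mirrors the division of labor in the summary: similarity search narrows a large vocabulary to a few candidates, and the more expensive model only has to choose among those.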

🔑 Implications and Limitations

• Combining the strengths of CLIP and MLLMs effectively improves few-shot and zero-shot recognition on fine-grained datasets with large vocabularies.
• Explicit retrieval from external memory, followed by MLLM-based ranking, mitigates the MLLM's context-window constraints and the growth in complexity that comes with large vocabularies.
• The paper demonstrates substantial gains on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
• Further analysis may be needed of the method's computational efficiency and of how the external retriever's performance affects the performance of the overall system.
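To illustrate the context-window point in the list above: instead of listing the full vocabulary in the prompt, only the retrieved top-k category names are passed to the MLLM. A hypothetical prompt builder might look like the sketch below. The prompt wording and the function name `build_ranking_prompt` are assumptions for illustration and are not taken from the paper.

```python
from typing import List


def build_ranking_prompt(candidates: List[str]) -> str:
    """Format retrieved category names into a ranking instruction (assumed wording)."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(candidates, start=1))
    return (
        "Here are candidate categories retrieved for the image:\n"
        f"{numbered}\n"
        "Rank them from most to least likely and return the names in order."
    )


if __name__ == "__main__":
    print(build_ranking_prompt(["finch", "sparrow"]))
```

The prompt length grows with k rather than with the size of the vocabulary, which is why retrieval keeps the ranking step within the model's context budget.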