
SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Created by
  • Haebom

Authors

Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

💡 Overview

This paper proposes SpotAgent, a new framework for geo-localization in real-world settings where visual information is sparse and ambiguous. SpotAgent introduces an agentic reasoning approach in which a large vision-language model (LVLM) uses external tools (web search, maps, and so on) to actively explore and verify visual clues. This mitigates the hallucination problem of existing models and yields accurate, verifiable geo-localization.
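
To make the explore-and-verify loop concrete, here is a minimal Python sketch of an agent that alternates between tool calls and a final answer. It is an illustration only, not the paper's implementation: `run_agent`, `demo_policy`, `DEMO_TOOLS`, and the tool names `web_search` and `map_lookup` are all hypothetical, the policy is a rule-based stand-in for the LVLM, and the tools return canned strings instead of querying real services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One tool call and the observation it returned."""
    tool: str
    query: str
    observation: str


@dataclass
class AgentState:
    clues: List[str]                          # visual clues read off the image
    trace: List[Step] = field(default_factory=list)
    answer: Optional[Tuple[float, float]] = None  # (lat, lon) once decided


def run_agent(clues, policy, tools: Dict[str, Callable[[str], str]], max_steps=5):
    """Let the policy call tools until it answers or runs out of steps."""
    state = AgentState(clues=list(clues))
    for _ in range(max_steps):
        kind, payload = policy(state)
        if kind == "answer":
            state.answer = payload
            break
        # Otherwise `kind` names a tool and `payload` is its query string.
        observation = tools[kind](payload)
        state.trace.append(Step(kind, payload, observation))
    return state


# Hypothetical stand-in for the LVLM: search the first clue, look the
# result up on a map, then answer with the coordinates it got back.
def demo_policy(state: AgentState):
    if not state.trace:
        return ("web_search", state.clues[0])
    if len(state.trace) == 1:
        return ("map_lookup", state.trace[0].observation)
    lat, lon = state.trace[-1].observation.split(",")
    return ("answer", (float(lat), float(lon)))


# Canned tool outputs for the demo; a real system would call actual services.
DEMO_TOOLS = {
    "web_search": lambda query: "Eiffel Tower, Paris",
    "map_lookup": lambda query: "48.8584,2.2945",
}

demo_state = run_agent(["sign reading 'Tour Eiffel'"], demo_policy, DEMO_TOOLS)
```

The trace kept in `AgentState` is what makes the final answer checkable: each coordinate can be followed back to the tool observations that support it.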

🔑 Implications and Limitations

• The paper shows that external tool integration and agentic reasoning are effective for bringing LVLM geo-localization performance up to real-world conditions.
• A three-stage post-training pipeline (SFT, multi-agent tool-use training, and RL) systematically develops the model's tool-calling and reasoning abilities.
• The Spatially-Aware Dynamic Filtering strategy improves RL training efficiency and contributes to model performance by prioritizing training samples according to their spatial difficulty.
• The proposed SpotAgent reduces hallucination and produces accurate, verifiable results, but further robustness testing in complex, unexpected scenarios and further study of its generalization ability may be needed.
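
The summary does not spell out how Spatially-Aware Dynamic Filtering scores difficulty, so the following Python sketch is only one plausible reading and not the paper's method. It assumes difficulty is measured by the great-circle (haversine) error of the model's sampled predictions, drops samples whose rollouts all succeed or all fail (they give no contrast to learn from), and orders the rest by mean error. The function names, the `threshold_km` value, and the sample layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp guards against rounding pushing h marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def filter_by_spatial_difficulty(samples, threshold_km=25.0):
    """Keep samples with mixed rollout outcomes, hardest (largest error) first.

    Each sample is a dict with keys "id", "truth" (lat, lon) and
    "rollouts" (a list of predicted (lat, lon) points).
    """
    kept = []
    for sample in samples:
        errors = [haversine_km(sample["truth"], p) for p in sample["rollouts"]]
        if not errors:
            continue
        hits = sum(e <= threshold_km for e in errors)
        # Assumed rule: all-correct or all-wrong samples are filtered out.
        if 0 < hits < len(errors):
            kept.append((sum(errors) / len(errors), sample))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in kept]


demo_samples = [
    {"id": "a", "truth": (0.0, 0.0), "rollouts": [(0.0, 0.0), (0.0, 0.0)]},
    {"id": "b", "truth": (0.0, 0.0), "rollouts": [(0.0, 0.1), (0.0, 10.0)]},
    {"id": "c", "truth": (0.0, 0.0), "rollouts": [(0.0, 90.0), (0.0, 180.0)]},
    {"id": "d", "truth": (10.0, 10.0), "rollouts": [(10.0, 10.0), (10.0, 10.5)]},
]
prioritized = filter_by_spatial_difficulty(demo_samples)
```

In this toy data, sample `a` is always solved and sample `c` is never solved, so both are dropped; `b` and `d` have mixed outcomes and are kept, ordered by how far off their rollouts are on average.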