
Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Created by
  • Haebom

Authors

Yuqi Xiong, Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu

💡 Overview

This paper proposes Lang2Act to overcome the limitations of VRAG frameworks, which draw on external visual documents. Instead of relying on fixed external tools, Lang2Act enables fine-grained visual reasoning through self-emergent linguistic toolchains. It does not separate visual perception from reasoning. Rather, a two-stage RL-based training framework lets the model explore high-quality linguistic tools on its own and then use them effectively. In the experiments, Lang2Act substantially improves the visual perception capabilities of VLMs, achieving performance gains of more than 4%.

🔑 Implications and Limitations

• Presents a new approach that addresses two problems in existing VRAG: the reliance on fixed external tools and the loss of visual information.
• Integrates visual perception and reasoning, and gains the flexibility to generate and use tools dynamically during training.
• The two-stage RL training scheme contributes to the model's ability to generate and use linguistic tools effectively.
• Further validation on certain complex visual reasoning tasks, and further study of how well the approach generalizes across diverse visual domains, are still needed.