ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Written by
  • Haebom

Authors

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

💡 Overview

This paper proposes ViSRA, a new framework for exploring the 3D spatial reasoning capability of multi-modal large language models (MLLMs). ViSRA requires no additional training: it takes explicit spatial information extracted by expert models and uses it to drive the MLLM's spatial reasoning in a modular fashion. As a result, it improves the MLLM's 3D spatial understanding without any training and performs strongly across a range of 3D spatial reasoning tasks.
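To make the "expert models feed explicit spatial information to an untrained MLLM" idea concrete, here is a minimal sketch of such a modular, training-free pipeline. It is not the paper's implementation: the `SpatialFact` schema, the prompt layout, and the `stub_expert` / `stub_mllm` callables are all hypothetical stand-ins chosen for illustration, since the summary does not say which expert models or prompt format ViSRA uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class SpatialFact:
    """One explicit spatial fact about the scene (hypothetical schema)."""
    label: str
    position_m: Tuple[float, float, float]  # assumed (x, y, z) in metres


def facts_to_prompt(facts: Sequence[SpatialFact], question: str) -> str:
    """Serialise expert-model output as plain text that an MLLM can read."""
    lines = [
        f"- {f.label}: x={f.position_m[0]:.1f}, "
        f"y={f.position_m[1]:.1f}, z={f.position_m[2]:.1f}"
        for f in facts
    ]
    return "Scene objects (metres):\n" + "\n".join(lines) + f"\nQuestion: {question}"


def answer_spatial_question(
    frames: Sequence[object],
    question: str,
    expert: Callable[[Sequence[object]], List[SpatialFact]],
    mllm: Callable[[str], str],
) -> str:
    """Training-free pipeline: expert extraction, then MLLM reasoning.

    Neither component is fine-tuned; both are swappable callables.
    """
    facts = expert(frames)
    return mllm(facts_to_prompt(facts, question))


# Stand-ins so the sketch runs without any real model (assumptions, not ViSRA parts).
def stub_expert(frames: Sequence[object]) -> List[SpatialFact]:
    return [
        SpatialFact("chair", (1.0, 0.0, 2.0)),
        SpatialFact("table", (3.0, 0.0, 2.0)),
    ]


def stub_mllm(prompt: str) -> str:
    n = sum(1 for line in prompt.splitlines() if line.startswith("- "))
    return f"stub answer ({n} objects)"
```

Because the expert and the MLLM are passed in as plain callables, either one can be replaced independently, which is one way to read the "modular" claim in the overview.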

🔑 Implications and Limitations

  • Presents a new approach that improves the 3D spatial reasoning ability of MLLMs without training.
  • Achieves human-like 3D spatial understanding that transfers across diverse tasks.
  • Confirms performance gains on both existing benchmarks and unseen 3D spatial reasoning tasks.
  • Leaves room to improve the performance and efficiency of the spatial information extraction models that ViSRA itself relies on.