JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Posted by
  • Haebom

Authors

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

💡 Overview

Existing audio-visual large language models (AV-LLMs) are confined to 2D video and single-channel audio, which limits reliable source localization and spatial reasoning in 3D space. This paper proposes JAEGER, a framework that integrates RGB-D observations with multi-channel ambisonics, extending AV-LLMs to 3D space and enabling joint spatial grounding and reasoning.

🔑 Implications and Limitations

• Highlights the importance of explicit 3D modeling for audio-visual interaction and reasoning in 3D space.
• Introduces the Neural Intensity Vector (Neural IV), a new learned spatial audio representation that improves direction estimation accuracy even in noisy environments with overlapping sound.
• Builds SpatialSceneQA, a benchmark of 61,000 samples based on simulated physical environments, to support large-scale training and systematic evaluation.
• The current work is confined to simulated environments, so further research is needed on generalization and applicability to real physical environments.
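
For context on the Neural IV point above, the sketch below shows the classical, hand-crafted acoustic intensity vector that learned representations of this kind are generally contrasted with. It is not the paper's method. It assumes first-order ambisonics with channels W, X, Y, Z and SN3D-style plane-wave gains (the summary only says "multi-channel ambisonics"), and the function names `encode_foa` and `intensity_doa` are hypothetical, chosen for illustration.

```python
import math
import random


def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as a plane wave in first-order ambisonics (W, X, Y, Z).

    Assumption: SN3D-style gains, with W carrying the unscaled signal.
    Angles are in radians.
    """
    gx = math.cos(azimuth) * math.cos(elevation)
    gy = math.sin(azimuth) * math.cos(elevation)
    gz = math.sin(elevation)
    w = list(signal)
    x = [gx * s for s in signal]
    y = [gy * s for s in signal]
    z = [gz * s for s in signal]
    return w, x, y, z


def intensity_doa(w, x, y, z):
    """Estimate (azimuth, elevation) in radians from the time-averaged
    product of W with X, Y and Z.

    With the encoding convention above, this vector points toward the source.
    """
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Synthetic check: one clean white-noise source at azimuth 60 deg, elevation 20 deg.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4000)]
true_az, true_el = math.radians(60.0), math.radians(20.0)
w, x, y, z = encode_foa(source, true_az, true_el)
est_az, est_el = intensity_doa(w, x, y, z)
print(math.degrees(est_az), math.degrees(est_el))
```

For a single clean source the estimate matches the true direction up to floating-point error. Once several sources overlap or noise is added, the averaged vector drifts toward an energy-weighted blend of directions, which is the kind of condition the summary says Neural IV is meant to handle better.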