AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

Created by
  • Haebom

Authors

Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi

💡 Overview

This paper proposes AsyncVLA, a new Vision-Language-Action (VLA) model that uses Asynchronous Flow Matching (AFM) to address unstable action generation in long-horizon robot control tasks. AsyncVLA gives action-token generation temporal flexibility and adds a self-correction capability: it rates the confidence of the initially generated actions and selectively revises the tokens judged inaccurate. The authors report that this improves data efficiency and outperforms existing models in both simulation and real-robot settings.
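The two-stage idea described above (draft all action tokens, rate them, then regenerate only the weak ones) can be illustrated with a toy sketch. This is not the authors' implementation. The closed-form `toy_velocity`, the distance-based confidence score, the 0.5 threshold, and the choice to re-noise low-confidence tokens and reset their time to 0 are all assumptions made for illustration; in the paper the velocity field and the confidence rater would be learned components.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight path from noise (t=0) to a known `target` (t=1), the exact
    velocity is (target - x) / (1 - t). `t` holds one time value per token.
    """
    return (target - x) / np.maximum(1.0 - t, 1e-6)[:, None]

def integrate(x, t, target, active, n_steps):
    """Euler-integrate only the `active` tokens from their own time up to t=1."""
    x, t = x.copy(), t.copy()
    # Per-token step size; inactive (frozen) tokens get a step of 0.
    dt = np.where(active, (1.0 - t) / n_steps, 0.0)
    for _ in range(n_steps):
        v = toy_velocity(x, t, target)
        x = x + dt[:, None] * v
        t = t + dt
    return x, t

def demo(seed=0, n_tokens=8, dim=2, n_steps=10):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=(n_tokens, dim))  # stand-in for the "correct" actions

    # Stage 1 (synchronous): every token shares the same time, starting at t=0.
    x0 = rng.normal(size=(n_tokens, dim))
    everyone = np.ones(n_tokens, dtype=bool)
    draft, _ = integrate(x0, np.zeros(n_tokens), target, everyone, n_steps)

    # Simulate a draft with two inaccurate action tokens.
    draft[np.array([2, 5])] += 1.0

    # Assumed confidence rater: it peeks at the target, which a real rater cannot do.
    confidence = np.exp(-np.linalg.norm(draft - target, axis=1))
    mask = confidence < 0.5  # tokens selected for correction

    # Stage 2 (asynchronous): selected tokens restart from noise at t=0,
    # while confident tokens stay fixed at t=1.
    x = draft.copy()
    x[mask] = rng.normal(size=(int(mask.sum()), dim))
    t = np.where(mask, 0.0, 1.0)
    refined, _ = integrate(x, t, target, mask, n_steps)
    return draft, refined, target, mask

if __name__ == "__main__":
    draft, refined, target, mask = demo()
    print("tokens regenerated:", np.flatnonzero(mask))
    print("max error before:", np.abs(draft - target).max())
    print("max error after: ", np.abs(refined - target).max())
```

The point of the sketch is the per-token time vector `t`: in the synchronous stage all entries advance together, while in the asynchronous stage each token carries its own time, so confident tokens can sit at t=1 as fixed context while the others are regenerated.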

🔑 Implications and Limitations

• Shows that the instability of conventional Synchronous Flow Matching (SFM) in long-horizon robot control tasks can be effectively addressed with an asynchronous approach.
• Introduces a self-correction mechanism into action generation, pointing to a new way of improving the accuracy and stability of the generated actions.
• The training procedure that unifies SFM and AFM increases the model's KV-cache utilization, which improves efficiency.
• The added complexity of the proposed method and the validation of its generalization across diverse robot environments remain topics for future work.