CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Created by
  • Haebom

Authors

Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He, Fei Wang, Heng Yang

💡 Overview

This work targets the inefficient inference process of existing flow-based VLA (Vision-Language-Action) policies. Instead of recovering the action structure directly from Gaussian noise, CF-VLA splits generation into two stages: a coarse stage that produces an action-aware starting point, and a fine stage that corrects the remaining residual error. This split improves efficiency and performance together. In low-NFE (Number of Function Evaluations) settings, the method achieves better performance and faster inference than prior approaches.
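The two-stage idea can be illustrated with a toy sketch. It is not the authors' implementation: `toy_velocity`, `toy_coarse_predictor`, the 3-dimensional action, and the fixed-step Euler integrator are all assumptions made for illustration. The sketch only shows why, at the same NFE, refinement that starts near the target action can end closer to it than refinement that starts from Gaussian noise.

```python
import numpy as np


def euler_refine(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    num_steps is the number of velocity evaluations, i.e. the NFE.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


# Hypothetical ground-truth action for one observation (assumption).
target = np.array([0.5, -0.2, 0.1])


def toy_velocity(x, t):
    # Toy stand-in for a learned flow: always points toward the target action.
    return target - x


def toy_coarse_predictor(rng, batch):
    # Toy stand-in for the coarse stage: the target plus a small error.
    return target + 0.1 * rng.standard_normal((batch, target.size))


rng = np.random.default_rng(0)
batch, nfe = 256, 2

noise_start = rng.standard_normal((batch, target.size))  # Gaussian-noise start
coarse_start = toy_coarse_predictor(rng, batch)           # action-aware start

# Mean distance to the target after the same number of refinement steps.
err_from_noise = np.linalg.norm(
    euler_refine(noise_start, toy_velocity, nfe) - target, axis=1
).mean()
err_from_coarse = np.linalg.norm(
    euler_refine(coarse_start, toy_velocity, nfe) - target, axis=1
).mean()

print(f"mean error from noise start:  {err_from_noise:.4f}")
print(f"mean error from coarse start: {err_from_coarse:.4f}")
```

In this toy field each Euler step shrinks the distance to the target by a fixed factor, so the final error scales with the initial error. A real learned velocity field will behave differently, so the numbers here say nothing about the paper's results.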

🔑 Implications and Limitations

• Importance of a structured starting point: the work demonstrates that, in flow-based models, structuring the starting point plays a decisive role in improving both inference efficiency and performance.
• Efficiency under real-time constraints: in low-NFE settings, the method matches or exceeds the best prior models while sharply reducing inference latency, which makes it suitable for real-time robotic applications.
• Training stabilization strategy: a staged training strategy trains the coarse predictor first and then performs joint optimization, which keeps training stable.
• Limitations / future work: further study is needed on how well the proposed coarse-to-fine approach generalizes across diverse robot tasks and environments, and an in-depth analysis is needed of how the accuracy of the coarse stage affects the performance of the fine stage.
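The staged training strategy in the third bullet can be sketched as a simple loss schedule. This is a minimal sketch under stated assumptions: the function name `staged_loss`, the `coarse_only_steps` threshold, and the unweighted sum used in the joint stage are hypothetical and do not come from the paper.

```python
def staged_loss(step, coarse_loss, fine_loss, coarse_only_steps=1000):
    """Return the training loss for the current step under a two-stage schedule.

    Stage 1 (step < coarse_only_steps): optimize the coarse predictor alone.
    Stage 2 (afterwards): optimize coarse and fine objectives jointly.
    The unweighted sum in stage 2 is an assumption made for illustration.
    """
    if step < coarse_only_steps:
        return coarse_loss
    return coarse_loss + fine_loss


# Usage with placeholder loss values (not real measurements).
for step in (0, 999, 1000):
    print(step, staged_loss(step, coarse_loss=1.0, fine_loss=2.0))
```

Training the coarse predictor first means the fine stage sees reasonable starting points once joint optimization begins, which is one plausible reading of why the schedule stabilizes training.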