Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making

Created by
  • Haebom

Authors

Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, Bin Liu

Outline

This paper highlights a limitation of the "multiple-input, single-output" (MISO) architecture used by existing large-scale pre-trained models such as ChatGPT and OpenVLA. In "multiple-input, multiple-output" (MIMO) settings, where several outputs must be produced in parallel, a MISO architecture makes tasks mutually exclusive: tasks that share a single output channel compete for it, which leads to imbalanced optimization and degraded performance. Humans, by contrast, handle such settings without interference, for example by holding a conversation while making decisions. Inspired by this, the authors propose the Visual Language Action Model for Simultaneous Chatting and Decision Making (VLASCD, also called MIMO-VLA), a unified model trained in a MIMO fashion whose parallel outputs let it converse and make decisions at the same time. Experiments on the CARLA autonomous driving platform show that MIMO-VLA significantly outperforms LLMs with MISO conversation capabilities, reinforcement learning models, and VLA models with MISO decision capabilities when conversation and decision-making must be handled simultaneously.
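The MISO/MIMO distinction can be made concrete with a toy sketch. The summary does not describe the model's actual architecture or training objective, so everything below is an assumption made for illustration: the names (`MimoOutput`, `shared_features`, `mimo_forward`, `joint_loss`), the linear heads, and the weighted sum of a cross-entropy term and a mean-squared-error term are hypothetical and are not taken from the paper. The sketch only shows the general idea of one shared representation feeding two output channels at once, so that the chat output and the driving action do not compete for a single channel.

```python
import math
from dataclasses import dataclass


@dataclass
class MimoOutput:
    token_logits: list  # scores over a toy vocabulary (conversation channel)
    action: list        # continuous control values, e.g. [steer, throttle]


def shared_features(observation):
    """Toy stand-in for a shared backbone (hypothetical, not the paper's)."""
    return [math.tanh(x) for x in observation]


def linear(weights, feats):
    """Apply one linear head: one output per weight row."""
    return [sum(w * f for w, f in zip(row, feats)) for row in weights]


def mimo_forward(observation, text_weights, action_weights):
    """Produce both outputs from the same features in a single pass."""
    feats = shared_features(observation)
    return MimoOutput(
        token_logits=linear(text_weights, feats),
        action=linear(action_weights, feats),
    )


def cross_entropy(logits, target_index):
    """Numerically stable cross-entropy for a single target token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]


def mse(pred, target):
    """Mean squared error between predicted and target actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(out, target_token, target_action,
               text_weight=1.0, action_weight=1.0):
    """Assumed objective: a weighted sum of one loss per output channel."""
    return (text_weight * cross_entropy(out.token_logits, target_token)
            + action_weight * mse(out.action, target_action))


# Usage: one observation yields a token distribution and an action together.
out = mimo_forward(
    observation=[0.0, 0.0],
    text_weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 3-token toy vocabulary
    action_weights=[[1.0, 0.0], [0.0, 1.0]],            # 2 control values
)
loss = joint_loss(out, target_token=0, target_action=[0.0, 0.0])
```

A MISO model in this picture would have only one of the two heads, so a second task would have to be serialized through the same channel. Keeping a separate loss term per channel is one common way to balance several outputs during training; whether MIMO-VLA does this is not stated in the summary.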

Takeaways, Limitations

Takeaways:
The paper demonstrates the utility of a MIMO structure that overcomes the limitations of the MISO structure.
It proposes a new model, VLASCD (MIMO-VLA), that is effective for complex tasks such as simultaneous conversation and decision-making.
It experimentally verifies the superior performance of MIMO-VLA in an autonomous driving application.
Limitations:
Further research is needed to evaluate the generalization performance of the proposed model.
The experimental results are limited to the CARLA platform, and performance in other environments requires further validation.
The model's complexity and computational cost are not analyzed.