Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Audio-3DVG: Unified Audio -- Point Cloud Fusion for 3D Visual Grounding

Created by
  • Haebom

Authors

Duc Cao-Dinh, Khai Le-Duc, Anh Dao, Bach Phan Tat, Chris Ngo, Duy MH Nguyen, Nguyen X. Khanh, Thanh Nguyen-Tang

Outline

This paper proposes Audio-3DVG, a framework for audio-based 3D visual grounding: locating target objects in 3D point clouds from spoken language rather than the text descriptions used in existing 3D visual grounding research. Instead of treating speech as a monolithic input, the framework decomposes the task into two components: (i) an object mention detection module, which explicitly identifies the objects mentioned in the speech, and (ii) an audio-guided attention module, which models the interaction between target candidates and the mentioned objects to improve target identification in cluttered 3D scenes. To support benchmarking, the authors synthesize spoken descriptions for existing 3DVG datasets such as ScanRefer, Sr3D, and Nr3D. Experimental results show that Audio-3DVG not only achieves state-of-the-art performance in audio-based grounding but is also competitive with text-based methods.
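To illustrate the cross-attention idea behind an audio-guided attention module, here is a minimal sketch in plain Python. The function name, feature dimensions, and the residual-style fusion are illustrative assumptions, not the paper's actual architecture: each target-candidate feature attends over the features of the objects mentioned in the speech, and the attention-weighted context is added back to the candidate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def audio_guided_attention(candidates, mentioned, d):
    """For each target-candidate feature (query), attend over the
    features of objects mentioned in the speech (keys/values) and
    return the candidate fused with the attention-weighted context."""
    fused = []
    for q in candidates:
        # Scaled dot-product scores against each mentioned object.
        scores = [dot(q, k) / math.sqrt(d) for k in mentioned]
        weights = softmax(scores)
        # Attention-weighted sum of mentioned-object features.
        context = [sum(w * k[i] for w, k in zip(weights, mentioned))
                   for i in range(d)]
        # Residual fusion: candidate plus its context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused
```

With a single mentioned object the attention weight is 1, so the fused feature is simply the candidate plus that object's feature; with several mentioned objects the candidate is pulled toward the ones it is most similar to, which is how relational cues from speech can disambiguate targets in a crowded scene.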

Takeaways, Limitations

Takeaways:
  • Presents a novel approach to speech-based 3D visual grounding (Audio-3DVG) that achieves state-of-the-art performance.
  • Improves 3D scene understanding by integrating speech with spatial information.
  • Demonstrates the feasibility of bringing spoken language into 3D vision tasks.
  • Supports benchmarking by synthesizing spoken descriptions for existing 3DVG datasets.
Limitations:
  • Reliance on synthesized speech data, which may not adequately reflect the diverse voice characteristics of real-world environments.
  • Overall system performance depends heavily on the object mention detection and audio-guided attention modules, each of which leaves room for improvement.
  • Further research is needed on robustness to varied speech conditions (noise, dialects, etc.).