This paper proposes a novel framework for audio-based 3D visual grounding (Audio-3DVG). In contrast to existing text-based 3D visual grounding research, we address how to use spoken language to locate target objects in 3D point clouds. Rather than treating speech as a monolithic input, we decompose the task into two components: (i) an object mention detection module and (ii) an audio-guided attention module. The object mention detection module explicitly identifies the objects referred to in the speech, while the audio-guided attention module models the interaction between target candidates and mentioned objects to improve identification in cluttered 3D scenes. Furthermore, we synthesize spoken descriptions for existing 3DVG datasets, including ScanRefer, Sr3D, and Nr3D, to support benchmarking. Experimental results demonstrate that the proposed Audio-3DVG framework not only achieves state-of-the-art performance in audio-based grounding but also remains competitive with text-based methods.
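
To make the two-module design concrete, the following is a minimal sketch (not the authors' implementation) of how an audio-guided attention module could fuse an utterance-level audio embedding with features of the target candidates and of the objects flagged by the mention detection module. All class names, feature dimensions, and the cross-attention fusion strategy are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AudioGuidedAttention(nn.Module):
    """Hypothetical audio-guided attention: candidates attend to mentioned objects,
    with the audio embedding injected as a global cue into the queries."""

    def __init__(self, obj_dim: int = 256, audio_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Project the audio embedding into the same space as object features.
        self.audio_proj = nn.Linear(audio_dim, obj_dim)
        # Cross-attention over the set of mentioned objects.
        self.cross_attn = nn.MultiheadAttention(obj_dim, num_heads, batch_first=True)
        # Per-candidate grounding score.
        self.score_head = nn.Linear(obj_dim, 1)

    def forward(self, candidates: torch.Tensor, mentioned: torch.Tensor,
                audio: torch.Tensor) -> torch.Tensor:
        # candidates: (B, Nc, D) features of target candidates
        # mentioned:  (B, Nm, D) features of objects mentioned in the speech
        # audio:      (B, A)     utterance-level audio embedding
        audio_cue = self.audio_proj(audio).unsqueeze(1)            # (B, 1, D)
        queries = candidates + audio_cue                           # audio-conditioned queries
        fused, _ = self.cross_attn(queries, mentioned, mentioned)  # (B, Nc, D)
        return self.score_head(fused).squeeze(-1)                  # (B, Nc) scores


if __name__ == "__main__":
    B, Nc, Nm, D = 2, 8, 3, 256
    scores = AudioGuidedAttention()(torch.randn(B, Nc, D),
                                    torch.randn(B, Nm, D),
                                    torch.randn(B, D))
    print(scores.shape)  # torch.Size([2, 8])
```

In this sketch, the candidate with the highest score would be selected as the grounded object; how the actual framework aggregates audio, candidate, and mentioned-object features is specified in the method section, not here.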