In this paper, we present a method for 3D object detection without manual annotations, leveraging a 2D vision-language model trained on web-scale image-text pairs to overcome the limitations of existing 3D object detection datasets, namely their narrow class vocabularies and expensive manual annotation. We generate text-conditioned object proposals with the 2D vision-language detector, segment them with SAM, and project the resulting masks into 3D using camera geometry together with LiDAR or monocular pseudo-depth. We then infer 3D bounding boxes without any training, using DBSCAN clustering followed by a geometric dilation strategy based on rotating calipers. We also construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset that simulates adverse real-world conditions. Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and RGB-only (pseudo-depth) inputs, while requiring no training and supporting an open vocabulary.
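To make the box-inference step concrete, the following is a minimal sketch (not the authors' implementation) of how 3D boxes can be fit to the points back-projected from one 2D mask: the densest DBSCAN cluster is kept, and a yaw-oriented box is obtained from a rotating-calipers-style minimum-area rectangle in the ground plane. The `eps` and `min_samples` values are illustrative assumptions, and the paper's dilation strategy is not reproduced here.

```python
# Sketch only: fit a 3D box to back-projected points of one proposal.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN


def min_area_rect(xy):
    """Minimum-area oriented rectangle of 2D points (rotating-calipers idea:
    the optimal rectangle shares an edge direction with the convex hull)."""
    hull = xy[ConvexHull(xy).vertices]                      # hull vertices (CCW)
    edges = np.diff(np.vstack([hull, hull[:1]]), axis=0)    # hull edges, closed
    angles = np.arctan2(edges[:, 1], edges[:, 0])
    best = None
    for a in angles:
        c, s = np.cos(-a), np.sin(-a)
        rot = hull @ np.array([[c, -s], [s, c]]).T          # align edge with x-axis
        lo, hi = rot.min(axis=0), rot.max(axis=0)
        area = np.prod(hi - lo)
        if best is None or area < best[0]:
            c2, s2 = np.cos(a), np.sin(a)
            center = ((lo + hi) / 2.0) @ np.array([[c2, -s2], [s2, c2]]).T
            best = (area, center, hi - lo, a)
    _, center, size, yaw = best
    return center, size, yaw


def fit_3d_box(points, eps=0.5, min_samples=5):
    """points: (N, 3) back-projected points -> (center, size, yaw). Values of
    eps/min_samples are placeholders, not the paper's settings."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    if (labels >= 0).any():
        keep = labels == np.bincount(labels[labels >= 0]).argmax()
        pts = points[keep]                                  # densest cluster only
    else:
        pts = points                                        # all noise: keep everything
    (cx, cy), (lx, ly), yaw = min_area_rect(pts[:, :2])     # ground-plane rectangle
    z_lo, z_hi = pts[:, 2].min(), pts[:, 2].max()           # height from z extent
    center = np.array([cx, cy, (z_lo + z_hi) / 2.0])
    size = np.array([lx, ly, z_hi - z_lo])
    return center, size, yaw
```

In this reading, the 2D detector and SAM only decide which points belong to an object; the box itself comes from purely geometric fitting, which is what allows the pipeline to operate without any 3D training.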