To overcome the limitations of traditional community research, this paper presents StreetLens, a customizable workflow that uses Vision Language Models (VLMs) to perform scalable neighborhood environmental assessments. StreetLens retrieves street view images, focuses the analysis on questions derived from interview protocols, and generates semantic annotations ranging from objective features to subjective perceptions. Because researchers define the role the VLM plays, domain knowledge sits at the core of the analysis process, and integrating existing survey data enhances the robustness of the analysis across diverse environments.
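The workflow described above can be illustrated with a minimal sketch. It is not the StreetLens implementation: the names `ProtocolQuestion`, `build_prompt`, `annotate_images`, and `stub_vlm` are hypothetical, the example questions and role text are invented for illustration, and the VLM is replaced by a stub so the code runs without any model or image service. The sketch only shows how a researcher-defined role and protocol-derived questions might be combined into prompts and applied to each retrieved image.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProtocolQuestion:
    """One question derived from an interview protocol (hypothetical structure)."""
    key: str   # short identifier used as the annotation name
    text: str  # wording of the question put to the model
    kind: str  # "objective" or "subjective"


def build_prompt(role: str, question: ProtocolQuestion) -> str:
    """Combine the researcher-defined role with a single protocol question."""
    return f"{role}\n\nQuestion ({question.kind}): {question.text}"


def annotate_images(
    image_ids: List[str],
    role: str,
    questions: List[ProtocolQuestion],
    vlm: Callable[[str, str], str],
) -> Dict[str, Dict[str, str]]:
    """Ask every question about every image and collect the answers.

    `vlm` is any callable taking (image_id, prompt) and returning text;
    a real pipeline would pass a function that calls an actual model.
    """
    annotations: Dict[str, Dict[str, str]] = {}
    for image_id in image_ids:
        annotations[image_id] = {
            q.key: vlm(image_id, build_prompt(role, q)) for q in questions
        }
    return annotations


def stub_vlm(image_id: str, prompt: str) -> str:
    """Placeholder standing in for a real VLM call (assumption for this sketch)."""
    return f"stub answer for {image_id}"


if __name__ == "__main__":
    # Illustrative role and questions; not taken from any actual protocol.
    role = "You are a trained coder assessing neighborhood street view images."
    questions = [
        ProtocolQuestion("cars", "How many cars are visible?", "objective"),
        ProtocolQuestion("disorder", "How disorderly does the street feel?", "subjective"),
    ]
    print(annotate_images(["img_001", "img_002"], role, questions, stub_vlm))
```

Passing the model in as a callable keeps the role definition and the question set, where the domain knowledge lives, separate from whichever VLM is used.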