This paper presents a small-scale benchmark for evaluating the urban cognition abilities of vision-language models (VLMs), i.e., how well they understand the urban landscapes that inform design and planning. Using 100 images of Montreal streets (an equal split of real photographs and realistic synthetic images), 12 participants completed 230 annotation forms covering 30 dimensions that combine physical attributes and subjective impressions. Seven VLMs were evaluated in a zero-shot setting, with accuracy and Jaccard overlap as metrics. We found that the models align more closely with visible, objective features than with subjective assessments.
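The Jaccard overlap metric mentioned above can be sketched as follows. This is a minimal illustration of the standard Jaccard index over label sets, not the paper's exact evaluation code; the example labels are hypothetical.

```python
def jaccard(pred, gold):
    """Jaccard overlap between two label sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(pred), set(gold)
    if not a and not b:
        return 1.0  # both empty: treat as perfect agreement
    return len(a & b) / len(a | b)

# Hypothetical example: a model's predicted street features vs. annotator labels.
score = jaccard({"tree", "sidewalk", "car"}, {"tree", "sidewalk", "bench"})
print(score)  # 2 shared labels out of 4 total → 0.5
```

For multi-label dimensions, such a per-image score would typically be averaged across images and annotators.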