Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark

Created by
  • Haebom

Author

Rashid Mushkani

Outline

This paper presents a small-scale benchmark for evaluating how well vision-language models (VLMs) perceive urban scenes, motivated by the role such perception plays in urban design and planning. The benchmark uses 100 images of Montreal streets, split evenly between photographs and realistic synthetic images; 12 participants provided 230 annotation forms across 30 dimensions combining physical attributes and subjective impressions. Seven VLMs were evaluated in a zero-shot setting using accuracy and Jaccard overlap. The models aligned with human annotations more strongly on visible, objective attributes than on subjective impressions.
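The Jaccard overlap mentioned above compares the set of labels a model selects for an image against the set chosen by human annotators. The sketch below is a minimal illustration of that metric; the function name and the example labels are hypothetical and are not taken from the paper's released tooling.

```python
from typing import Set

def jaccard_overlap(pred: Set[str], gold: Set[str]) -> float:
    """Jaccard overlap between a model's predicted label set and the human label set."""
    if not pred and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(pred & gold) / len(pred | gold)

# Hypothetical example: labels a VLM might assign to one street image
# versus the labels chosen by human annotators on the same dimension.
model_labels = {"sidewalk", "trees", "benches"}
human_labels = {"sidewalk", "trees", "bike lane"}

print(jaccard_overlap(model_labels, human_labels))  # 2 shared / 4 total = 0.5
```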

Takeaways, Limitations

Takeaways:
Provides a benchmark for evaluating VLM performance on urban perception.
Confirms strong model alignment on objective, visible attributes.
Releases the benchmark, prompts, and tooling for reproducible evaluation.
Points to potential use in participatory urban analysis.
Limitations:
Model alignment on subjective impressions is relatively weak.
Performance degrades slightly on synthetic images.
The dataset is small (100 images).