This paper presents GeoGuess, a novel benchmark task for multimodal inference that requires understanding, integrating, and drawing inferences from diverse modalities of data. Given a street-view photograph as input, GeoGuess asks a model to identify the location where it was taken and to provide a detailed description. Solving the task requires reasoning over visual cues at multiple levels of granularity, from local details to overall scene context, and connecting them with extensive geographic knowledge. We construct GeoExplain, a benchmark dataset for GeoGuess consisting of panorama-coordinate-description tuples, and propose SightSense, a multimodal, multilevel inference method that produces location predictions together with comprehensive descriptions by combining hierarchical visual information with external knowledge. Experimental results demonstrate the strong performance of SightSense on the GeoGuess task.
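To make the structure of a GeoExplain sample concrete, the sketch below shows one way such a panorama-coordinate-description tuple could be represented in Python. The field names, types, and example values are illustrative assumptions, not the released dataset schema.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the actual GeoExplain schema.
from dataclasses import dataclass


@dataclass
class GeoExplainSample:
    """One benchmark sample: a street-view panorama, its ground-truth
    GPS coordinates, and a descriptive explanation of the location."""
    panorama_path: str  # path to the street-view panorama image
    latitude: float     # ground-truth latitude in degrees
    longitude: float    # ground-truth longitude in degrees
    description: str    # description linking visual cues to the location


# Hypothetical example values for illustration:
sample = GeoExplainSample(
    panorama_path="panoramas/000001.jpg",
    latitude=48.8584,
    longitude=2.2945,
    description="Haussmann-style facades and French signage suggest central Paris.",
)
```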