This paper addresses a fundamental question in cognitive neuroscience: whether visual perception is shaped by the structure of the external world or by the internal structure of the brain. Given that natural stimuli elicit similar brain activity patterns across individuals, we examine whether the stimulus-driven transformation from sensory representations to high-level internal representations follows a common path in humans and deep neural networks (DNNs). Introducing a unified framework that combines cross-individual similarity with alignment to model hierarchy to track representational flow, we analyze three independent fMRI datasets and reveal that the cortex-wide network preserved across individuals comprises two pathways: a medial-ventral pathway representing scene structure and a lateral-dorsal pathway tuned to social and biological content. This functional organization is captured by the hierarchical structure of visual DNNs but not by language models, underscoring the specificity of visual-to-semantic transformations. In conclusion, we show that convergent computational solutions to visual encoding in human and artificial vision alike are driven by the structure of the external world.
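The framework is described here only at a high level. Purely as an illustration, the Python sketch below shows one plausible minimal instantiation under stated assumptions: cross-individual similarity computed as a leave-one-out correlation between subjects' representational dissimilarity matrices (RDMs), and hierarchy alignment as the DNN layer whose RDM best matches a region's subject-averaged RDM. All function names, data shapes, and the toy data are hypothetical; this is not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns):
    """Condensed representational dissimilarity matrix:
    1 - Pearson correlation between responses to each stimulus pair.
    `patterns` has shape (n_stimuli, n_features)."""
    return pdist(patterns, metric="correlation")

def cross_subject_similarity(roi_patterns):
    """Leave-one-out inter-subject similarity: correlate each subject's
    ROI RDM with the mean RDM of the remaining subjects."""
    rdms = np.array([rdm(p) for p in roi_patterns])
    scores = [spearmanr(rdms[s], np.delete(rdms, s, axis=0).mean(axis=0))[0]
              for s in range(len(rdms))]
    return float(np.mean(scores))

def best_aligned_layer(roi_patterns, layer_activations):
    """Assign a region to the DNN layer whose RDM best matches the
    subject-averaged ROI RDM, a simple proxy for hierarchical depth."""
    roi_rdm = np.mean([rdm(p) for p in roi_patterns], axis=0)
    corrs = [spearmanr(roi_rdm, rdm(acts))[0] for acts in layer_activations]
    return int(np.argmax(corrs)), corrs

# Hypothetical toy data: 2 subjects x 8 stimuli, 3 DNN layers.
rng = np.random.default_rng(0)
subjects = [rng.standard_normal((8, 50)) for _ in range(2)]
layers = [rng.standard_normal((8, d)) for d in (64, 128, 256)]
print(cross_subject_similarity(subjects))
print(best_aligned_layer(subjects, layers))
```

Under this reading, a region that is both highly similar across subjects and best aligned to deeper layers would sit later along the putative representational flow.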