This paper analyzes the internal mechanisms of large audio-language models (LALMs) to better understand how they recognize auditory attributes. We apply a lexical projection technique to three state-of-the-art LALMs to track how attribute information evolves across layers and token positions. We find that attribute information decreases with layer depth when attribute recognition fails, and that resolving attributes in early layers correlates with higher accuracy. Furthermore, we show that LALMs rely heavily on querying the auditory input rather than aggregating the necessary information into the hidden states at the positions where attributes are mentioned. Based on these findings, we propose methods for improving LALM performance and suggest directions for future work.
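To illustrate the general idea behind such a lexical projection, the following is a minimal logit-lens-style sketch: each layer's hidden state is projected through the model's final layer norm and unembedding matrix to read off the probability of an attribute word. It assumes a HuggingFace-style causal LM interface (gpt2 is used here purely as a stand-in for an LALM's text backbone; the model name, the last-token position choice, and the first-subtoken proxy are illustrative assumptions, not the paper's exact setup).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; an LALM backbone exposing hidden states the same
# way as a HuggingFace causal LM is assumed here.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def lexical_projection(prompt: str, attribute_word: str) -> list[float]:
    """Project each layer's hidden state at the final token position
    through the unembedding matrix (logit-lens style) and return the
    probability assigned to the attribute word at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)

    # Token id of the attribute word (first sub-token used as a proxy).
    attr_id = tokenizer.encode(" " + attribute_word)[0]

    probs_per_layer = []
    # hidden_states: one tensor per layer, plus the embedding layer.
    for hidden in outputs.hidden_states:
        h = hidden[0, -1]              # hidden state at the last token position
        h = model.transformer.ln_f(h)  # final layer norm (GPT-2 naming)
        logits = model.lm_head(h)      # unembedding projection to vocabulary
        probs = torch.softmax(logits, dim=-1)
        probs_per_layer.append(probs[attr_id].item())
    return probs_per_layer


# Example: trace how strongly each layer encodes the attribute "dog".
trace = lexical_projection("The sound of a barking", "dog")
for layer, p in enumerate(trace):
    print(f"layer {layer:2d}: P(attribute) = {p:.4f}")
```

Tracing this probability across layers is what makes the paper's layer-depth observations measurable: a trace that decays (or never rises) with depth corresponds to failed attribute recognition, while an early rise corresponds to the early-layer resolution associated with higher accuracy.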