This paper underscores the importance of evaluating multimodal comprehension in audiovisual models and identifies shortcomings of the existing VGGSound dataset: incomplete labeling, partially overlapping classes, and modality misalignment. We demonstrate that these flaws can distort the assessment of auditory and visual capabilities, and we address them with VGGSounder, a comprehensively re-annotated multi-label test set. VGGSounder provides detailed modality annotations, enabling modality-specific performance analysis. Using a novel modality confusion metric, we further expose model limitations by analyzing the degradation in performance that occurs when additional input modalities are present.
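The exact formulation of the modality confusion metric is given in the main paper; the minimal sketch below only illustrates the idea stated above, under the assumption that confusion is quantified as the accuracy a model loses once the other modality is added to a unimodal input. The function names, data layout, and the hit-based accuracy definition are hypothetical, not the paper's implementation.

```python
# Illustrative sketch only; names and the accuracy definition are assumptions.
from typing import Dict, List


def accuracy(predictions: List[set], labels: List[set]) -> float:
    """Fraction of samples where at least one predicted class is correct."""
    hits = sum(1 for pred, gold in zip(predictions, labels) if pred & gold)
    return hits / len(labels)


def modality_confusion(
    unimodal_preds: Dict[str, List[set]],  # e.g. {"audio": [...], "video": [...]}
    audiovisual_preds: List[set],          # predictions with both modalities as input
    labels: List[set],                     # multi-label ground truth per sample
) -> Dict[str, float]:
    """Per-modality accuracy drop when the other modality is added to the input.

    A positive value means the model performs worse with the extra modality,
    i.e. it is "confused" by the additional input.
    """
    av_acc = accuracy(audiovisual_preds, labels)
    return {
        modality: accuracy(preds, labels) - av_acc
        for modality, preds in unimodal_preds.items()
    }
```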