This paper investigates the use of the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both an evaluation metric and a learning objective in supervised speech separation when the training references are noisy, as in WSJ0-2Mix. A derivation of SI-SDR with noisy references shows that the noise either limits the achievable SI-SDR or leads to unwanted noise in the separated output. To address this, we propose a method that enhances the reference data using WHAM! and augments the mixtures, so that models are not trained on noisy references. Two models trained on the enhanced dataset are evaluated with the non-intrusive NISQA.v2 metric. The results show reduced noise in the separated speech, but suggest that artifacts introduced when processing the references may limit the overall quality improvement. A negative correlation between SI-SDR and perceived noise was found on the WSJ0-2Mix and Libri2Mix test sets, supporting the derivation results.