Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

Created by
  • Haebom

Author

Gul Rukh Khattak, Konstantinos Patlatzoglou, Joseph Barker, Libor Pastika, Boroumand Zeidaabadi, Ahmed El-Medany, Hesham Aggour, Yixiu Liang, Antonio H. Ribeiro, Jeffrey Annis, Antonio Luiz Pinho Ribeiro, Junbo Ge, Daniel B. Kramer, Jonathan W. Waks, Evan Brittain, Nicholas Peters, Fu Siong Ng, Arunashis Sau

Outline

This paper presents CAPE, a contrastive learning-based foundation model for electrocardiogram (ECG) analysis, pretrained on ECG data from a large, diverse population (5,203,352 individuals across three continents). We systematically evaluate how the demographic characteristics, health status, and diversity of different pretraining cohorts affect downstream prediction performance. We find that pretraining on diverse cohorts improves in-distribution accuracy but degrades out-of-distribution (OOD) generalization, because the model encodes cohort-specific artifacts. To address this, we propose an In-Distribution Batch (IDB) strategy during pretraining that maintains within-cohort consistency and improves OOD robustness. This provides crucial insights for developing clinically fair and generalizable foundation models.
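The paper does not include code here, but the core IDB idea, composing each pretraining batch from a single cohort so that contrastive negatives share the same distribution, can be illustrated with a short sketch. The snippet below is an assumption for illustration only: the class name `InDistributionBatchSampler`, the `cohort_labels` argument, and the drop-remainder behavior are not taken from the paper; it simply shows one plausible way to implement cohort-restricted batching with a PyTorch batch sampler.

```python
# Minimal sketch (not the authors' implementation) of an "In-Distribution Batch"
# sampler: every batch is drawn from a single cohort, so contrastive negatives
# share the same acquisition and demographic distribution.
import random
from collections import defaultdict
from torch.utils.data import Sampler


class InDistributionBatchSampler(Sampler):
    """Yields index batches whose samples all come from one cohort."""

    def __init__(self, cohort_labels, batch_size, seed=0):
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        # Group dataset indices by cohort (e.g. hospital or country of origin).
        self.by_cohort = defaultdict(list)
        for idx, cohort in enumerate(cohort_labels):
            self.by_cohort[cohort].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.by_cohort.values():
            indices = indices[:]           # copy, then shuffle within the cohort
            self.rng.shuffle(indices)
            # Keep only full batches so every batch is strictly single-cohort.
            for i in range(0, len(indices) - self.batch_size + 1, self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        self.rng.shuffle(batches)          # interleave cohorts across training steps
        yield from batches

    def __len__(self):
        return sum(len(v) // self.batch_size for v in self.by_cohort.values())
```

A loader could then be built with `DataLoader(dataset, batch_sampler=InDistributionBatchSampler(cohort_labels, batch_size=256))`, so that each contrastive step sees positives and negatives drawn from a single cohort while successive steps still cover all cohorts.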

Takeaways, Limitations

Takeaways:
  • We show that the performance of contrastive learning-based foundation models depends strongly on the distributional characteristics (demographics, health status, etc.) of the pretraining cohort.
  • We find that pretraining on diverse cohorts improves in-distribution performance but can degrade OOD generalization.
  • We propose the IDB strategy to improve OOD robustness, pointing toward clinically fair and generalizable models.
Limitations:
  • Further research is needed to determine whether the IDB strategy transfers to other data types or other machine learning methods.
  • Generalization performance needs further validation in real-world clinical settings with different cohort characteristics (e.g., disease distributions).
  • The definitions and selection criteria for the various cohorts need a clearer explanation.