Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

Created by
  • Haebom

Authors

Paloma Rabaey, Stefan Heytens, Thomas Demeester

Outline

SimSUM is a new benchmark dataset of 10,000 simulated patient records in the respiratory disease domain. Each record links structured background variables (symptoms, diagnoses, underlying conditions, etc.) generated using Bayesian networks to unstructured text (clinical notes generated by GPT-4o). The clinical notes are annotated with span-level symptom mentions. The dataset is primarily designed to support research on clinical information extraction in settings where tabular background variables are available. It can also be used for research on automating clinical inference, estimating causal effects in the presence of tabular and/or textual confounders, and generating multimodal synthetic data. However, it is not suitable for clinical decision support systems or for training production-level models.
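To make the record structure concrete, here is a minimal Python sketch of how a record pairing tabular variables with an annotated note could look. It is not the SimSUM generation pipeline. The three-variable network, all probabilities, the `Span` and `Record` types, and the template that stands in for the GPT-4o note are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field

# Hypothetical toy network: pneumonia -> fever, pneumonia -> cough.
# All probabilities are made up for illustration, not taken from SimSUM.
P_PNEUMONIA = 0.05
P_FEVER_GIVEN = {True: 0.80, False: 0.10}
P_COUGH_GIVEN = {True: 0.90, False: 0.20}


@dataclass
class Span:
    start: int    # inclusive character offset into the note
    end: int      # exclusive character offset into the note
    symptom: str  # name of the tabular variable the mention refers to


@dataclass
class Record:
    tabular: dict
    note: str
    spans: list = field(default_factory=list)


def sample_tabular(rng):
    """Ancestral sampling: draw the parent first, then each child given it."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}


def annotate(note, symptoms):
    """Mark the first literal mention of each symptom name in the note."""
    lowered = note.lower()
    spans = []
    for symptom in symptoms:
        i = lowered.find(symptom)
        if i != -1:
            spans.append(Span(i, i + len(symptom), symptom))
    return spans


def make_record(rng):
    tabular = sample_tabular(rng)
    present = [s for s in ("fever", "cough") if tabular[s]]
    # A fixed template stands in for the LLM-written clinical note.
    if present:
        note = "Patient reports " + " and ".join(present) + "."
    else:
        note = "Patient reports no respiratory symptoms."
    return Record(tabular, note, annotate(note, present))


if __name__ == "__main__":
    rng = random.Random(0)
    record = make_record(rng)
    print(record.tabular)
    print(record.note)
    for span in record.spans:
        print(span.symptom, repr(record.note[span.start:span.end]))
```

Storing spans as character offsets keeps each annotation tied to both the note text and the tabular variable it mentions, which is the kind of link between the two modalities that an information extraction model could be evaluated against.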

Takeaways, Limitations

Takeaways:
Provides a new benchmark dataset for clinical information extraction research in which tabular background information is available alongside the text.
Can also be used for research on automating clinical inference, estimating causal effects, and generating multimodal synthetic data.
Contributes to improving the reproducibility of research by clarifying the link between structured data and unstructured text.
Limitations:
Because the data are simulated, they may differ from real clinical data.
Not suitable for clinical decision support systems or training production-level models.
The size of the dataset (10,000 records) may not fully reflect the complexity of real clinical environments.