SimSUM is a new benchmark dataset of 10,000 simulated patient records in the respiratory disease domain. Each record links structured tabular background variables (symptoms, diagnoses, underlying conditions, etc.), generated using Bayesian networks, to an unstructured clinical note generated by GPT-4o. The notes are annotated with span-level symptom mentions. The dataset is designed primarily to support research on clinical information extraction in settings where tabular background variables are available. It can also be used for research on automated clinical reasoning, causal effect estimation in the presence of tabular and/or textual confounders, and multimodal synthetic data generation. It is not suitable for clinical decision support systems or for training production-level models.
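
To make the pairing of tabular variables, note text, and span-level annotations concrete, here is a minimal Python sketch of how one such record could be represented. Everything in it is an assumption for illustration: the class names, field names, variable names, and the example note are hypothetical and do not reflect the actual SimSUM schema or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomSpan:
    """Hypothetical span-level annotation (not the real SimSUM schema)."""
    start: int    # inclusive character offset into the note
    end: int      # exclusive character offset into the note
    symptom: str  # name of the tabular symptom variable the span mentions


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical record: tabular variables plus an annotated note."""
    tabular: dict[str, str]
    note: str
    spans: list[SymptomSpan]


def find_span(note: str, phrase: str, symptom: str) -> SymptomSpan:
    """Build a span for the first occurrence of `phrase` in `note`."""
    start = note.index(phrase)  # raises ValueError if the phrase is absent
    return SymptomSpan(start, start + len(phrase), symptom)


def span_texts(record: PatientRecord) -> list[tuple[str, str]]:
    """Return (symptom variable, mentioned text) for each annotated span."""
    return [(s.symptom, record.note[s.start:s.end]) for s in record.spans]


# Invented example note and variables, for illustration only.
note = "Patient reports a dry cough for three days. No fever."
record = PatientRecord(
    tabular={"cough": "yes", "fever": "no", "pneumonia": "no"},
    note=note,
    spans=[
        find_span(note, "dry cough", "cough"),
        find_span(note, "fever", "fever"),
    ],
)

for symptom, text in span_texts(record):
    print(f"{symptom}: {text!r} (tabular value: {record.tabular[symptom]})")
```

In this sketch a span only marks where a symptom is mentioned. Whether the symptom is present or negated ("No fever") is read from the tabular variable, which is one way an information extraction model could be evaluated against the structured ground truth.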