Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PARROT: An Open Multilingual Radiology Reports Dataset

Created by
  • Haebom

Authors

Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn, Radhia Ait Chalal, Tugba Akinci D Antonoli, Philippe Amouyel, Henrik Andersson, Raphael Bentegeac, Claudio Benzoni, Antonino Andrea Blandino, Felix Busch, Elif Can, Riccardo Cau, Armando Ugo Cavallo, Christelle Chavihot, Erwin Chiquete, Renato Cuocolo, Eugen Divjak, Gordana Ivanac, Barbara Dziadkowiec Macek, Armel Elogne, Salvatore Claudio Fanni, Carlos Ferrarotti, Claudia Fossataro, Federica Fossataro, Katarzyna Fulek, Michal Fulek, Pawel Gac, Martyna Gachowska, Ignacio Garcia Juarez, Marco Gatti, Natalia Gorelik, Alexia Maria Goulianou, Aghiles Hamroun, Nicolas Herinirina, Krzysztof Kraik, Dominik Krupka, Quentin Holay, Felipe Kitamura, Michail E Klontzas, Anna Kompanowska, Rafal Kompanowski, Alexandre Lefevre, Tristan Lemke, Maximilian Lindholz, Lukas Muller, Piotr Macek, Marcus Makowski, Luigi Mannacio, Aymen Meddeb, Antonio Natale, Beatrice Nguema Edzang, Adriana Ojeda, Yae Won Park, Federica Piccione, Andrea Ponsiglione, Malgorzata Poreba, Rafal Poreba, Philipp Prucker, Jean Pierre Pruvo, Rosa Alba Pugliesi, Feno Hasina Rabemanorintsoa, Vasileios Rafailidis, Katarzyna Resler, Jan Rotkegel, Luca Saba, Ezann Siebert, Arnaldo Stanzione, Ali Fuat Tekin, Liz Toapanta Yanchapaxi, Matthaios Triantafyllou, Ekaterini Tsaoulia, Evangelia Vassalou, Federica Vernuccio, Johan Wasselius, Weilang Wang, Szymon Urban, Adrian Wlodarczak, Szymon Wlodarczak, Andrzej Wysocki, Lina Xu, Tomasz Zatonski, Shuhang Zhang, Sebastian Ziegelmayer, Gregory Kuchcinski, Keno K Bressem

Outline

PARROT (Polyglottal Annotated Radiology Reports for Open Testing) is a large-scale, multi-center, open-access dataset of multilingual radiology reports developed and validated for testing natural language processing applications in radiology. From May to September 2024, radiologists were invited to contribute fictitious reports written according to their standard reporting practices, each providing at least 20 reports with metadata covering anatomic region, imaging modality, and clinical context, plus English translations of non-English reports. Every report was assigned an ICD-10 code.

The resulting dataset comprises 2,658 radiology reports in 13 languages from 76 authors across 21 countries. The most common imaging modalities were CT (36.1%), MRI (22.8%), radiography (19.0%), and ultrasound (16.8%); the most common anatomical regions were the thorax (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%).

In a companion discrimination study, 154 participants (radiologists, other medical professionals, and non-medical professionals) judged whether reports were human- or AI-generated, achieving 53.9% accuracy (95% CI: 50.7%-57.1%); radiologists performed significantly better than the other groups (56.9%, 95% CI: 53.3%-60.6%, p<0.05). PARROT is the largest publicly available multilingual radiology report dataset and enables the development and validation of natural language processing applications across linguistic, regional, and clinical boundaries without privacy restrictions.
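
As a rough illustration of how such a dataset might be consumed, here is a minimal Python sketch that loads report records, tallies the modality distribution, and computes the kind of binomial confidence interval quoted above. The file name and column names (e.g., modality) are assumptions for illustration only and may differ from the actual PARROT release; the numbers passed to the CI helper are placeholders, not the study's raw counts.

```python
import csv
import math
from collections import Counter

def load_reports(path):
    # Hypothetical loader: assumes a CSV export with one row per report.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def modality_distribution(reports):
    # Share of reports per imaging modality (e.g., CT, MRI, radiography, ultrasound).
    counts = Counter(r["modality"] for r in reports)  # assumed column name
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()}

def proportion_ci(successes, trials, z=1.96):
    # Normal-approximation 95% CI for a proportion, the same kind of estimate
    # as the 53.9% (95% CI: 50.7%-57.1%) discrimination accuracy quoted above.
    p = successes / trials
    se = math.sqrt(p * (1 - p) / trials)
    return p, p - z * se, p + z * se

if __name__ == "__main__":
    reports = load_reports("parrot_reports.csv")   # assumed file name
    print(modality_distribution(reports))
    print(proportion_ci(successes=539, trials=1000))  # illustrative numbers only
```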

Takeaways, Limitations

Takeaways:
  • Provides a large-scale, open dataset of multilingual, multi-institutional radiology reports, supporting natural language processing research in radiology.
  • Enables development and validation of natural language processing models across languages and clinical environments without privacy restrictions.
  • The human versus AI discrimination study offers insight into how AI-generated reports can be evaluated against human-written ones.
Limitations:
  • Because the dataset consists of fictitious reports, it may differ from real clinical settings.
  • Despite the diversity of contributing authors, certain languages or regions may be over- or under-represented.
  • Accuracy in the human versus AI discrimination study was only slightly above chance, so further research is needed on how reliably such evaluations reflect model performance.