Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Multimodal Medical Code Tokenizer

Created by
  • Haebom

Authors

Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik

Outline

In this paper, we propose MedTok, a multimodal tokenizer that improves how medical codes are tokenized for foundation models trained on patient electronic health records (EHRs). Whereas existing tokenization methods treat medical codes as plain text tokens, MedTok takes into account each code's textual description, its position in the code hierarchy, and its relationships to other codes (e.g., disease co-occurrence and drug-treatment associations). A language model encoder processes the text and a graph encoder processes the relational structure, and both are quantized into a unified token space that preserves modality-specific features as well as cross-modality information. In experiments on the MIMIC-III, MIMIC-IV, and EHRShot datasets, spanning various prediction tasks including diagnosis classification, drug recommendation, and risk stratification, MedTok improves AUPRC over existing tokenization methods, with particularly strong gains in drug recommendation. We also apply MedTok to a medical QA system and confirm a performance improvement there.
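To make the data flow concrete, below is a minimal, simplified sketch of the general idea: embed a code's text description and its relational neighborhood, project both into one joint space, and quantize to a shared codebook entry to obtain a discrete token. This is not the paper's actual architecture; the encoders, projection matrices, codebook, and the single-token-per-code simplification are all hypothetical stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and codebook; in MedTok the text encoder is a
# language model over the code's description and the graph encoder operates
# on the code's relational neighborhood. Here both are mocked as random
# projections purely to show the tokenization flow.
TEXT_DIM, GRAPH_DIM, JOINT_DIM, CODEBOOK_SIZE = 32, 16, 24, 128

W_text = rng.normal(size=(TEXT_DIM, JOINT_DIM))
W_graph = rng.normal(size=(GRAPH_DIM, JOINT_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, JOINT_DIM))  # unified token space


def encode_text(description: str) -> np.ndarray:
    """Mock language-model embedding of a code's textual description."""
    seed = abs(hash(description)) % (2**32)
    return np.random.default_rng(seed).normal(size=TEXT_DIM)


def encode_graph(neighbors: list[str]) -> np.ndarray:
    """Mock graph embedding of the code's relations (e.g., co-occurring
    diseases or associated drugs)."""
    seed = abs(hash(tuple(sorted(neighbors)))) % (2**32)
    return np.random.default_rng(seed).normal(size=GRAPH_DIM)


def tokenize_code(description: str, neighbors: list[str]) -> int:
    """Project both modalities into one space and quantize to the nearest
    codebook entry, yielding a discrete token id for the medical code."""
    joint = encode_text(description) @ W_text + encode_graph(neighbors) @ W_graph
    distances = np.linalg.norm(codebook - joint, axis=1)
    return int(np.argmin(distances))


# Example: tokenize a diagnosis code from its description and related entities.
token_id = tokenize_code(
    "Type 2 diabetes mellitus without complications",
    neighbors=["E11.9", "metformin", "hypertension"],
)
print(f"MedTok-style token id: {token_id}")
```

The token ids produced this way could then replace plain-text code tokens in a downstream EHR model's input sequence, which is the role MedTok plays in the paper's experiments.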

Takeaways, Limitations

Takeaways:
Presents MedTok, a novel tokenization method that leverages both the textual descriptions and the relational information of medical codes.
Experimentally verifies performance improvements over existing methods across various EHR models and tasks.
Suggests potential extension to other medical applications, such as medical QA systems.
Shows particularly significant performance gains on drug recommendation tasks.
Limitations:
The magnitude of MedTok's performance improvement varies across datasets (MIMIC-III, MIMIC-IV, and EHRShot show different gains).
Further research is needed on the scalability of MedTok to effectively handle over 600,000 medical codes.
Further comparative analysis with other medical language models or tokenization techniques is needed.