Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models

Created by
  • Haebom

Author

Aakash Tripathi, Asim Waqas, Matthew B. Schabath, Yasin Yilmaz, Ghulam Rasool

Outline

HONeYBEE is an open-source multimodal biomedical data-integration framework for oncology applications. It processes structured and unstructured clinical data, whole-slide images, radiology scans, and molecular profiles, and generates unified patient-level embeddings using domain-specific foundation models and fusion strategies. These embeddings support survival prediction, cancer-type classification, patient similarity retrieval, and cohort clustering. Evaluated on over 11,400 patients across 33 cancer types from TCGA, the clinical embeddings showed the strongest unimodal performance, with 98.5% classification accuracy and 96.4% precision@10 for patient retrieval, and achieved the highest survival-prediction concordance index across most cancer types. Multimodal fusion offers complementary benefits for specific cancers, improving survival prediction beyond what clinical features alone achieve. A comparative evaluation of four large language models shows that general-purpose models such as Qwen3 outperform specialized medical models for clinical text representation, although task-specific fine-tuning improves performance on heterogeneous data such as pathology reports.
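As a rough illustration of the idea of fusing per-modality embeddings into a single patient-level vector and then using it for similarity retrieval, here is a minimal sketch. The function names, the concatenation-based fusion, and the embedding dimensions are assumptions for illustration only; they are not HONeYBEE's actual API or fusion method.

```python
import numpy as np

def fuse_embeddings(modalities):
    """Concatenate per-modality embeddings into one patient-level vector.

    `modalities` maps modality name -> (n_patients, dim) array.
    Concatenation is just one simple fusion strategy.
    """
    # L2-normalize each modality so no single one dominates by scale
    normed = [m / np.linalg.norm(m, axis=1, keepdims=True)
              for m in modalities.values()]
    return np.concatenate(normed, axis=1)

def retrieve_similar(embeddings, query_idx, k=10):
    """Return indices of the k most similar patients by cosine similarity."""
    q = embeddings[query_idx]
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) *
                             np.linalg.norm(q))
    sims[query_idx] = -np.inf  # exclude the query patient itself
    return np.argsort(sims)[::-1][:k]

# Toy example: 100 patients with clinical (768-d) and imaging (512-d) embeddings
rng = np.random.default_rng(0)
fused = fuse_embeddings({
    "clinical": rng.standard_normal((100, 768)),
    "imaging": rng.standard_normal((100, 512)),
})
neighbors = retrieve_similar(fused, query_idx=0, k=10)
```

A precision@10 score like the one reported above would then be the fraction of those 10 retrieved patients sharing the query patient's cancer type.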

Takeaways, Limitations

Takeaways: The paper presents an effective framework for integrating diverse medical data modalities to improve oncology research and prediction performance. In particular, it demonstrates the superior unimodal performance of clinical-data embeddings, shows that multimodal fusion can improve survival prediction for specific cancers, and validates general-purpose LLMs for processing medical text.
Limitations: Evaluation relies solely on the TCGA dataset, so generalizability to other cohorts remains to be verified. The benefit of multimodal fusion may be limited for certain cancer types. Further research is needed on the model's interpretability and explainability.