Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

DM-Bench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

Created by
  • Haebom

Authors

Maria Ana Cardei, Josephine Lamp, Mark Derdzinski, Karan Bhatia

Outline

DM-Bench is the first benchmark designed to evaluate the performance of large language models (LLMs) on the daily-life decision-making tasks faced by people with diabetes. It provides a comprehensive evaluation framework for prototyping patient-centered AI solutions in diabetes, glycemic management, and metabolic health. Spanning seven task categories, the benchmark generates 360,600 personalized questions from one month of time-series data (blood glucose traces from continuous glucose monitoring (CGM) and behavioral logs such as meal and activity patterns) collected from 15,000 individuals across three diabetes populations (Type 1, Type 2, and prediabetes/general health and wellness). Eight state-of-the-art LLMs are evaluated across five metrics: accuracy, evidence base, safety, clarity, and feasibility.
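The paper itself does not provide code; the sketch below is only an illustration, under assumed names, of how such a multi-metric evaluation loop could be structured: each model answers every personalized question and is then graded on each of the five metrics, with per-metric scores averaged per model. The `answer` and `score` stubs stand in for the LLM call and the per-metric grader, neither of which is specified here.

```python
# Illustrative sketch only; function and field names are assumptions, not the paper's implementation.
from dataclasses import dataclass
from statistics import mean

METRICS = ["accuracy", "evidence_base", "safety", "clarity", "feasibility"]

@dataclass
class Question:
    prompt: str          # personalized question built from CGM data and behavioral logs
    population: str      # "type1", "type2", or "prediabetes"
    task_category: str   # one of the seven task categories

def answer(model: str, question: Question) -> str:
    # Stub: replace with an actual call to the LLM under evaluation.
    return f"[{model}] response to: {question.prompt}"

def score(metric: str, question: Question, response: str) -> float:
    # Stub: replace with a per-metric grader (rubric or judge model) returning a score in [0, 1].
    return 0.0

def evaluate(model: str, questions: list[Question]) -> dict[str, float]:
    # Average each metric over all questions for one model.
    per_metric: dict[str, list[float]] = {m: [] for m in METRICS}
    for q in questions:
        response = answer(model, q)
        for m in METRICS:
            per_metric[m].append(score(m, q, response))
    return {m: mean(scores) for m, scores in per_metric.items()}
```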

Takeaways, Limitations

Takeaways:
  • Contributes to improving the reliability, safety, effectiveness, and practicality of AI solutions for people with diabetes.
  • Evaluates LLMs comprehensively across seven task categories grounded in the kinds of questions people with diabetes face in daily life.
  • Generates personalized questions from a large dataset covering diverse diabetes populations.
  • Assesses model performance across five metrics: accuracy, evidence base, safety, clarity, and feasibility.
  • Comparisons across the eight evaluated LLMs reveal the strengths and weaknesses of each model.
Limitations:
  • No single model consistently performs best across all metrics.
  • (The paper does not explicitly discuss further limitations.)