
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Created by
  • Haebom

Authors

Peilong Wang, Jason Holmes, Zhengliang Liu, Dequan Chen, Tianming Liu, Jiajian Shen, Wei Liu

Outline

This study evaluated the ability of five recently released large language models (LLMs) (OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet) to answer radiation oncology physics questions. Performance was measured on 100 multiple-choice questions written by professional physicists, and reasoning ability was probed by randomly shuffling the answer options or replacing the correct option with "None of the above answers is correct." The study also examined whether "Explain first" and "Step-by-step" prompts improved reasoning. All models showed expert-level performance, and o1-preview outperformed medical physicists under majority voting. However, performance dropped significantly when the correct option was replaced with "None of the above answers is correct," suggesting that reasoning ability still needs improvement. The "Explain first" and "Step-by-step" prompts improved the reasoning of some models.
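To make the evaluation setup concrete, below is a minimal Python sketch of the two question perturbations described above (shuffling the answer options and substituting the correct option with "None of the above answers is correct") together with the two prompting strategies. The function names and exact prompt wording are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical sketch (not the authors' code): perturbing a multiple-choice
# question as described in the summary.

def shuffle_options(options: list[str], answer_idx: int) -> tuple[list[str], int]:
    """Randomly reorder the options and return the new index of the correct answer."""
    order = list(range(len(options)))
    random.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)

def replace_with_none(options: list[str], answer_idx: int) -> tuple[list[str], int]:
    """Remove the correct option and make 'None of the above' the right answer."""
    modified = [opt for i, opt in enumerate(options) if i != answer_idx]
    modified.append("None of the above answers is correct.")
    return modified, len(modified) - 1

# Prompt prefixes corresponding to the two strategies mentioned in the summary
# (exact wording here is assumed, not taken from the paper).
EXPLAIN_FIRST = "Explain your reasoning first, then state the final answer."
STEP_BY_STEP = "Let's think step by step before choosing an answer."

def build_prompt(question: str, options: list[str], strategy: str | None = None) -> str:
    """Format a lettered multiple-choice prompt, optionally prefixed by a strategy."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prefix = f"{strategy}\n\n" if strategy else ""
    return f"{prefix}{question}\n{lettered}\nAnswer:"
```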

Takeaways, Limitations

Takeaways:
Recent LLMs demonstrate expert-level ability to answer radiation oncology physics questions.
Shows the potential of LLMs in radiation oncology physics education and training.
Certain prompting strategies (explanation first, step-by-step) have been shown to be effective in improving the reasoning skills of some LLMs.
Limitations:
Replacing the correct option with "None of the above answers is correct" sharply degraded performance, indicating that reasoning ability still needs improvement.
The question set (100 items) is relatively small.
Majority voting was used to evaluate model performance, which may not reflect the reliability of individual responses.
The effectiveness of a given prompting strategy does not generalize to all models.