
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Created by
  • Haebom

Author

Lingbo Li, Anuradha Mathrani, Teo Susnjak

Outline

This study evaluated the practical performance of automated data extraction from randomized controlled trials (RCTs) for meta-analysis. Three large language models (Gemini-2.0-flash, Grok-3, GPT-4o-mini) were applied across three medical fields (hypertension, diabetes, and orthopedics) to extract statistical outcomes, risk-of-bias assessments, and study-level characteristics. Four prompting strategies (default prompts, self-reflective prompts, model ensembles, and custom prompts) were tested to explore ways to improve extraction quality. All models showed high precision but low recall due to omission of key information, and custom prompts proved the most effective, improving recall by up to 15%. Based on these findings, the authors propose a three-level LLM usage guideline that matches the degree of automation to task complexity and risk, offering practical advice for automating data extraction in real-world meta-analyses and aiming to balance expert supervision with LLM efficiency through goal-oriented, task-specific automation.
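The high-precision/low-recall pattern described above can be illustrated with a minimal sketch (hypothetical field names and values, not the authors' evaluation code): extracted fields are compared against a gold standard, so omitted fields lower recall while spurious or wrong fields lower precision.

```python
def field_scores(gold: dict, extracted: dict):
    """Score LLM-extracted fields against a gold-standard record.

    A field counts as correct only if it is present and its value
    matches; missing fields reduce recall, while extra or wrong
    fields reduce precision.
    """
    correct = sum(1 for k, v in extracted.items() if gold.get(k) == v)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical RCT record: the model extracts correct values but
# omits 'sample_size', so precision stays high while recall drops.
gold = {"mean_diff": -5.2, "ci_lower": -8.1, "ci_upper": -2.3, "sample_size": 120}
extracted = {"mean_diff": -5.2, "ci_lower": -8.1, "ci_upper": -2.3}
precision, recall = field_scores(gold, extracted)
print(precision, recall)  # → 1.0 0.75
```

This mirrors why omission of key information, rather than incorrect values, is the dominant failure mode the summary reports.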

Takeaways, Limitations

Takeaways:
  • Custom prompts can improve the recall and reproducibility of RCT data extraction with LLMs.
  • A three-level LLM usage guideline that matches the degree of automation to task complexity and risk offers practical help for real-world meta-analyses.
  • LLM-based automation can increase the efficiency of meta-analysis.
  • The guideline presents a balanced approach to combining expert supervision with LLM efficiency.
Limitations:
  • Recall was low across all models; omission of key information was a persistent problem.
  • Because the results are limited to specific medical fields (hypertension, diabetes, and orthopedics), generalizability may be limited.
  • Further research on a wider range of LLMs and prompting strategies is needed.