Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Created by
  • Haebom

Author

Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

Outline

This paper systematically investigates why open-source large language models (LLMs) struggle with data analysis and how their capability can be improved. Using seed datasets that cover diverse, realistic scenarios, the authors evaluate models along three dimensions: data understanding, code generation, and strategic planning. The evaluation yields three main findings: the quality of strategic planning is the primary determinant of model performance; interaction design and task complexity significantly affect reasoning capability; and data quality matters more than data diversity for achieving strong performance. Building on these insights, the authors develop a data synthesis methodology that significantly improves the analytical reasoning capability of open-source LLMs.
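The three-dimension evaluation described above can be pictured as a simple scoring harness. This is a minimal illustrative sketch only: the aspect names mirror the paper, but the function names, score format, and aggregation are assumptions, not the authors' actual benchmark code.

```python
# Hypothetical harness for aggregating per-aspect scores across a seed
# dataset of scenarios. Scores are assumed to be pre-computed floats in
# [0.0, 1.0]; the paper's real scoring procedure is not reproduced here.

ASPECTS = ("data_understanding", "code_generation", "strategic_planning")

def score_response(response: dict) -> dict:
    """Extract one scenario's score for each evaluation aspect."""
    return {aspect: float(response.get(aspect, 0.0)) for aspect in ASPECTS}

def evaluate(responses: list[dict]) -> dict:
    """Average per-aspect scores over all scenarios in the seed dataset."""
    totals = {aspect: 0.0 for aspect in ASPECTS}
    for response in responses:
        scores = score_response(response)
        for aspect in ASPECTS:
            totals[aspect] += scores[aspect]
    n = max(len(responses), 1)  # avoid division by zero on an empty set
    return {aspect: totals[aspect] / n for aspect in ASPECTS}
```

Keeping the aspects separate, rather than collapsing them into one number, is what lets an analysis like this attribute performance gaps to strategic planning specifically.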

Takeaways, Limitations

Takeaways:
  • Presents a data synthesis methodology that improves the data analysis capabilities of open-source LLMs.
  • Highlights the importance of strategic planning, interaction design, and data quality for model performance.
  • Shows that data quality has a greater impact on performance than data diversity.
Limitations:
  • The generalizability of the findings is bounded by the seed dataset used in the study.
  • The applicability of the proposed data synthesis methodology to other open-source LLMs and to a wider range of data analysis tasks remains to be verified.
  • Further research is needed on how strategic planning, interaction design, and data quality interact and jointly affect performance.