Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Using LLMs in Generating Design Rationale for Software Architecture Decisions

Created by
  • Haebom

Authors

Xiyu Zhou, Ruiyin Li, Peng Liang, Beiqi Zhang, Mojtaba Shahin, Zengyang Li, Chen Yang

Outline

This paper evaluates the performance of large language models (LLMs) in generating and recovering design rationale (DR) for software architecture decisions. For 100 architecture-related problems collected from Stack Overflow questions and GitHub issues and discussions, the authors generated DRs with five LLMs under three prompting strategies: zero-shot, chain-of-thought (CoT), and an LLM-based agent. The LLM-generated DRs were measured against DRs provided by human experts using precision, recall, and F1 scores. The reliability and applicability of the LLM-generated DRs were also analyzed through interviews with practitioners. The results show that LLMs are useful for DR generation, but their accuracy leaves room for improvement.
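The evaluation described above can be sketched as set-based precision/recall/F1 over DR points. This is a minimal illustration, not the paper's actual procedure: the `dr_scores` function and the exact-match comparison are assumptions (the paper's matching of generated points to expert points, likely manual or semantic, is not specified in this summary), and the example DR points are hypothetical.

```python
# Sketch: scoring LLM-generated design rationale (DR) points against
# expert-provided DR points. Exact set intersection is a naive stand-in
# for whatever matching the paper actually used.

def dr_scores(generated: set[str], expert: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of generated DR points vs. expert DR points."""
    matched = generated & expert  # DR points judged to agree with the experts
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(expert) if expert else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical DR points for one architectural decision
generated = {"low latency", "horizontal scalability", "vendor lock-in"}
expert = {"low latency", "team expertise", "horizontal scalability", "cost"}
p, r, f1 = dr_scores(generated, expert)
# p = 2/3, r = 2/4 = 0.5, f1 = 4/7 ≈ 0.571
```

Scores in the 0.27–0.39 range reported below would then mean that only a minority of generated points overlap with the expert-written rationale.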

Takeaways, Limitations

Takeaways:
Demonstrates that design rationale (DR) for software architecture decisions can be generated automatically using LLMs.
Compares multiple prompting strategies (zero-shot, CoT, LLM-based agent), suggesting ways to improve the effectiveness of LLM-based DR generation.
Incorporates practitioners' views on the reliability and practicality of LLM-generated DRs, pointing toward practical application methods.
Limitations:
The precision of LLM-generated DRs is low (precision: 0.267–0.278; F1-score: 0.351–0.389).
Some LLM-generated DR content is of uncertain accuracy or erroneous (4.12%–4.87% rated uncertain, 1.59%–3.24% rated erroneous).
Not all LLM-generated DR content matches the human experts' rationale; of the content not mentioned by the experts, much is still rated helpful (64.45%–69.42%), but some is uncertain or erroneous.
The dataset (100 problems) is relatively small, so generalizability requires further study.