Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications

Created by
  • Haebom

Author

Felix Härer

Outline

This paper reports an exploratory study of multi-agent systems that apply the reasoning capabilities of recent large language models (LLMs) to domain-specific applications. It focuses on combining reasoning techniques, code generation, and software execution across multiple specialized LLM agents. Whereas previous studies evaluate LLMs, reasoning techniques, and applications separately, this paper defines a specification for multi-agent LLM systems, introduces an agent schema language, and shows how to implement and evaluate the specification through a multi-agent system architecture and prototype. Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach: agents using LLMs from OpenAI and DeepSeek successfully completed question answering, server security, and network security tasks.
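To make the idea of specifying agents and evaluating them with test cases more concrete, here is a minimal Python sketch. It is not the paper's agent schema language or prototype: the names `AgentSpec`, `TaskCase`, `evaluate`, and `stub_run_agent`, the role labels, and the placeholder model names are all assumptions made for illustration, and the LLM call is replaced by a stub that returns canned answers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical agent entry: a name, a role, and the model that backs it."""
    name: str
    role: str
    model: str


@dataclass(frozen=True)
class TaskCase:
    """Hypothetical test case: a task prompt, the agent to run, and a pass check."""
    task: str
    agent: str
    check: Callable[[str], bool]


def evaluate(agents: List[AgentSpec],
             cases: List[TaskCase],
             run_agent: Callable[[AgentSpec, str], str]) -> Dict[str, bool]:
    """Run each test case with its assigned agent and record pass/fail per task."""
    by_name = {a.name: a for a in agents}
    results: Dict[str, bool] = {}
    for case in cases:
        output = run_agent(by_name[case.agent], case.task)
        results[case.task] = case.check(output)
    return results


def stub_run_agent(agent: AgentSpec, task: str) -> str:
    """Stand-in for a real LLM call (assumption): one canned answer per role."""
    canned = {
        "qa": "Port 22 is the default SSH port.",
        "server-security": "PermitRootLogin no",
    }
    return canned.get(agent.role, "")


# Placeholder model names; a real setup would name actual models here.
agents = [
    AgentSpec(name="qa_agent", role="qa", model="placeholder-model-a"),
    AgentSpec(name="server_agent", role="server-security", model="placeholder-model-b"),
]
cases = [
    TaskCase("Which port does SSH use by default?", "qa_agent",
             lambda out: "22" in out),
    TaskCase("Propose an sshd_config line that disables root login.", "server_agent",
             lambda out: "PermitRootLogin no" in out),
]

results = evaluate(agents, cases, stub_run_agent)
print(results)
```

Keeping the agent definitions and the test cases as plain data, separate from the function that calls the model, is one way to rerun the same cases against different models and compare the results.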

Takeaways, Limitations

Takeaways:
Introduces an agent schema language for multi-agent LLM systems, which makes system specifications explicit and enables systematic evaluation.
Provides a framework for jointly applying and evaluating LLMs, reasoning techniques, and applications through a multi-agent system architecture and prototype.
Demonstrates the feasibility and usability of the proposed system through practical test cases involving cybersecurity tasks.
Limitations:
The generality and extensibility of the proposed agent schema language and system architecture need further research.
The system's performance and reliability on diverse domains and more complex tasks need further evaluation.
How strongly the results depend on the characteristics of the LLMs used, and whether they generalize to other LLMs, needs further analysis.