Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

Created by
  • Haebom

Authors

Peng Di, Faqiang Chen, Xiao Bai, Hongjun Yang, Qingfeng Li, Ganglin Wei, Jian Mou, Feng Shi, Keting Chen, Peng Tang, Zhitao Shen, Zheng Li, Wenhui Shi, Junwei Guo, Hang Yu

Outline

The growing complexity of modern software places a heavy operational burden on Site Reliability Engineering (SRE) teams, creating a need for AI-powered automation that mimics expert diagnostic reasoning. Existing solutions either lack deep causal inference or are not tailored to the specialized, investigative workflows unique to SRE. This paper presents OpenDerisk, an open-source multi-agent framework specialized for SRE. OpenDerisk integrates a diagnosis-specific collaboration model, a pluggable inference engine, a knowledge engine, and a standardized protocol (MCP) so that expert agents can collaboratively solve complex, multi-domain problems. Large-scale evaluations show that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Its large-scale production deployment at Ant Group demonstrates industrial-grade scalability and practical impact.
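
To make the architecture described above more concrete, here is a minimal Python sketch of expert agents that share a pluggable inference engine and a small per-domain knowledge store. It is an illustration only, not OpenDerisk's actual API: the names `InferenceEngine`, `KeywordEngine`, `ExpertAgent`, and `collaborate` are all hypothetical, the keyword match stands in for real diagnostic reasoning, and the MCP layer is not modeled.

```python
from dataclasses import dataclass, field
from typing import Protocol


class InferenceEngine(Protocol):
    """Hypothetical plug-in interface: any object with a matching `infer` fits."""

    def infer(self, symptom: str, evidence: list[str]) -> str: ...


class KeywordEngine:
    """Toy engine (assumption): returns the first evidence entry mentioning the symptom."""

    def infer(self, symptom: str, evidence: list[str]) -> str:
        for item in evidence:
            if symptom.lower() in item.lower():
                return item
        return "no finding"


@dataclass
class ExpertAgent:
    """Hypothetical domain expert: an engine plus that domain's knowledge entries."""

    domain: str
    engine: InferenceEngine
    knowledge: list[str] = field(default_factory=list)

    def diagnose(self, symptom: str) -> str:
        # Tag the engine's result with the agent's domain.
        return f"[{self.domain}] {self.engine.infer(symptom, self.knowledge)}"


def collaborate(agents: list[ExpertAgent], symptom: str) -> list[str]:
    """Ask every agent, then keep only the agents that found something."""
    findings = [agent.diagnose(symptom) for agent in agents]
    return [f for f in findings if not f.endswith("no finding")]


agents = [
    ExpertAgent("database", KeywordEngine(), ["timeout on connection pool after 30s"]),
    ExpertAgent("network", KeywordEngine(), ["packet loss within normal range"]),
]
report = collaborate(agents, "timeout")
```

Because `InferenceEngine` is a structural protocol, swapping `KeywordEngine` for another engine requires no change to `ExpertAgent`, which is the sense of "pluggable" assumed in this sketch.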

Takeaways, Limitations

Takeaways:
Develops an AI-based automation framework specialized for SRE tasks.
Improves complex problem-solving capabilities through a multi-agent system.
Verifies effectiveness through large-scale deployment in a real industrial environment.
Contributes to technology sharing and development by releasing the framework as open source.
Limitations:
The paper does not mention any specific limitations.