Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A Framework for Testing and Adapting REST APIs as LLM Tools

Created by
  • Haebom

Authors

Jayachandu Bandlamudi, Ritwik Chaudhuri, Neelamadhav Gantayat, Sambit Ghosh, Kushal Mukherjee, Prerna Agarwal, Renuka Sindhgatta, Sameep Mehta

Outline

This paper presents a testing framework for large language model (LLM)-based autonomous agents, designed to overcome the challenges posed by the complex input schemas and detailed responses of enterprise APIs so that agents can use them to perform complex tasks. The framework systematically evaluates enterprise APIs wrapped as Python tools: it generates data-aware test cases, translates them into natural-language commands, and assesses whether the agent correctly invokes the tool, handles its inputs, and processes its responses. The authors generate over 2,400 test cases across diverse domains and develop a classification scheme for common errors, including input misinterpretation, output-handling failures, and schema mismatches, to support debugging and tool improvement. Ultimately, the framework provides a systematic approach to turning enterprise APIs into reliable tools for agent-based applications.
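Below is a minimal, hypothetical sketch of the kind of workflow the outline describes: an enterprise REST endpoint wrapped as a Python tool with an explicit input schema, a data-aware test case paired with a natural-language command, and a check of the agent's tool call against that case. All names (ToolSpec, ToolTestCase, get_order_status, the example URL) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): wrap a REST endpoint as a
# Python tool, define a data-aware test case, and score an agent's tool call.

from dataclasses import dataclass, field
from typing import Any
import json
import urllib.request


@dataclass
class ToolSpec:
    """Describes one API-backed tool the agent may call."""
    name: str
    description: str
    input_schema: dict[str, str]   # parameter name -> expected type
    endpoint: str                  # REST endpoint template


@dataclass
class ToolTestCase:
    """A data-aware test case: concrete inputs plus the NL command derived from them."""
    tool: ToolSpec
    inputs: dict[str, Any]
    nl_command: str
    expected_fields: list[str] = field(default_factory=list)  # response fields the agent should surface


def call_tool(spec: ToolSpec, inputs: dict[str, Any]) -> dict[str, Any]:
    """Invoke the wrapped REST endpoint and return the parsed JSON response."""
    url = spec.endpoint.format(**inputs)
    with urllib.request.urlopen(url) as resp:  # assumes a simple GET endpoint
        return json.loads(resp.read().decode())


def evaluate(case: ToolTestCase, agent_call: dict[str, Any]) -> list[str]:
    """Compare the agent's tool call against the test case; return error labels."""
    errors = []
    if agent_call.get("tool") != case.tool.name:
        errors.append("wrong_tool_selected")
    args = agent_call.get("arguments", {})
    if set(args) - set(case.tool.input_schema):        # unknown parameter names
        errors.append("schema_mismatch")
    for param, expected in case.inputs.items():
        if args.get(param) != expected:                # wrong values extracted from the command
            errors.append(f"input_misinterpretation:{param}")
    return errors


# Example: a hypothetical order-status API exercised with one test case.
order_tool = ToolSpec(
    name="get_order_status",
    description="Look up the status of a customer order by ID.",
    input_schema={"order_id": "string"},
    endpoint="https://api.example.com/orders/{order_id}/status",
)

case = ToolTestCase(
    tool=order_tool,
    inputs={"order_id": "A-1042"},
    nl_command="Can you check the current status of order A-1042?",
    expected_fields=["status"],
)

# A (mocked) agent tool call that the framework would score.
agent_call = {"tool": "get_order_status", "arguments": {"order_id": "A-1042"}}
print(evaluate(case, agent_call))  # -> [] means the call matched the test case
```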

Takeaways, Limitations

Takeaways:
Provides a systematic framework for evaluating how reliably LLM-based agents can use enterprise APIs.
Enables efficient API evaluation through data-aware test case generation and natural-language command translation.
Generates extensive test cases across diverse domains and develops a general error classification scheme (see the sketch after this list).
Supports debugging of API errors and improvement of tools.
Improves reliability for agent-based application development.
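As a rough illustration of the error classification mentioned above, here is a hypothetical sketch of how the categories named in the outline (input misinterpretation, output-handling failures, schema mismatches) might be encoded and tallied across test results to support debugging; the enum names and classify_failure helper are assumptions for illustration, not the paper's actual scheme.

```python
# Hypothetical error taxonomy mirroring the categories named in the summary.
# Names and logic are illustrative only, not the paper's actual scheme.

from collections import Counter
from enum import Enum


class ToolErrorCategory(Enum):
    INPUT_MISINTERPRETATION = "agent passed wrong or malformed arguments"
    OUTPUT_FAILURE = "agent mishandled or ignored the API response"
    SCHEMA_MISMATCH = "tool call did not match the declared input schema"


def classify_failure(agent_args: dict, expected_args: dict, schema: dict,
                     output_correct: bool) -> list[ToolErrorCategory]:
    """Assign coarse categories to one failed test case for later aggregation."""
    labels = []
    if set(agent_args) - set(schema):       # unknown parameter names
        labels.append(ToolErrorCategory.SCHEMA_MISMATCH)
    if agent_args != expected_args:         # wrong values extracted from the command
        labels.append(ToolErrorCategory.INPUT_MISINTERPRETATION)
    if not output_correct:                  # response not surfaced correctly
        labels.append(ToolErrorCategory.OUTPUT_FAILURE)
    return labels


# Tallying per-category counts over many test cases supports the debugging
# use case described in the takeaways.
results = [
    classify_failure({"order_id": "A-1042"}, {"order_id": "A-1042"},
                     {"order_id": "string"}, output_correct=False),
    classify_failure({"orderId": "A-1042"}, {"order_id": "A-1042"},
                     {"order_id": "string"}, output_correct=True),
]
print(Counter(label for labels in results for label in labels))
```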
Limitations:
The current framework is limited to Python tools, which may restrict its applicability to tools implemented in other languages.
The accuracy of the evaluation results may vary with the scope and depth of the test cases.
Dependence on specific enterprise systems and APIs may make the results difficult to generalize.
The comprehensiveness and accuracy of the error classification scheme may require further validation.