Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Created by
  • Haebom

Author

Ashutosh Hathidara, Julien Yu, Sebastian Schreiber

Outline

In this paper, we present DiaFORGE, a framework that addresses the difficulty large language models (LLMs) have in distinguishing between enterprise APIs with near-identical functionality and in invoking them correctly when user inputs are incomplete. DiaFORGE proceeds in three stages: generating persona-driven multi-turn dialogues in which the assistant must disambiguate among similar tools, fine-tuning open-source models (3B to 70B parameters) on these dialogues together with their reasoning traces, and evaluating each model dynamically in settings that mirror real-world deployment. Models trained with DiaFORGE improve tool-invocation success by 27 percentage points over GPT-4o and by 49 points over Claude-3.5-Sonnet. In addition, we release the DiaBENCH benchmark, which pairs 5,000 enterprise API specifications with rigorously validated, disambiguation-focused dialogues, to stimulate future research.
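To make the disambiguation setting concrete, below is a minimal sketch of the kind of training instance DiaFORGE targets: two near-identical tool specifications and a multi-turn dialogue in which the assistant asks a clarifying question before committing to a call. All tool names, fields, and the message format are hypothetical, invented for illustration; they are not taken from the paper or from the released DiaBENCH data.

```python
# Illustrative sketch only: tool names, schemas, and the dialogue format are
# hypothetical, not drawn from the paper or from DiaBENCH.

import json

# Two enterprise tools with deliberately similar functionality -- the
# situation DiaFORGE trains models to disambiguate.
TOOLS = [
    {
        "name": "create_purchase_order",
        "description": "Create a purchase order for goods from an external supplier.",
        "required": ["supplier_id", "item_id", "quantity"],
    },
    {
        "name": "create_sales_order",
        "description": "Create a sales order for goods requested by a customer.",
        "required": ["customer_id", "item_id", "quantity"],
    },
]

# A persona-driven multi-turn conversation: the user request is ambiguous and
# incomplete, so the assistant asks a clarifying question before calling a tool.
dialogue = [
    {"role": "user", "content": "I need an order for 50 units of item A-1001."},
    {"role": "assistant",
     "content": "Is this an order you are placing with a supplier, or an order "
                "a customer placed with us? I also need the supplier or customer ID."},
    {"role": "user", "content": "Supplier order, supplier S-77."},
    {"role": "assistant",
     "tool_call": {"name": "create_purchase_order",
                   "arguments": {"supplier_id": "S-77",
                                 "item_id": "A-1001",
                                 "quantity": 50}}},
]

# Each training instance pairs the candidate tool schemas with the dialogue.
print(json.dumps({"tools": TOOLS, "messages": dialogue}, indent=2))
```

The key property is that the correct tool cannot be chosen from the first user turn alone, so the supervised target rewards asking for the missing discriminating detail rather than guessing.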

Takeaways, Limitations

Takeaways:
Demonstrates that LLMs can learn to distinguish between APIs with similar functionality and to call them accurately even with incomplete inputs.
Provides a dynamic benchmark and evaluation methodology for measuring tool-calling performance in realistic deployment settings (see the evaluation-loop sketch after this list).
Releases an open dataset of 5,000 enterprise API specifications and validated dialogues to support follow-up research.
Models fine-tuned with DiaFORGE substantially improve API call success rates over existing proprietary models.
Limitations:
Further validation of the generalizability of the DiaBENCH benchmark is needed.
Generalization performance across different types of enterprise APIs requires further study.
Additional research may be needed on the scalability and maintainability of the DiaFORGE framework.
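As referenced in the Takeaways above, here is a minimal sketch of a dynamic tool-calling evaluation loop in the spirit of DiaBENCH. The episode format, the `model` callable, and the exact-match success criterion are assumptions made for illustration; the paper's actual harness may differ.

```python
# Minimal sketch, assuming each held-out episode carries the candidate tools,
# the conversation so far, and a reference tool invocation. Not the paper's
# actual evaluation harness.

def evaluate(model, episodes):
    """`model` maps (tools, message history) -> an assistant reply dict."""
    successes = 0
    for ep in episodes:
        reply = model(ep["tools"], ep["context"])
        call = reply.get("tool_call")
        # Count success only when the chosen endpoint and every argument
        # match the reference invocation exactly.
        if call == ep["expected_call"]:
            successes += 1
    return successes / len(episodes)

# Usage with a trivial stand-in model that always emits one fixed call:
episodes = [{
    "tools": ["create_purchase_order", "create_sales_order"],
    "context": [{"role": "user", "content": "Order 50 units from supplier S-77."}],
    "expected_call": {"name": "create_purchase_order",
                      "arguments": {"supplier_id": "S-77", "quantity": 50}},
}]
stub = lambda tools, context: {"tool_call": {
    "name": "create_purchase_order",
    "arguments": {"supplier_id": "S-77", "quantity": 50}}}
print(evaluate(stub, episodes))  # 1.0
```

Scoring by exact match on both the endpoint and its arguments is one simple choice; a production harness would likely also track clarifying-question behavior across turns.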