In this paper, we present DiaFORGE, a framework that addresses two difficulties large language models (LLMs) face with enterprise tooling: distinguishing between APIs with highly similar functionality, and invoking an API correctly when the user's request is under-specified. DiaFORGE is a three-stage pipeline: (i) generating persona-driven, multi-turn disambiguation dialogues; (ii) fine-tuning models with their reasoning process included in supervision; and (iii) evaluating the resulting models' readiness for real-world settings. Training open-source models ranging from 3B to 70B parameters with DiaFORGE yields a 27% improvement in API invocation success rate over GPT-4o and a 49% improvement over Claude-3.5-Sonnet. In addition, we release the DiaBENCH benchmark, comprising 5,000 enterprise API specifications paired with verified dialogue data, to stimulate future research.
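As a concrete illustration of the task this pipeline targets, the Python sketch below frames the core disambiguation problem: given several near-identical API specifications and a partially specified request, a model's tool call is scored on whether it names the right API and supplies every required argument. All names here (ApiSpec, build_dialogue_prompt, score_invocation, and the sample specs) are hypothetical stand-ins for illustration, not DiaFORGE's actual interfaces or data.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ApiSpec:
    """Hypothetical stand-in for one enterprise API specification."""
    name: str
    description: str
    required_params: list[str]
    optional_params: list[str] = field(default_factory=list)

# Two deliberately similar APIs -- the disambiguation challenge the
# paper targets: the model must pick the right one from near-duplicates.
SPECS = [
    ApiSpec("create_invoice", "Create and post a customer invoice.",
            required_params=["customer_id", "amount"]),
    ApiSpec("create_invoice_draft", "Create an unposted draft invoice.",
            required_params=["customer_id", "amount"],
            optional_params=["due_date"]),
]

def build_dialogue_prompt(persona: str, user_turns: list[str]) -> str:
    """Assemble a persona-driven multi-turn prompt (stage i, sketched)."""
    tool_block = json.dumps([spec.__dict__ for spec in SPECS], indent=2)
    lines = [f"System: You are assisting a {persona}. "
             f"Available tools:\n{tool_block}"]
    lines += [f"User: {turn}" for turn in user_turns]
    lines.append("Assistant:")
    return "\n".join(lines)

def score_invocation(model_call: dict, expected: ApiSpec) -> bool:
    """Stage-iii style check: right tool name and all required arguments.

    A premature call on an under-specified request fails this check,
    which is why the dialogues reward asking a clarifying turn first."""
    if model_call.get("name") != expected.name:
        return False
    supplied = set(model_call.get("arguments", {}))
    return set(expected.required_params) <= supplied

if __name__ == "__main__":
    prompt = build_dialogue_prompt(
        persona="billing clerk at a retail firm",
        user_turns=["I need to bill customer C-812, but don't post it yet."],
    )
    print(prompt)
    # A correct (hypothetical) model response after disambiguation:
    call = {"name": "create_invoice_draft",
            "arguments": {"customer_id": "C-812", "amount": 120.0}}
    print("invocation ok:", score_invocation(call, SPECS[1]))
```

In this toy setup the request "don't post it yet" is the only cue separating the two near-duplicate tools, mirroring the kind of ambiguity the persona-driven dialogues are meant to surface and resolve.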