This paper comprehensively evaluates the effectiveness of Tool-Integrated Reasoning (TIR) for improving the reasoning performance of large language models (LLMs). Because LLMs struggle with accurate computation under conventional Chain-of-Thought (CoT) prompting, we leverage TIR and present the ReasonZoo benchmark, which spans nine diverse reasoning categories. We also propose two new metrics for evaluating reasoning efficiency: Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC). Experimental results show that TIR-enabled models outperform non-TIR models on both mathematical and non-mathematical tasks, and they achieve better PAC and AUC-PCC scores, indicating more efficient reasoning. These findings suggest that TIR can enhance the ability of LLMs to solve complex reasoning tasks.
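The abstract names PAC and AUC-PCC without defining them. As a rough sketch only, an area-under-a-performance-cost-curve metric can be computed with the trapezoidal rule, as below; the function name `auc_pcc`, the token-count cost proxy, and the [0, 1] cost normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def auc_pcc(costs, accuracies):
    """Illustrative Area Under the Performance-Cost Curve.

    costs: inference costs (e.g., generated-token counts) in
    increasing order; accuracies: task accuracy measured at each
    cost budget. The normalization scheme is an assumption, not
    the paper's definition.
    """
    x = np.asarray(costs, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    # Normalize cost to [0, 1] so areas are comparable across models.
    x = (x - x[0]) / (x[-1] - x[0])
    # Trapezoidal rule: area under the accuracy-vs-cost curve.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))

# Example: a model that reaches higher accuracy at lower cost
# yields a larger AUC-PCC.
print(auc_pcc([100, 500, 1000], [0.30, 0.55, 0.60]))
```

Under this reading, a model whose accuracy rises quickly at small cost budgets earns a larger AUC-PCC than one that needs many more tokens to reach the same accuracy, which is consistent with the abstract's use of the metric as a reasoning-efficiency measure.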