Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Created by
  • Haebom

Authors

Yufeng Zhao, Junnan Liu, Hongwei Liu, Dongsheng Zhu, Yuan Shen, Songyang Zhang, Kai Chen

Outline

This paper comprehensively evaluates how effectively Tool-Integrated Reasoning (TIR) improves the reasoning performance of large language models (LLMs). Because LLMs struggle with precise computation when they rely on conventional Chain-of-Thought (CoT) reasoning alone, the authors study TIR and introduce the ReasonZoo benchmark, which covers nine diverse reasoning categories. They also propose two metrics for evaluating reasoning efficiency: Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC). Experiments show that TIR-enabled models outperform their non-TIR counterparts on both mathematical and non-mathematical tasks, and that they score better on PAC and AUC-PCC, indicating more efficient reasoning. These results suggest that TIR can strengthen the ability of LLMs to solve complex reasoning tasks.
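This summary does not give the formulas for PAC or AUC-PCC, so the following is only a rough sketch of the general idea behind an area-under-a-performance-cost-curve score, not the paper's definition. It assumes each model is measured at several cost budgets (for example, token counts) with an accuracy at each one, and it integrates accuracy over cost with the trapezoidal rule, normalized by the cost range. The function name, the choice of tokens as the cost unit, and the sample numbers are all hypothetical.

```python
def area_under_performance_cost_curve(points):
    """Normalized trapezoidal area under a (cost, performance) curve.

    ASSUMPTION: this is a generic illustration, not the AUC-PCC formula
    from the paper, which this summary does not state.

    points: iterable of (cost, performance) pairs, e.g. (tokens, accuracy).
    Returns the average performance across the observed cost range.
    """
    pts = sorted(points)  # order the measurements by increasing cost
    if len(pts) < 2:
        raise ValueError("need at least two (cost, performance) points")

    cost_span = pts[-1][0] - pts[0][0]
    if cost_span == 0:
        raise ValueError("all points share the same cost")

    area = 0.0
    # Sum the trapezoid between each pair of neighbouring measurements.
    for (c0, p0), (c1, p1) in zip(pts, pts[1:]):
        area += (c1 - c0) * (p0 + p1) / 2.0

    return area / cost_span


# Hypothetical measurements: (token budget, accuracy) for two model variants.
with_tools = [(100, 0.4), (200, 0.6), (400, 0.7)]
without_tools = [(100, 0.2), (200, 0.4), (400, 0.6)]

print(area_under_performance_cost_curve(with_tools))
print(area_under_performance_cost_curve(without_tools))
```

Under this sketch, a model that reaches high accuracy at low cost scores higher than one that needs a larger budget to reach the same accuracy, which is the kind of efficiency comparison the two metrics are described as supporting.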

Takeaways, Limitations

Takeaways:
The experiments show that Tool-Integrated Reasoning (TIR) improves the overall reasoning ability of LLMs.
The effectiveness of TIR is confirmed on both mathematical and non-mathematical problems.
The proposed metrics, PAC and AUC-PCC, are useful for evaluating reasoning efficiency.
TIR reduces "overthinking" in LLMs and makes the reasoning process more efficient.
Limitations:
Further research is needed on the generalizability and scalability of the ReasonZoo benchmark.
Further research is needed to determine the generalizability of TIR across different types of tools and LLMs.
Further research is needed on how to interpret and apply the proposed metrics, PAC and AUC-PCC.