This paper proposes MUSE (Multi-LLM Uncertainty via Subset Ensembles), an uncertainty quantification method that leverages model diversity to address the inconsistency of large language models (LLMs). MUSE uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs, yielding more reliable uncertainty estimates. The approach rests on the assumption that LLMs make complementary predictions because of their different training procedures and the Zipfian distribution of language. On binary prediction tasks, MUSE achieves better calibration and prediction performance than single models and simple ensembles. The authors also explore combining MUSE with chain-of-thought distillation to improve LLM calibration. MUSE is available on GitHub.
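As a rough illustration of the idea described above, the following is a minimal sketch of a Jensen-Shannon-Divergence-based subset ensemble for a single binary query; it is not the paper's implementation. The function names, the fixed subset size, and the selection rule (choosing the subset with the lowest mean pairwise JSD) are assumptions made for illustration only.

```python
import itertools
import numpy as np
from scipy.spatial.distance import jensenshannon


def mean_pairwise_jsd(probs):
    """Mean pairwise Jensen-Shannon divergence among predictive
    distributions, given as an array of shape (n_models, 2)."""
    pairs = itertools.combinations(range(len(probs)), 2)
    # scipy returns the JS *distance* (the square root of the divergence),
    # so square it to recover the divergence itself.
    return np.mean([jensenshannon(probs[i], probs[j]) ** 2 for i, j in pairs])


def subset_ensemble_estimate(model_probs, subset_size=3):
    """Illustrative subset-ensemble uncertainty estimate for one query.

    model_probs: array of shape (n_models, 2), each row an LLM's
    (P(class=0), P(class=1)). Returns the averaged distribution of the
    selected subset and the subset indices.
    """
    best_subset, best_score = None, np.inf
    for subset in itertools.combinations(range(len(model_probs)), subset_size):
        score = mean_pairwise_jsd(model_probs[list(subset)])
        # Assumption: prefer the subset whose members agree most
        # (lowest mean pairwise JSD); the paper's actual criterion may differ.
        if score < best_score:
            best_subset, best_score = subset, score
    return model_probs[list(best_subset)].mean(axis=0), best_subset


# Example: five hypothetical LLMs answering one yes/no question.
probs = np.array([
    [0.20, 0.80],
    [0.25, 0.75],
    [0.30, 0.70],
    [0.90, 0.10],   # an outlier model
    [0.22, 0.78],
])
ensemble_prob, subset = subset_ensemble_estimate(probs)
print(f"chosen subset: {subset}, ensemble P(class=1) = {ensemble_prob[1]:.3f}")
```

The averaged subset distribution can then be read as a calibrated probability for the positive class; how the subset score trades off agreement against diversity is exactly the design choice the paper studies.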
Takeaways, Limitations
• Takeaways:
◦ Leveraging model diversity across LLMs improves the accuracy of uncertainty estimation.
◦ The Jensen-Shannon-Divergence-based MUSE method outperforms single-model baselines and simple ensembles.
◦ Combining MUSE with chain-of-thought distillation shows potential for improving LLM calibration.
◦ The open-source release of MUSE enables further research and practical use.
• Limitations:
◦ Experimental results are presented only for binary classification; further work is needed to establish generalizability to multi-class classification and other task types.
◦ MUSE's performance gains may be limited to specific datasets and models; its generalizability across diverse settings remains to be verified.
◦ There is no comparative analysis using information-theoretic metrics other than Jensen-Shannon Divergence.
◦ Further research is needed to optimize the LLM subset selection strategy.