Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Created by
  • Haebom

Authors

Ruiyan Qi, Congding Wen, Weibo Zhou, Jiwei Li, Shangsong Liang, Lingbo Li

Outline

This paper proposes LETToT (Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought), a label-free evaluation framework that uses expert-derived reasoning structures in place of labeled data. It targets two obstacles to evaluating large language models (LLMs) in specialized domains such as tourism: the high cost of annotated benchmarks and persistent problems such as hallucination. LETToT iteratively refines and validates hierarchical Tree-of-Thought (ToT) components against generic quality dimensions and expert feedback. In the experiments, the systematically optimized expert ToT achieves relative quality gains of 4.99-14.15% over baselines. The authors then use the optimized ToT to evaluate models of various sizes (32B-671B parameters) and find that scaling laws persist in this specialized domain (DeepSeek-V3 performs best), while reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) narrow the gap. Among models with fewer than 72B parameters, those with explicit reasoning architectures outperform their counterparts in accuracy and conciseness (p<0.05). The study establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a compelling alternative to annotated benchmarks.
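To make the idea of a hierarchical expert ToT and a "relative quality gain" more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `ToTNode` class, the keyword-based `mentions` checks, the weights, and the example responses are all hypothetical stand-ins chosen for illustration. In the paper the criteria come from domain experts, and the summary does not say how the leaves are scored. The sketch only shows how leaf-level scores could be aggregated up a weighted tree and how a relative gain over a baseline could be computed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToTNode:
    """One node of a hypothetical expert rubric tree (illustrative only)."""
    name: str
    weight: float = 1.0
    # Leaf scorer: maps a response to a score in [0, 1]. Assumed, not from the paper.
    check: Optional[Callable[[str], float]] = None
    children: List["ToTNode"] = field(default_factory=list)


def mentions(*keywords: str) -> Callable[[str], float]:
    """Toy leaf check: 1.0 if any keyword appears in the response, else 0.0."""
    return lambda text: 1.0 if any(k in text.lower() for k in keywords) else 0.0


def score(node: ToTNode, response: str) -> float:
    """Weighted average of child scores; leaves call their own check."""
    if not node.children:
        if node.check is None:
            raise ValueError(f"leaf '{node.name}' has no check")
        return node.check(response)
    total_weight = sum(child.weight for child in node.children)
    weighted = sum(child.weight * score(child, response) for child in node.children)
    return weighted / total_weight


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement over a baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical two-level tree for a travel-advice answer.
tree = ToTNode("itinerary quality", children=[
    ToTNode("logistics", weight=2.0, children=[
        ToTNode("transport", check=mentions("train", "bus")),
        ToTNode("hours", check=mentions("opening hours", "opens")),
    ]),
    ToTNode("practical advice", weight=1.0, check=mentions("weather", "ticket")),
])

baseline_response = "Take a bus there."
improved_response = "Take the train to the old town; the museum opens at 9."

baseline_score = score(tree, baseline_response)
improved_score = score(tree, improved_response)
print(f"baseline={baseline_score:.3f} improved={improved_score:.3f} "
      f"gain={relative_gain(improved_score, baseline_score):.1f}%")
```

No labeled reference answers appear anywhere in the sketch: the tree itself carries the evaluation criteria, which is the sense in which such a scheme is label-free. A real system would presumably replace the keyword checks with expert-validated judgments.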

Takeaways, Limitations

Takeaways:
Presents LETToT, a novel label-free framework for evaluating LLMs in specialized domains such as tourism.
Reduces dependence on annotated data by using reasoning structures built from expert knowledge.
Analyzes scaling laws and the effectiveness of reasoning architectures by comparing LLMs of various sizes.
Offers an alternative evaluation method that avoids the cost of building annotated benchmarks.
Suggests that reasoning-enhanced smaller models can narrow the performance gap with larger ones.
Limitations:
The performance of LETToT may depend on the quality of the reasoning structures provided by the experts.
Generalization may be limited because the study covers only one domain (tourism).
Further research is needed to ensure the objectivity of evaluation metrics and expert feedback.
Scalability to other domains needs to be verified.