Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Created by
  • Haebom

Authors

Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, Lingbo Li

Outline

This paper presents LETToT (Label-Free Evaluation of LLMs on Tourism using Expert Tree-of-Thought), a framework that addresses the difficulty of evaluating large language models (LLMs) in specialized domains such as tourism, where annotated benchmarks are expensive to build and hallucinations persist. Instead of relying on labeled data, LETToT evaluates LLMs using expert-derived reasoning structures: hierarchical Tree-of-Thought (ToT) components are iteratively refined and validated against generic quality dimensions and expert feedback, and the optimized expert ToT is then used to evaluate models of various sizes (32B to 671B parameters). The results show that scaling laws hold within the domain (DeepSeek-V3 performs best), while smaller reasoning-enhanced models (e.g., DeepSeek-R1-Distill-Llama-70B) narrow the performance gap. Furthermore, for models below 72B, explicit reasoning architectures outperform their counterparts in both accuracy and conciseness (p<0.05). This study establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to annotated benchmarks.
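To make the label-free idea concrete, here is a minimal, purely illustrative sketch: it does not reproduce the paper's actual expert ToT or judging procedure. It assumes an expert-derived tree whose leaves are quality criteria, and uses a trivial keyword checker (the `EXPERT_TOT` rubric and `score_answer` function are hypothetical stand-ins for a real expert tree and judge model).

```python
# Hypothetical LETToT-style label-free scoring sketch.
# An expert-derived tree maps quality dimensions to leaf criteria;
# an answer is scored by the fraction of criteria it satisfies,
# requiring no gold-labeled reference answer.

EXPERT_TOT = {  # illustrative tourism rubric, not the paper's actual ToT
    "accuracy": ["opening hours", "ticket"],
    "usefulness": ["route", "nearby"],
}

def score_answer(answer: str, tot: dict) -> float:
    """Return the fraction of leaf criteria the answer satisfies."""
    leaves = [kw for kws in tot.values() for kw in kws]
    hits = sum(kw in answer.lower() for kw in leaves)
    return hits / len(leaves)

answer = ("The museum's opening hours are 9-17; a ticket costs 10 EUR. "
          "Take the riverside route.")
print(score_answer(answer, EXPERT_TOT))  # 0.75
```

In the paper's pipeline, the keyword check would instead be a judgment against each node of the refined expert ToT, but the structural point is the same: the rubric itself, not labeled data, anchors the evaluation.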

Takeaways, Limitations

Takeaways:
Presents a new paradigm (LETToT) for evaluating LLMs in domains such as tourism where annotated data is hard to obtain.
Demonstrates the feasibility of cost-effective, label-free LLM evaluation that leverages expert knowledge.
Offers guidance for LLM development by comparing performance across model sizes and analyzing the effectiveness of reasoning architectures.
Experimentally demonstrates the superiority of explicit reasoning architectures in models below 72B.
Limitations:
LETToT's performance may depend on the quality of the reasoning structure provided by the experts.
Because the study is limited to a single domain (tourism), further research is needed to determine generalizability to other domains.
Collecting expert feedback and iteratively refining the ToT components can require significant time and effort.