Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Tournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings

Created by
  • Haebom

Authors

Anirudh Nair, Adi Banerjee, Laurent Mombaerts, Matthew Hagen, Tarik Borogovac

Outline

This paper addresses the challenge of prompt engineering in unlocking the potential of large language models (LLMs), particularly for tasks that require subjective quality assessment, where explicit optimization objectives are difficult to define. Existing automatic prompt optimization methods are not effective for such problems, so the paper presents DEEVO, a novel prompt optimization framework that combines debate-based evaluation with Elo-based selection. DEEVO explores the discrete prompt space through intelligent crossover and strategic mutation operations while preserving semantic coherence. Using Elo ratings as a fitness signal, it pursues prompt improvement and population diversity simultaneously, and it outperforms existing methods on both open-ended and closed tasks without any correct-answer feedback. By combining the reasoning capabilities of LLMs with adaptive optimization, DEEVO contributes to continuously improving AI systems without requiring a predefined metric.
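The loop described above (debate-judged pairwise matches, Elo updates, then crossover and mutation over a prompt population) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the functions debate_judge, crossover, and mutate are hypothetical stand-ins for LLM calls, and the K-factor of 32 and initial rating of 1000 are conventional Elo defaults assumed here, not values taken from the paper.

```python
import random

K = 32  # Elo K-factor (conventional default, assumed; not specified in this summary)

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of prompt A against prompt B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """Update both ratings after one match; score_a is 1.0 (A wins), 0.0, or 0.5 (draw)."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

def debate_judge(prompt_a: str, prompt_b: str, task: str) -> float:
    """Hypothetical stand-in for debate-based evaluation: in practice, LLM agents
    would argue for the outputs produced by each prompt and a judge model would
    pick the winner. Here it is stubbed with a random outcome."""
    return random.choice([1.0, 0.0, 0.5])

def crossover(parent_a: str, parent_b: str) -> str:
    """Hypothetical stand-in for LLM-guided crossover: merge the strongest
    instructions of two parent prompts into a coherent child prompt."""
    return parent_a + "\n" + parent_b  # naive placeholder for an LLM rewrite

def mutate(prompt: str) -> str:
    """Hypothetical stand-in for strategic mutation: rephrase or extend one
    instruction while keeping the prompt semantically consistent."""
    return prompt  # identity placeholder

def evolve(population: list[str], task: str, generations: int = 5, matches_per_gen: int = 20):
    ratings = {p: 1000.0 for p in population}  # assumed initial rating
    for _ in range(generations):
        # Tournament phase: random pairings scored via structured debate.
        for _ in range(matches_per_gen):
            a, b = random.sample(list(ratings), 2)
            score_a = debate_judge(a, b, task)
            ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
        # Selection and variation: keep the top half, breed replacements.
        ranked = sorted(ratings, key=ratings.get, reverse=True)
        survivors = ranked[: max(2, len(ranked) // 2)]
        while len(survivors) < len(ranked):
            child = mutate(crossover(*random.sample(survivors[:4], 2)))
            survivors.append(child)
        ratings = {p: ratings.get(p, 1000.0) for p in survivors}
    return max(ratings, key=ratings.get), ratings
```

In a real pipeline each placeholder would wrap an LLM call: the judge runs a structured debate between the two prompts' outputs, while crossover and mutation ask the model to recombine or rephrase instructions, which is how the framework can operate without correct-answer feedback.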

Takeaways, Limitations

Takeaways:
Presents a novel method for effectively solving prompt optimization problems on complex tasks that require subjective quality assessment.
Improves practical usability by optimizing prompts effectively without correct-answer feedback.
Leverages the reasoning capabilities of LLMs, suggesting a path toward continuous improvement of AI systems.
Overcomes limitations of existing automatic prompt optimization methods.
Limitations:
DEEVO's performance may depend on the particular types of tasks or LLMs used.
Because of inherent limitations of Elo-based evaluation, there is no guarantee that the optimal prompt will always be found.
Further validation of generalizability through larger-scale experiments is needed.
The limited detail on the specific mechanisms of the debate-based evaluation may make reproducibility difficult to ensure.