Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Created by
  • Haebom

Authors

Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

Outline

This paper highlights that current benchmarks for evaluating large language models (LLMs) focus heavily on standardized writing styles and fail to reflect the diversity of human communication patterns. To test the hypothesis that LLMs may be vulnerable to non-standard inputs, the authors use persona-based LLM prompting to mimic diverse writing styles and analyze how variations in the style and format of prompts with identical semantic content affect LLM performance. The results show that specific writing styles consistently lead to lower or higher performance across models and tasks, regardless of model type, size, or recency. The study thus offers a scalable way to extend existing benchmarks and improve the external validity of LLM evaluations under linguistic variation.
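The core evaluation loop can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the openai Python client, and the persona descriptions, model names, and helper functions (rewrite_in_style, answer, evaluate) are all hypothetical placeholders.

```python
# Minimal sketch of persona-augmented benchmarking (illustrative, not the paper's code).
# Assumes the openai Python client (>= 1.0); persona texts and model names are placeholders.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "a non-native English speaker who writes short, direct sentences",
    "a teenager who writes informally, with slang and little punctuation",
    "a formal academic writer who uses long, precise sentences",
]

def rewrite_in_style(question: str, persona: str, model: str = "gpt-4o-mini") -> str:
    """Rewrite a benchmark question in a persona's style, preserving its meaning."""
    prompt = (
        f"Rewrite the following question in the voice of {persona}. "
        "Keep the semantic content identical; change only style and format.\n\n"
        f"{question}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer(question: str, model: str = "gpt-4o-mini") -> str:
    """Query the model under evaluation with an original or restyled question."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content

def evaluate(benchmark: list[tuple[str, str]], is_correct) -> dict[str, float]:
    """Compare accuracy on original items vs. each persona-restyled variant."""
    scores = {"original": 0.0, **{p: 0.0 for p in PERSONAS}}
    for question, gold in benchmark:
        scores["original"] += is_correct(answer(question), gold)
        for persona in PERSONAS:
            variant = rewrite_in_style(question, persona)
            scores[persona] += is_correct(answer(variant), gold)
    return {k: v / len(benchmark) for k, v in scores.items()}
```

Here is_correct stands for a task-specific scorer (e.g., exact match); comparing each persona's accuracy against the original items indicates which writing styles systematically help or hurt a given model.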

Takeaways, Limitations

Takeaways:
  • Highlights the lack of writing-style diversity in LLM evaluation benchmarks and experimentally demonstrates the vulnerability of LLMs to non-standard inputs.
  • Presents a method for analyzing the impact of writing-style variation on LLM performance through persona-based LLM prompting.
  • Finds that specific writing styles lead to consistent performance changes regardless of model type, size, or recency.
  • Offers a scalable approach to improving the validity of LLM evaluations by extending existing benchmarks.
Limitations:
  • The analysis of which specific writing styles affect performance, and how, may lack depth.
  • The range of LLMs and tasks used in the experiments may be limited.
  • Further research is needed on optimal settings for persona-based prompting (e.g., the number of personas and the level of detail in persona descriptions).
  • The proposed methodology's practical applicability and effectiveness on real benchmarks require further verification.