Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Created by
  • Haebom

Author

Haotian Wu, Shufan Jiang, Mingyu Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Chengwei Qin

FURINA-Builder: Introducing the Role-Playing Benchmark Building Pipeline

Outline

This paper introduces FURINA-Builder, a novel multi-agent collaboration pipeline that automatically builds fully customizable role-playing (RP) benchmarks at arbitrary scale, addressing the limitations of existing benchmarks as large language models (LLMs) for RP tasks continue to advance. The pipeline can evaluate arbitrary characters across diverse scenarios and prompt formats, making it the first adaptive benchmark builder in the RP field.

FURINA-Builder simulates dialogues between a test character and other characters drawn from a carefully constructed character-scene pool. An LLM judge selects fine-grained evaluation dimensions and refines the test character's responses into final test utterances. Using this pipeline, the authors build FURINA-Bench, a new comprehensive RP benchmark featuring both established and synthetic test characters, each evaluated under dimension-specific criteria. Human evaluation and a preliminary separability analysis validate the pipeline and benchmark design.

Extensive evaluation of state-of-the-art LLMs shows that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently elicit stronger performance than synthetic ones, and reasoning ability further widens this gap. Interestingly, model scale does not monotonically reduce hallucinations. More importantly, the authors identify a new trade-off for reasoning LLMs: reasoning improves RP performance but also increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability across all LLMs. These results demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
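The loop described above (simulate a dialogue from a character-scene pool, then have an LLM judge pick an evaluation dimension and refine the test character's reply into a test utterance) could be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: all names (`Character`, `TestItem`, `build_test_items`) and the round-robin dimension choice are assumptions for the sake of a runnable example, and the `llm` callable stands in for any real chat-completion call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch only; the paper does not publish this API.

@dataclass
class Character:
    name: str
    profile: str  # persona description drawn from the character-scene pool

@dataclass
class TestItem:
    dimension: str      # fine-grained evaluation dimension chosen by the judge
    utterance: str      # refined final test utterance of the test character
    history: List[str]  # dialogue context leading up to the utterance

def build_test_items(
    test_char: Character,
    partner: Character,
    scene: str,
    llm: Callable[[str], str],   # stand-in for any LLM completion function
    dimensions: List[str],
    turns: int = 3,
) -> List[TestItem]:
    """Simulate a dialogue, then let an LLM 'judge' select a dimension
    and refine the test character's draft into the final test utterance."""
    history: List[str] = [f"Scene: {scene}"]
    items: List[TestItem] = []
    for t in range(turns):
        # Partner character speaks first in each turn.
        history.append(f"{partner.name}: "
                       + llm(f"As {partner.profile}, continue: {history[-1]}"))
        # Test character drafts a reply.
        draft = llm(f"As {test_char.profile}, reply to: {history[-1]}")
        # Judge selects a dimension (round-robin here, for determinism).
        dim = dimensions[t % len(dimensions)]
        # Judge refines the draft into the final test utterance.
        refined = llm(f"Refine this reply to probe '{dim}': {draft}")
        history.append(f"{test_char.name}: {refined}")
        items.append(TestItem(dim, refined, list(history)))
    return items
```

With a real LLM backend plugged into `llm`, each `TestItem` would pair a dialogue context with a dimension-targeted utterance, which is the shape of benchmark instance the pipeline produces.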

Takeaways, Limitations

Takeaways:
FURINA-Builder addresses the limitations of existing benchmarks by presenting a novel pipeline that automatically builds custom RP benchmarks.
FURINA-Bench provides a new comprehensive RP benchmark that includes a variety of characters and evaluation criteria.
o3 and DeepSeek-R1 achieved the best performance on English and Chinese RP tasks, respectively.
A novel trade-off was discovered: reasoning ability improves RP performance but also increases hallucinations.
Limitations:
The finding that model scale does not monotonically reduce hallucinations suggests room for improvement.
The trade-off between RP performance and reliability is a challenge that must be addressed in LLM development.