This paper introduces FURINA-Builder, a novel multi-agent collaborative pipeline that automatically constructs fully customizable role-playing (RP) benchmarks at arbitrary scale, addressing the limitations of existing benchmarks as large language models (LLMs) for RP continue to advance. The pipeline supports evaluating arbitrary characters across diverse scenarios and prompt formats, making it the first adaptive benchmark builder in the RP domain. FURINA-Builder simulates conversations between a test character and other characters drawn from a carefully constructed character-scene pool, while an LLM examiner selects fine-grained evaluation dimensions and refines the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive RP benchmark featuring both real and synthetic test characters, each evaluated under dimension-specific criteria. Human evaluation and a preliminary separability analysis validate the pipeline and the benchmark design. Extensive evaluation of state-of-the-art LLMs shows that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, real characters consistently elicit stronger RP performance than synthetic ones, and reasoning capability further amplifies this gap. Interestingly, we find that model scale does not monotonically reduce hallucinations. More importantly, for reasoning LLMs we identify a novel trade-off: reasoning improves RP performance but also increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability across all LLMs. These results demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.