This paper proposes applying watermarking to benchmarks to address benchmark contamination, which poses a serious threat to the reliability of large language model (LLM) evaluation. The benchmark is watermarked by rephrasing its original questions with a watermarked LLM, without compromising the benchmark's utility. At evaluation time, a theoretically grounded statistical test detects "radioactivity", i.e., the trace that the text watermark leaves in a model trained on the contaminated data. A 1B-parameter model is pre-trained from scratch on 10B tokens, and the effectiveness of contamination detection is verified on ARC-Easy, ARC-Challenge, and MMLU. The results show that benchmark utility remains comparable after watermarking, and that contamination is successfully detected whenever it is strong enough to improve performance (e.g., a +5% gain on ARC-Easy yields a p-value of 10⁻³).
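To make the detection step concrete, below is a minimal sketch of a green-list-style watermark test under the null hypothesis of no contamination. The function name, the green-list fraction gamma, and the binomial formulation are illustrative assumptions for a Kirchenbauer-style watermark, not the paper's exact statistic.

```python
# Minimal sketch, assuming a green-list watermark with green-list fraction gamma.
# Under H0 (the model was NOT trained on watermarked text), each scored token
# falls in the green list with probability gamma, so the green-token count is
# Binomial(num_scored, gamma). A small p-value suggests the model over-produces
# green tokens, i.e. it carries traces ("radioactivity") of the watermarked benchmark.
from scipy.stats import binom

def radioactivity_pvalue(num_green: int, num_scored: int, gamma: float = 0.25) -> float:
    """One-sided p-value: probability of at least `num_green` green tokens by chance."""
    return binom.sf(num_green - 1, num_scored, gamma)

# Example: 3,100 green tokens out of 10,000 scored tokens with gamma = 0.25
print(radioactivity_pvalue(3_100, 10_000, 0.25))  # very small p-value -> contamination suspected
```

The sketch only illustrates how a p-value with theoretical guarantees can be derived from per-token watermark scores; the paper's actual test accounts for details such as token dependence and score aggregation across the benchmark.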