This paper studies the reliability of synthetic test collections generated using large language models (LLMs). We investigate potential biases in synthetic test collections that use LLMs to generate queries, labels, or both, and analyze their impact on system evaluation. Our results show that evaluations based on synthetic test collections do exhibit bias: while this bias can affect absolute measurements of system performance, it appears to be less significant when comparing the relative performance of systems.