In this paper, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework for assessing inherent bias in the responses of large language models (LLMs). Existing evaluation methods tend to overlook bias in long-form responses and the inherent variability of LLM outputs; FiSCo addresses this by assessing group-level fairness through the detection of subtle semantic differences in long-form responses across demographic groups. Unlike prior work that focuses on sentiment or token-level comparisons, FiSCo operates at the claim level and goes beyond surface-level analysis by leveraging implicit checks to assess semantic consistency across responses. It decomposes model outputs into semantically distinct claims and applies statistical hypothesis testing to compare between- and within-group similarities, enabling robust detection of subtle biases. We formalize a novel group counterfactual definition of fairness and validate FiSCo on synthetic and human-annotated datasets spanning gender, race, and age. Experimental results show that FiSCo outperforms a range of existing evaluation metrics, identifying subtle biases more reliably while mitigating the impact of stochastic LLM variability.
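As a rough illustration of the claim-level comparison and hypothesis test sketched above, the following Python snippet contrasts within-group and between-group response similarities with a Welch's t-test. It is a minimal sketch, not the paper's implementation: the names claim_similarity, group_bias_test, and the toy data are hypothetical, and the placeholder similarity function (exact claim overlap) stands in for whatever semantic consistency check (e.g., an entailment model) the framework actually uses.

```python
from itertools import combinations, product
from scipy import stats


def claim_similarity(resp_a: list[str], resp_b: list[str]) -> float:
    """Placeholder similarity: fraction of claims in resp_a also present in resp_b.
    A real system would score semantic agreement between claims instead of exact match."""
    shared = sum(1 for claim in resp_a if claim in resp_b)
    return shared / max(len(resp_a), 1)


def group_bias_test(responses_by_group: dict[str, list[list[str]]], alpha: float = 0.05):
    """Compare within-group vs. between-group similarity of claim-decomposed responses."""
    within, between = [], []
    groups = list(responses_by_group)

    # Within-group: pairs of responses generated for the same demographic group.
    for g in groups:
        for a, b in combinations(responses_by_group[g], 2):
            within.append(claim_similarity(a, b))

    # Between-group: pairs of responses drawn from two different demographic groups.
    for g1, g2 in combinations(groups, 2):
        for a, b in product(responses_by_group[g1], responses_by_group[g2]):
            between.append(claim_similarity(a, b))

    # H0: between-group similarity equals within-group similarity (no group-level bias).
    t_stat, p_value = stats.ttest_ind(within, between, equal_var=False)
    return {"t": t_stat, "p": p_value, "biased": bool(p_value < alpha and t_stat > 0)}


if __name__ == "__main__":
    # Toy example: each response is already decomposed into a list of claims.
    toy = {
        "group_A": [["approve the loan", "low risk"], ["approve the loan", "stable income"]],
        "group_B": [["deny the loan", "high risk"], ["deny the loan", "needs further review"]],
    }
    print(group_bias_test(toy))
```

In this sketch, a significantly higher within-group than between-group similarity (positive t-statistic with p below alpha) is read as evidence that responses differ systematically across demographic groups rather than by chance variation alone.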