Large language models (LLMs) often generate responses with inherent biases, compromising their reliability in real-world applications. Existing evaluation methods frequently overlook both the biases embedded in long-form responses and the inherent variability of LLM outputs. To address these challenges, this paper proposes Fine-Grained Semantic Comparison (FiSCo), a novel statistical framework for assessing group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike previous studies that focus on sentiment or token-level comparisons, FiSCo analyzes responses at the semantic level, leveraging entailment checks to assess the consistency of meaning across responses. It decomposes model outputs into semantically distinct claims and applies statistical hypothesis testing to compare between-group and within-group similarities, enabling robust detection of subtle biases. We formalize a novel definition of group counterfactual fairness and validate FiSCo on synthetic and human-annotated datasets spanning gender, race, and age. Experimental results demonstrate that FiSCo identifies subtle biases more reliably than a range of existing evaluation metrics while mitigating the impact of stochastic LLM variability.
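
To make the statistical comparison concrete, the sketch below illustrates one way the between-group versus within-group test could be carried out. It assumes responses have already been decomposed into claims and that a claim-level semantic similarity function `sim(i, j)` is available; the function name, group-label interface, and the choice of a one-sided Welch t-test are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch: flag potential group-level bias when between-group response
# similarity is significantly lower than within-group similarity.
from itertools import combinations, product
from scipy.stats import ttest_ind


def group_fairness_test(sim, group_a, group_b, alpha=0.05):
    """Compare within-group and between-group response similarities.

    sim(i, j)   -> semantic (e.g., entailment-based) similarity of responses i, j
    group_a/_b  -> lists of response indices for two demographic groups
    Returns (t_statistic, one_sided_p_value, bias_flag).
    """
    # All pairwise similarities among responses of the same group.
    within = [sim(i, j) for g in (group_a, group_b) for i, j in combinations(g, 2)]
    # All pairwise similarities across the two groups.
    between = [sim(i, j) for i, j in product(group_a, group_b)]

    # Welch's t-test (unequal variances); convert to a one-sided p-value for
    # the alternative hypothesis "between-group similarity < within-group".
    t_stat, p_two_sided = ttest_ind(between, within, equal_var=False)
    p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    return t_stat, p_one_sided, p_one_sided < alpha
```

Comparing distributions of pairwise similarities, rather than single aggregate scores, is what lets this style of test absorb the stochastic variability of LLM outputs: within-group variation serves as the baseline against which between-group differences are judged.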