Large language models (LLMs) often produce responses with inherent biases, which makes them unreliable in real-world applications. Existing evaluation methods often overlook both the biases present in long-form responses and the inherent variability of LLM outputs. To address these challenges, in this paper we propose a novel statistical framework, Fine-grained Semantic Computation (FiSCo), that assesses group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike previous studies that focus on sentiment or token-level comparisons, FiSCo performs semantic-unit analysis, leveraging entailment checks to assess the consistency of meaning across responses. It decomposes model outputs into semantically distinct claims and applies statistical hypothesis testing to compare between-group and within-group similarities, enabling robust detection of subtle biases. We formalize a novel definition of group counterfactual fairness and validate FiSCo on synthetic and human-annotated datasets spanning gender, race, and age. Experimental results show that FiSCo identifies subtle biases more reliably than a range of existing evaluation metrics while reducing the impact of stochastic LLM variability.
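To make the between-group versus within-group comparison concrete, the sketch below illustrates the general idea under simplifying assumptions: the `similarity` function stands in for FiSCo's claim decomposition and entailment-based scoring (abstracted here as a black box), and Welch's t-test is used as a placeholder two-sample test; the function name `group_fairness_test` and the overall setup are illustrative, not the paper's exact procedure.

```python
# Illustrative sketch, not the paper's implementation: compare within-group and
# between-group response similarities with a two-sample hypothesis test.
from itertools import combinations, product
from scipy import stats

def group_fairness_test(responses_by_group, similarity):
    """responses_by_group: dict mapping a demographic group label to a list of
    long-form model responses. similarity: callable(resp_a, resp_b) -> float,
    assumed to come from claim-level entailment scoring (abstracted here)."""
    within, between = [], []
    groups = list(responses_by_group)
    # Within-group pairs: responses drawn from the same demographic group.
    for g in groups:
        for a, b in combinations(responses_by_group[g], 2):
            within.append(similarity(a, b))
    # Between-group pairs: responses drawn from two different groups.
    for g1, g2 in combinations(groups, 2):
        for a, b in product(responses_by_group[g1], responses_by_group[g2]):
            between.append(similarity(a, b))
    # If between-group similarity is systematically lower than within-group
    # similarity, responses differ across groups beyond ordinary sampling noise.
    t_stat, p_value = stats.ttest_ind(within, between, equal_var=False)
    return t_stat, p_value
```

A small p-value here would indicate that the gap between within-group and between-group similarity is unlikely to arise from the model's probabilistic output variability alone, which is the intuition behind the group-level comparison described in the abstract.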