Existing causality-oriented Video Question Answering (VideoQA) models struggle with high-level inference: they rely on opaque, monolithic pipelines that intertwine video understanding, causal inference, and answer generation, offering limited interpretability and often falling back on superficial heuristics. In this paper, we propose a novel modular framework that explicitly separates causal inference from answer generation. By introducing natural language causal chains as interpretable intermediate representations, we enable transparent and logically consistent inference through structured causal sequences that connect low-level video content to high-level causal reasoning. The two-stage architecture consists of a causal chain extractor (CCE), which generates causal chains from video-question pairs, and a causal chain-driven answerer (CCDA), which produces answers grounded in these chains. To address the lack of annotated inference traces, we propose a scalable method for generating high-quality causal chains from existing datasets using large language models (LLMs). We also propose CauCo, a novel evaluation metric for causally oriented captions. Experiments on three large-scale benchmarks demonstrate that the proposed approach not only outperforms state-of-the-art models but also offers significant advantages in explainability, user trust, and generalization, establishing the CCE as a reusable causal inference engine across diverse domains.
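The two-stage design described above can be sketched as follows. This is an illustrative sketch only: the class names, interfaces, and the canned example chain are hypothetical and are not the paper's actual implementation; the point is that the two stages communicate solely through an interpretable natural-language causal chain.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CausalChain:
    """A natural-language causal chain: an ordered list of event
    descriptions in which each step causes the next."""
    steps: List[str]

    def as_text(self) -> str:
        # Render the chain as a single human-readable string.
        return " -> ".join(self.steps)


class CausalChainExtractor:
    """Stage 1 (CCE): maps a (video, question) pair to a causal chain.

    A real extractor would run a vision-language model over the video;
    here we return a canned chain purely for illustration.
    """

    def extract(self, video_features: Optional[object], question: str) -> CausalChain:
        return CausalChain(steps=[
            "the cyclist swerves to avoid a pothole",
            "the cyclist collides with a pedestrian",
            "the pedestrian drops the groceries",
        ])


class CausalChainDrivenAnswerer:
    """Stage 2 (CCDA): answers the question conditioned only on the chain.

    A real answerer would condition a language model on the chain text;
    here we simply surface the final effect in the chain.
    """

    def answer(self, chain: CausalChain, question: str) -> str:
        return chain.steps[-1]


# Usage: the chain is the sole, inspectable interface between the stages.
cce = CausalChainExtractor()
ccda = CausalChainDrivenAnswerer()
question = "Why did the pedestrian drop the groceries?"
chain = cce.extract(video_features=None, question=question)
print(chain.as_text())
print(ccda.answer(chain, question))
```

Because the intermediate chain is plain text, it can be inspected, edited, or reused by a different answerer, which is what makes the extractor a portable causal inference engine.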