In this paper, we propose a novel framework, BiMa, to address the visual-linguistic bias problem in text-to-video retrieval (TVR) systems. BiMa mitigates bias in both the visual representation of videos and the linguistic representation of texts. On the visual side, we identify relevant objects and activities in videos to generate scene elements and integrate them into the video embeddings, highlighting fine-grained and salient details. On the linguistic side, we introduce a mechanism that separates text features into content and bias components so that the model can focus on meaningful content. Through extensive experiments and ablation studies on five major TVR benchmarks (MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo), we verify BiMa’s competitive performance and bias mitigation ability. In particular, BiMa achieves strong results on out-of-distribution retrieval tasks, demonstrating its ability to mitigate bias.
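To make the two ideas concrete, the sketch below illustrates them in PyTorch: a projection head that splits a text embedding into content and bias components, and a cross-attention module that fuses scene-element embeddings into the video embedding. This is a minimal conceptual sketch under our own assumptions; all module and variable names (`TextContentBiasSeparator`, `SceneElementFusion`, etc.) are hypothetical illustrations, not BiMa's actual implementation.

```python
# Conceptual sketch of the two debiasing ideas described above.
# All names here are hypothetical, not BiMa's actual architecture.
import torch
import torch.nn as nn


class TextContentBiasSeparator(nn.Module):
    """Splits a text embedding into a 'content' part and a 'bias' part
    via two learned projections (a common disentanglement pattern)."""

    def __init__(self, dim: int):
        super().__init__()
        self.content_proj = nn.Linear(dim, dim)
        self.bias_proj = nn.Linear(dim, dim)

    def forward(self, text_emb: torch.Tensor):
        content = self.content_proj(text_emb)  # kept for retrieval matching
        bias = self.bias_proj(text_emb)        # isolated, down-weighted at scoring
        return content, bias


class SceneElementFusion(nn.Module):
    """Injects detected scene-element embeddings (objects/activities)
    into the video embedding via cross-attention with a residual."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_emb: torch.Tensor, scene_embs: torch.Tensor):
        # video_emb: (B, 1, D) query; scene_embs: (B, K, D) keys/values
        fused, _ = self.attn(video_emb, scene_embs, scene_embs)
        return video_emb + fused  # residual keeps the original video signal


if __name__ == "__main__":
    B, K, D = 2, 5, 256
    text = torch.randn(B, D)
    video = torch.randn(B, 1, D)
    scenes = torch.randn(B, K, D)  # stand-in for detected scene elements

    content, bias = TextContentBiasSeparator(D)(text)
    video_debiased = SceneElementFusion(D)(video, scenes)

    # Retrieval would score the content features against the
    # scene-aware video embeddings, ignoring the bias component.
    scores = content @ video_debiased.squeeze(1).T
    print(scores.shape)  # (B, B) text-video similarity matrix
```

In practice, the separator would be trained with an objective that pushes dataset-specific cues into the bias branch (e.g., an auxiliary loss), while only the content branch participates in the text-video similarity used for ranking.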