This paper proposes Visual-Subtitle Integration (VSI), an efficient keyframe retrieval method for long-form video understanding. Existing keyframe retrieval methods suffer from weak multimodal alignment between textual queries and visual content and fail to capture complex temporal semantics; to address these limitations, VSI integrates subtitles, timestamps, and scene boundaries into a unified multimodal retrieval process. It exploits both the visual content of video frames and their complementary textual information through a video retrieval stream and a subtitle matching stream, and improves keyframe retrieval accuracy through the interaction of the two streams. On the LongVideoBench dataset, VSI significantly outperforms competing methods in keyframe localization accuracy and on the long-form video question-answering (Video-QA) task, achieving state-of-the-art performance.