Long-form video data is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant information far more concisely than raw video. Furthermore, text representations can be readily processed by state-of-the-art large language models (LLMs), enabling inferences about video content to answer complex natural language queries. To address this challenge, we rely on a video captioner that incrementally builds a text-based memory, operating on short video chunks where spatiotemporal modeling remains computationally feasible. We explore methods to improve the quality of activity logs composed of short video captions. Video captions tend to focus primarily on human actions, whereas questions may concern other information present in the scene. We therefore use Vision Language Models (VLMs) to add static scene descriptions to the memory. Our video understanding system combines the LaViLa video captioner with LLMs to answer questions about videos. We first explore various methods for segmenting videos into meaningful segments that more accurately reflect the structure of the video content. We then integrate static scene descriptions into the captioning pipeline using the LLaVA VLM, resulting in more detailed and complete caption logs and expanding the range of questions that can be answered from the text-based memory. Finally, we fine-tune the LaViLa video captioner to generate both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate models for the two tasks. The resulting model, a controllable hybrid captioner, can alternate between caption types based on special input tokens that signal scene changes detected in the video.
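To make the overall flow concrete, the sketch below outlines one way such an incremental text-based memory could be assembled and queried. It is a minimal illustration, not the actual interface of LaViLa, LLaVA, or the fine-tuned hybrid captioner: the names (Chunk, TextMemory, build_memory, answer_question) and the caption_action, caption_scene, scene_changed, and llm callables are hypothetical placeholders standing in for the real components.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative representation: a video chunk is identified only by its time span here.
@dataclass
class Chunk:
    start_s: float
    end_s: float

@dataclass
class TextMemory:
    """Incrementally built log of time-stamped captions."""
    entries: List[str] = field(default_factory=list)

    def add(self, t0: float, t1: float, text: str) -> None:
        self.entries.append(f"[{t0:.1f}-{t1:.1f}s] {text}")

    def as_prompt(self) -> str:
        return "\n".join(self.entries)

def build_memory(
    chunks: List[Chunk],
    caption_action: Callable[[Chunk], str],  # placeholder for a LaViLa-style action captioner
    caption_scene: Callable[[Chunk], str],   # placeholder for a LLaVA-style scene describer
    scene_changed: Callable[[Chunk], bool],  # placeholder for a scene-change detector
) -> TextMemory:
    """Caption each short chunk; insert a static scene description when a scene change is detected."""
    memory = TextMemory()
    for chunk in chunks:
        if scene_changed(chunk):
            memory.add(chunk.start_s, chunk.end_s, "SCENE: " + caption_scene(chunk))
        memory.add(chunk.start_s, chunk.end_s, caption_action(chunk))
    return memory

def answer_question(memory: TextMemory, question: str, llm: Callable[[str], str]) -> str:
    """Pose a natural-language question against the text-based memory via an LLM."""
    prompt = (
        "The following is a log of captions from a long video:\n"
        f"{memory.as_prompt()}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)
```

In the hybrid-captioner variant described above, caption_action and caption_scene would collapse into a single model whose output type is steered by a special scene-change input token rather than by two separate calls.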