This paper proposes a system for efficient understanding of long-form video. To address the high dimensionality and density of long-form video, we introduce a method that incrementally builds a memory by dividing the video into short chunks and generating a text-based summary (caption) for each chunk. To overcome the limitations of simple action captioning, we enrich this memory with static scene descriptions produced by a vision-language model (VLM). Combining the LaViLa video captioning model with a large language model (LLM), we build a video question-answering system. By integrating the video segmentation method with the VLM, we improve both the quality of the caption log and the range of answerable questions. Finally, we fine-tune the LaViLa model to generate both action and scene captions, thereby improving the efficiency of the captioning pipeline. We also develop a controllable hybrid captioner that switches between caption types based on detected scene changes using special tokens.
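
The following is a minimal sketch of the incremental memory-building loop described above, under assumed interfaces: `caption_action`, `describe_scene`, `scene_changed`, and `answer_with_llm` are hypothetical placeholders standing in for the LaViLa captioner, the VLM scene describer, the scene-change trigger of the hybrid captioner, and the LLM answerer, not the authors' actual implementations.

```python
from typing import List, Sequence

def caption_action(chunk) -> str:
    """Placeholder for the LaViLa action captioner (assumption)."""
    return "a person chops vegetables"

def describe_scene(frame) -> str:
    """Placeholder for the VLM static scene description (assumption)."""
    return "a kitchen with a wooden countertop"

def scene_changed(prev_frame, frame) -> bool:
    """Placeholder for the scene-change detector that triggers scene captions."""
    return False

def answer_with_llm(memory: List[str], question: str) -> str:
    """Placeholder: the LLM answers the question from the text-based caption log."""
    return "..."

def build_memory(chunks: Sequence) -> List[str]:
    """Incrementally build a text memory: one action caption per chunk,
    plus a scene description whenever the scene changes."""
    memory: List[str] = []
    prev_frame = None
    for chunk in chunks:
        frame = chunk[0]  # representative frame of the chunk
        memory.append(caption_action(chunk))
        if prev_frame is None or scene_changed(prev_frame, frame):
            memory.append(describe_scene(frame))
        prev_frame = frame
    return memory

# Usage: answer a question over the accumulated caption log.
# memory = build_memory(video_chunks)
# answer = answer_with_llm(memory, "What room is the person in?")
```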