In this paper, we address the dense video captioning task, which aims to detect and caption all events in long, untrimmed videos. Existing methods do not sufficiently explore the scene evolution within an event temporal proposal and therefore perform poorly when the scene and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework. In the 'partition' step, the whole event proposal is divided into short video segments so that captions can be generated at a finer level; in the 'summarization' step, the generated sentences, which carry rich descriptive information about each segment, are summarized into one sentence describing the whole event. Focusing on the 'summarization' step, we exploit the relationships among semantic words by treating them as nodes of a graph and learning their interactions through a combination of a Graph Convolutional Network (GCN) and Long Short-Term Memory (LSTM), with the aid of visual cues. To seamlessly integrate the GCN and the LSTM, we propose two GCN-LSTM Interaction (GLI) modules. We demonstrate the effectiveness of the proposed method through extensive comparisons with state-of-the-art methods on the ActivityNet Captions and YouCook II datasets.
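As a rough illustration of the summarization idea sketched above, the following Python snippet shows a minimal GCN-LSTM interaction step: word nodes are propagated through one graph-convolution layer, pooled, fused with a visual feature, and fed into an LSTM cell for one decoding step. The module names, dimensions, pooling, and fusion scheme are illustrative assumptions, not the paper's exact GLI design.

```python
# Minimal sketch of a GCN-LSTM interaction step (assumed design, not the
# paper's exact GLI module); PyTorch is assumed to be available.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj_norm):
        # node_feats: (num_nodes, in_dim); adj_norm: (num_nodes, num_nodes)
        return torch.relu(adj_norm @ self.linear(node_feats))


class GCNLSTMInteraction(nn.Module):
    """Aggregates word-node features with a GCN, fuses them with a visual
    cue, and feeds the result into an LSTM cell for one decoding step."""
    def __init__(self, word_dim, visual_dim, hidden_dim):
        super().__init__()
        self.gcn = SimpleGCNLayer(word_dim, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim + visual_dim, hidden_dim)

    def forward(self, node_feats, adj_norm, visual_feat, state):
        h_nodes = self.gcn(node_feats, adj_norm)      # (num_nodes, hidden_dim)
        graph_summary = h_nodes.mean(dim=0)           # simple mean pooling over nodes
        lstm_in = torch.cat([graph_summary, visual_feat], dim=-1).unsqueeze(0)
        return self.lstm(lstm_in, state)              # (h, c) after one step


# Toy usage: 5 semantic-word nodes, 300-d word features, 500-d visual feature.
words = torch.randn(5, 300)
adj = torch.eye(5) + torch.rand(5, 5)                 # self-loops plus random edges
adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalize the adjacency
visual = torch.randn(500)
model = GCNLSTMInteraction(word_dim=300, visual_dim=500, hidden_dim=512)
state = (torch.zeros(1, 512), torch.zeros(1, 512))
h, c = model(words, adj, visual, state)
print(h.shape)                                        # torch.Size([1, 512])
```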