In this paper, we propose a division-and-summarization (DaS) framework for dense video captioning of long, unstructured videos. First, we divide each long video into multiple event proposals, where each event proposal consists of a set of short video segments. We extract visual features (e.g., C3D features) from each segment and generate a one-sentence description for each segment using existing image/video captioning methods. Because the generated sentences contain rich semantic descriptions of the entire event proposal, we formulate dense video captioning as a visual-cue-aided sentence summarization problem and propose a novel two-stage long short-term memory (LSTM) approach with a hierarchical attention mechanism that summarizes all generated sentences into a single descriptive sentence with the aid of visual features. Specifically, the first-stage LSTM network acts as an encoder: it takes as input all semantic words of the generated sentences and the visual features of all segments in an event proposal, and effectively summarizes the semantic and visual information related to that proposal. The second-stage LSTM network acts as a decoder: it takes the output of the first-stage LSTM network together with the visual features of all video segments in the event proposal and generates a descriptive sentence for the proposal. Comprehensive experiments on the ActivityNet Captions dataset demonstrate the effectiveness of the proposed DaS framework.
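To make the two-stage design concrete, the following PyTorch-style sketch illustrates one possible realization of the encoder-decoder summarizer: a first-stage LSTM encodes the words of the per-segment sentences, and a second-stage LSTM decodes the final caption while attending to both the encoder states and the per-segment visual features. The module names, dimensions (word_dim, visual_dim, hidden_dim), and the use of simple additive attention at each level are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two-stage LSTM summarizer described above (assumptions noted in comments).
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention used here as a stand-in for one
    level of the hierarchical attention mechanism."""
    def __init__(self, query_dim, key_dim, hidden_dim):
        super().__init__()
        self.proj_q = nn.Linear(query_dim, hidden_dim)
        self.proj_k = nn.Linear(key_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys):
        # query: (B, query_dim); keys: (B, N, key_dim)
        energy = self.score(torch.tanh(self.proj_q(query).unsqueeze(1) + self.proj_k(keys)))
        weights = torch.softmax(energy, dim=1)           # (B, N, 1) attention weights
        return (weights * keys).sum(dim=1)               # weighted sum of keys: (B, key_dim)


class TwoStageLSTMSummarizer(nn.Module):
    def __init__(self, vocab_size, word_dim=300, visual_dim=500, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Stage 1: encoder LSTM over the words of all generated per-segment sentences.
        self.encoder = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        # Hierarchical attention (simplified): one module over encoder word states,
        # one over per-segment visual features (e.g., C3D).
        self.word_attn = AdditiveAttention(hidden_dim, hidden_dim, hidden_dim)
        self.seg_attn = AdditiveAttention(hidden_dim, visual_dim, hidden_dim)
        # Stage 2: decoder LSTM generating the final descriptive sentence.
        self.decoder = nn.LSTMCell(word_dim + hidden_dim + visual_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence_words, visual_feats, target_words):
        # sentence_words: (B, T_words) word ids from the per-segment captions
        # visual_feats:   (B, N_segments, visual_dim) segment-level visual features
        # target_words:   (B, T_out) ground-truth caption ids (teacher forcing)
        enc_states, (h_n, c_n) = self.encoder(self.embed(sentence_words))
        h, c = h_n.squeeze(0), c_n.squeeze(0)            # seed decoder with encoder summary
        logits = []
        for t in range(target_words.size(1)):
            sem_ctx = self.word_attn(h, enc_states)      # summarized semantic information
            vis_ctx = self.seg_attn(h, visual_feats)     # summarized visual information
            step_in = torch.cat([self.embed(target_words[:, t]), sem_ctx, vis_ctx], dim=-1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, T_out, vocab_size)
```

In this sketch the decoder re-attends to both the semantic (word-level) and visual (segment-level) sources at every time step, which is the intuition behind using a hierarchical attention mechanism during summarization; the exact factorization of the two attention levels follows the paper rather than this simplified stand-in.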