This paper proposes SeeDo, a novel method for interpreting human demonstration videos and generating robot task plans with a Vision Language Model (VLM). SeeDo is a pipeline that integrates keyframe selection, visual recognition, and VLM inference: the VLM watches a human demonstration video (See) and explains the resulting task plan to the robot for execution (Do). We construct a dataset of diverse pick-and-place demonstration videos and experimentally validate the superior performance of SeeDo against several state-of-the-art video-input VLM baselines. We further deploy the generated task plans both in simulation environments and on a real robot arm.