This paper considers the diverse tasks of generating natural language from images or video sequences as special cases of a more general problem: modeling the complex relationships between temporally unfolding visual events and the linguistic features used to interpret or describe them. While previous research has addressed a variety of visual natural language processing tasks, a systematic examination of the nature and extent of intermodal interactions has been lacking. This paper therefore surveys five such tasks, examines the modeling and evaluation approaches used for each, and identifies common challenges and directions for future research.