In this paper, we propose an all-in-one video restoration framework that expresses corruption-aware semantic context of video frames in natural language via a foundation model. Unlike previous works, we do not presuppose knowledge of the corruptions at training or test time; instead, we decouple the foundation model at inference by learning an approximation of its knowledge, incurring no additional cost. We also call for benchmark standardization in all-in-one video restoration, and propose three-task (3D) and four-task (4D) benchmarks in multi-corruption settings, as well as two time-varying composite-corruption benchmarks, one of which is a new dataset with varying snow intensities that naturally simulates the impact of weather deterioration on videos. We compare our method with previous works and report state-of-the-art performance on all benchmarks.