Veo 3 is a large-scale generative video model trained on web-scale data. Like a large language model (LLM), it exhibits zero-shot capabilities: it performs a wide range of tasks without task-specific training. These tasks include object segmentation, edge detection, image editing, physical property understanding, object affordance recognition, and tool-use simulation. It also shows early forms of visual reasoning, such as solving mazes and symmetry problems. Together, these capabilities suggest that video models could evolve into general-purpose vision models.