This paper presents VITA (VIsion-To-Action policy), a noise-free, unconditional policy learning framework that directly maps visual information to actions. VITA uses flow matching to treat latent visual representations as the source of the flow, eliminating explicit conditioning mechanisms and their time and memory overhead. Because actions are lower-dimensional, less structured, and sparser than visual representations, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with the visual latent space. Furthermore, to prevent latent space collapse, we propose flow latent decoding, which backpropagates the action reconstruction loss through a flow-matching ODE step. VITA outperforms existing generative policies in simulated and real-world environments, achieving 1.5-2.3 times faster inference than conditioning-based methods.
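To make the training objective concrete, the sketch below illustrates the two components the abstract describes: flow matching from visual latents (the flow source) to action latents (the flow target), plus flow latent decoding, where one Euler step of the flow ODE is decoded back into raw actions and the reconstruction loss is backpropagated. This is a minimal PyTorch illustration under assumed shapes; the module names (ActionAE, velocity), dimensions, and MLP architectures are hypothetical placeholders, not the paper's actual design.

```python
# Minimal sketch of VITA-style training (hypothetical architecture).
import torch
import torch.nn as nn

LATENT_DIM = 256  # assumed shared latent dimension for vision and actions

class ActionAE(nn.Module):
    """Action autoencoder: maps raw actions into a structured latent space."""
    def __init__(self, action_dim: int, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, action_dim))

    def forward(self, a):
        z = self.enc(a)
        return z, self.dec(z)

# v_theta(z_t, t): predicts the flow velocity at latent z_t and time t.
velocity = nn.Sequential(nn.Linear(LATENT_DIM + 1, 512), nn.ReLU(),
                         nn.Linear(512, LATENT_DIM))

def vita_loss(z_vis, actions, action_ae, velocity):
    """Flow matching with visual latents as the source (no noise, no
    conditioning input), plus flow latent decoding to prevent collapse."""
    z_act, _ = action_ae(actions)                       # flow target
    t = torch.rand(z_vis.size(0), 1)                    # per-sample flow time
    z_t = (1 - t) * z_vis + t * z_act                   # linear interpolant
    v_pred = velocity(torch.cat([z_t, t], dim=-1))
    fm_loss = ((v_pred - (z_act - z_vis)) ** 2).mean()  # flow matching loss

    # Flow latent decoding: take one Euler ODE step toward t = 1, decode
    # raw actions, and backpropagate the reconstruction loss so gradients
    # reach the action latent space and it cannot collapse.
    z_hat = z_t + (1 - t) * v_pred
    a_hat = action_ae.dec(z_hat)
    rec_loss = ((a_hat - actions) ** 2).mean()
    return fm_loss + rec_loss

# Example usage with dummy data (vision encoder omitted for brevity):
action_ae = ActionAE(action_dim=7)
z_vis = torch.randn(32, LATENT_DIM)   # e.g., from a frozen vision encoder
actions = torch.randn(32, 7)
loss = vita_loss(z_vis, actions, action_ae, velocity)
loss.backward()
```

Because the visual latent itself serves as the source distribution, inference needs no sampled noise and no conditioning pathway, which is where the claimed speedup over conditioning-based generative policies would come from.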