This paper proposes EnvInjection, an environment prompt injection attack against multimodal large language model (MLLM)-based web agents that interact with webpage environments. Existing attacks suffer from limited effectiveness or stealth, or are impractical in real-world settings. To address these limitations, we present a novel attack that perturbs the raw pixel values of the rendered webpage to induce the web agent to perform a specific, attacker-chosen action (the target action). A key challenge is that the mapping from raw pixel values to screenshots is non-differentiable, which prevents gradients from being backpropagated to the perturbation; we therefore train a neural network that approximates this mapping and apply projected gradient descent to solve the resulting optimization problem. Extensive evaluation on diverse webpage datasets demonstrates that EnvInjection outperforms existing baselines.
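The projected gradient descent step mentioned above can be illustrated with a minimal, self-contained sketch. It is not the attack itself: the function `surrogate` is a hypothetical per-pixel linear map standing in for the trained neural network, the squared-error loss stands in for the web agent's loss on the target action, and the names `pgd`, `eps`, `step`, and `iters` are assumptions made for illustration only. The sketch shows only the general pattern of a signed gradient step followed by projection onto an L-infinity ball and the valid pixel range.

```python
def surrogate(pixels, weights):
    """Hypothetical differentiable stand-in for the pixel-to-screenshot mapping."""
    return [w * p for w, p in zip(weights, pixels)]


def loss_and_grad(pixels, weights, target):
    """Toy squared-error loss and its gradient with respect to the pixels."""
    out = surrogate(pixels, weights)
    loss = sum((o - t) ** 2 for o, t in zip(out, target))
    grad = [2.0 * w * (o - t) for w, o, t in zip(weights, out, target)]
    return loss, grad


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd(x, weights, target, eps=0.1, step=0.01, iters=50):
    """Search for a perturbation delta with |delta_i| <= eps that lowers the loss."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        perturbed = [xi + di for xi, di in zip(x, delta)]
        _, grad = loss_and_grad(perturbed, weights, target)
        # Signed gradient descent step on the perturbation.
        delta = [di - step * sign(g) for di, g in zip(delta, grad)]
        # Projection: stay inside the L-infinity ball of radius eps
        # and keep every perturbed pixel inside [0, 1].
        delta = [
            max(max(-eps, -xi), min(min(eps, 1.0 - xi), di))
            for xi, di in zip(x, delta)
        ]
    return delta


# Illustrative values only; they do not come from the paper.
x = [0.2, 0.5, 0.8]        # raw pixel values in [0, 1]
weights = [1.0, 0.5, 2.0]  # parameters of the toy surrogate
target = [0.5, 0.2, 1.0]   # output that the toy loss pulls towards

delta = pgd(x, weights, target)
initial_loss, _ = loss_and_grad(x, weights, target)
final_loss, _ = loss_and_grad(
    [xi + di for xi, di in zip(x, delta)], weights, target
)
```

In this toy setting the perturbation budget `eps` bounds how far each pixel may move, which is the kind of constraint that keeps a pixel-level perturbation hard to notice; a real attack would replace the linear map with the learned approximation and the squared error with a loss defined on the agent's output.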