Large-scale vision-language-action (VLA) models pre-trained on massive robot datasets offer promising generalist policies for robot learning, but they still struggle with the spatiotemporal dynamics of interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we present visual trace prompting, a simple yet effective approach that enhances the spatiotemporal awareness of VLA models by visually encoding state-action trajectories. Using our self-collected dataset of 150,000 robot manipulation trajectories, we develop a new TraceVLA model by fine-tuning OpenVLA with visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and four tasks on a physical WidowX robot demonstrate state-of-the-art performance: it outperforms OpenVLA by 10% on SimplerEnv and by 3.5x on real-robot tasks, while exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact 4B VLA model based on Phi-3-Vision, pre-trained on Open-X-Embodiment and fine-tuned on our dataset, which matches the performance of the 7B OpenVLA baseline while significantly improving inference efficiency.
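
To make the core idea concrete, the sketch below illustrates one plausible way to realize visual trace prompting: past 2D point tracks (e.g., produced by an off-the-shelf point tracker) are drawn as polylines onto the current camera frame, and the annotated image is supplied to the VLA alongside the original observation and the language instruction. This is a minimal illustration under stated assumptions, not the released implementation; the function `overlay_visual_trace` and its parameters are hypothetical names introduced here.

```python
import numpy as np
import cv2


def overlay_visual_trace(frame: np.ndarray,
                         tracks: np.ndarray,
                         color=(0, 255, 0),
                         thickness=2) -> np.ndarray:
    """Draw past point trajectories onto an RGB frame (illustrative only).

    frame:  (H, W, 3) uint8 image from the robot camera.
    tracks: (K, T, 2) array of K tracked points over T past timesteps,
            given as (x, y) pixel coordinates.
    Returns a copy of the frame with each track rendered as a polyline.
    """
    prompted = frame.copy()
    for track in tracks:
        pts = track.astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(prompted, [pts], isClosed=False,
                      color=color, thickness=thickness)
        # Mark the most recent position of each track with a filled dot.
        cv2.circle(prompted, tuple(int(v) for v in track[-1]), 4, color, -1)
    return prompted


if __name__ == "__main__":
    # Dummy example: a 256x256 frame with two synthetic point tracks.
    frame = np.zeros((256, 256, 3), dtype=np.uint8)
    tracks = np.stack([
        np.linspace([40, 40], [200, 120], num=6),
        np.linspace([60, 200], [180, 180], num=6),
    ])
    prompt_image = overlay_visual_trace(frame, tracks)
    # `prompt_image` would then be fed to the VLA together with the
    # original frame and the language instruction.
```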