In this paper, we propose Instruction Visual Grounding (IVG) to address visual grounding in synthetic images such as graphical user interfaces (GUIs). Unlike previous visual grounding studies, which mainly target realistic images, this work focuses on locating the coordinates of the target element of a command: given a natural language instruction and a GUI screen as input, the system predicts where the command should be executed. To this end, we propose two methods: IVGocr, which combines an LLM, an object detection model, and an OCR module, and IVGdirect, an end-to-end approach based on a multi-modal architecture. We also release a dedicated dataset for each method. In addition, we propose CPV, a new evaluation metric that relaxes the existing CPS metric, and release our final test dataset to support future research.
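To make the task formulation concrete, the sketch below shows the shared interface that both proposed methods implement; the function and type names are illustrative assumptions for this paper's setting, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Screenshot:
    """Placeholder for a GUI screen capture; a real system would hold image data."""
    width: int
    height: int
    pixels: bytes

def ivg(instruction: str, screen: Screenshot) -> Tuple[int, int]:
    """Map a natural language command and a GUI screen to the (x, y)
    coordinates of the element the command should act on.

    Hypothetical stub: IVGocr would realize this interface with an
    LLM + object detection + OCR pipeline, while IVGdirect would use
    a single end-to-end multi-modal model.
    """
    raise NotImplementedError

# Illustrative call (not from the paper):
# x, y = ivg("Click the Save button", Screenshot(1920, 1080, b""))
```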