This paper addresses the problem of grounding natural language queries in graphical user interfaces (GUIs), a task complicated by the diversity of visual elements, spatial clutter, and linguistic ambiguity. We present DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of processing the GUI as a single monolithic image, the input is split into textual and iconographic elements, and a general-purpose vision-language model reasons over each modality independently. When a prediction is ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focus regions centered on the model's initial prediction and progressively zooming into subregions to refine the grounding result. This hierarchical refinement process helps resolve ambiguity in visually cluttered layouts without additional training or annotation. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-centric refinement.
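To make the hierarchical refinement concrete, the sketch below shows one plausible form of the zoom-in loop: re-grounding on progressively smaller crops centered on the current prediction. It assumes a hypothetical `vlm_ground(image, query)` call returning an (x, y) point; the actual DiMo-GUI model interface, crop schedule, and stopping criterion are not specified here and may differ.

```python
from PIL import Image


def vlm_ground(image: Image.Image, query: str) -> tuple[int, int]:
    """Hypothetical wrapper around a general-purpose vision-language model
    that returns an (x, y) point prediction for the query. Placeholder only."""
    raise NotImplementedError("plug in your VLM grounding call here")


def iterative_zoom_ground(image: Image.Image, query: str,
                          num_steps: int = 3,
                          shrink: float = 0.5) -> tuple[int, int]:
    """Progressively crop a focus region centered on the current prediction,
    re-ground inside it, and map each local prediction back to full-image
    coordinates (an assumed realization of the paper's refinement loop)."""
    w, h = image.size
    # Initial prediction on the full screenshot.
    x, y = vlm_ground(image, query)
    crop_w, crop_h = w, h
    for _ in range(num_steps):
        # Shrink the candidate focus region around the current prediction.
        crop_w, crop_h = int(crop_w * shrink), int(crop_h * shrink)
        left = min(max(x - crop_w // 2, 0), w - crop_w)
        top = min(max(y - crop_h // 2, 0), h - crop_h)
        region = image.crop((left, top, left + crop_w, top + crop_h))
        # Re-run grounding on the zoomed subregion, then map back to
        # coordinates in the original screenshot.
        lx, ly = vlm_ground(region, query)
        x, y = left + lx, top + ly
    return x, y
```

In this reading, each iteration halves the focus region, so ambiguity from distant distractor elements is pruned away while the coordinate mapping keeps predictions comparable across scales.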