To address the problem of the large number of visual tokens generated when processing high-resolution images, this paper proposes AVG-LLaVA, a large multimodal model (LMM) that adaptively selects an appropriate visual granularity based on the input image and instruction. AVG-LLaVA produces visual tokens at multiple granularities through a series of pooling layers and selects among them using a visual granularity router composed of a Transformer layer, an MLP layer, and a voter layer. Furthermore, the paper presents RGLF, a novel training paradigm that aligns the router's predictions with the LMM's preferences without requiring additional manual annotation. Experimental results show that AVG-LLaVA achieves superior performance across 11 benchmarks while significantly reducing the number of visual tokens and speeding up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
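To make the architecture description above concrete, the following is a minimal sketch, assuming PyTorch and illustrative names (`MultiGranularityPooling`, `GranularityRouter`, the pooling kernel sizes, and the token-averaging voter), of how multi-granularity token generation and a Transformer + MLP + voter router could be wired together. It is not the authors' implementation; the actual AVG-LLaVA design (pooling schedule, fusion of image and instruction features, voting mechanism) may differ.

```python
# Illustrative sketch only: class names, dimensions, and the voting scheme are assumptions.
import torch
import torch.nn as nn


class MultiGranularityPooling(nn.Module):
    """Produce visual-token sets at several granularities via average pooling."""

    def __init__(self, kernel_sizes=(1, 2, 4)):
        super().__init__()
        self.kernel_sizes = kernel_sizes

    def forward(self, tokens):  # tokens: (B, N, D); N assumed to form a square patch grid
        B, N, D = tokens.shape
        side = int(N ** 0.5)
        grid = tokens.transpose(1, 2).reshape(B, D, side, side)
        outputs = []
        for k in self.kernel_sizes:
            pooled = grid if k == 1 else nn.functional.avg_pool2d(grid, k)
            outputs.append(pooled.flatten(2).transpose(1, 2))  # (B, N / k^2, D)
        return outputs  # list of token sets, fine to coarse


class GranularityRouter(nn.Module):
    """Score candidate granularities with a Transformer layer, an MLP, and a voter."""

    def __init__(self, dim, num_granularities):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_granularities))
        # "Voter" here: each token votes over granularities; votes are averaged
        # into a single distribution per sample.

    def forward(self, fused_tokens):  # fused image+instruction features: (B, T, D)
        h = self.encoder(fused_tokens)
        votes = self.mlp(h).softmax(dim=-1)  # (B, T, num_granularities)
        return votes.mean(dim=1)             # (B, num_granularities)


if __name__ == "__main__":
    B, N, D = 2, 576, 1024                   # e.g., a 24x24 patch grid (assumed)
    visual = torch.randn(B, N, D)
    candidates = MultiGranularityPooling()(visual)          # token sets at 3 granularities
    router = GranularityRouter(D, num_granularities=len(candidates))
    probs = router(torch.randn(B, 64, D))    # stand-in for fused visual+text features
    choice = probs.argmax(dim=-1)            # pick one granularity per sample
    selected = [candidates[i][b] for b, i in enumerate(choice.tolist())]
```

In this sketch, coarser granularities simply hand fewer tokens to the language model, which is the mechanism behind the reported reduction in visual tokens and faster inference; RGLF would then train the router so that its chosen granularity matches the granularity the LMM itself answers best with, without extra manual labels.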