In this paper, we present a study of encoder-free vision-language models (VLMs), which are rapidly narrowing the performance gap with their encoder-based counterparts. We systematically analyze the performance gap among VLMs built on pre-trained vision encoders, discrete tokenizers, and minimal visual layers trained from scratch, and explore the under-examined characteristics of encoder-free VLMs in depth. Building on this analysis, we develop efficient strategies that allow encoder-free VLMs to rival mainstream encoder-based ones and present an improved encoder-free VLM, EVEv2.0. EVEv2.0 properly decomposes and hierarchically associates visual and language information within a unified model to reduce inter-modal interference, and adopts a training strategy designed for effective optimization. Experimental results demonstrate that EVEv2.0 achieves superior data efficiency and strong vision-reasoning capability.
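
To make the idea of decomposing vision and language within a unified model more concrete, the following is a minimal sketch of one possible interpretation: a single transformer block whose tokens share attention but are routed to modality-specific normalization and feed-forward weights via a token-type mask. All class and parameter names (e.g., `ModalitySplitBlock`, `is_vision`) are hypothetical illustrations under this assumption, not the authors' implementation, and the attention is simplified (non-causal) for brevity.

```python
import torch
import torch.nn as nn


class ModalitySplitBlock(nn.Module):
    """Decoder block routing vision/text tokens to separate norm + FFN weights."""

    def __init__(self, dim: int, num_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Separate parameters per modality, intended to reduce inter-modal interference.
        self.norm = nn.ModuleDict({
            "vision": nn.LayerNorm(dim),
            "text": nn.LayerNorm(dim),
        })
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                             nn.Linear(ffn_mult * dim, dim))
            for m in ("vision", "text")
        })

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_vision: (batch, seq) boolean token-type mask.
        attn_out, _ = self.attn(x, x, x, need_weights=False)  # shared attention
        h = x + attn_out
        out = torch.empty_like(h)
        for name, mask in (("vision", is_vision), ("text", ~is_vision)):
            tokens = h[mask]                       # gather tokens of one modality
            tokens = tokens + self.ffn[name](self.norm[name](tokens))
            out[mask] = tokens                     # scatter back to sequence positions
        return out


if __name__ == "__main__":
    block = ModalitySplitBlock(dim=64)
    x = torch.randn(2, 10, 64)
    is_vision = torch.zeros(2, 10, dtype=torch.bool)
    is_vision[:, :4] = True                        # first 4 tokens treated as visual
    print(block(x, is_vision).shape)               # torch.Size([2, 10, 64])
```

The design choice illustrated here is that the two modalities interact only through shared attention, while their token-wise transformations stay separate; how EVEv2.0 actually partitions and hierarchically associates the components is detailed in the architecture section.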