This paper highlights the importance of text-based molecular models, which are accelerating advances in molecular science and molecular design. Existing models are limited by closed-vocabulary tokenizers that capture only a portion of molecular space. We systematically evaluate 34 tokenizers, including 19 chemistry-specific tokenizers, and find significant differences in how well they handle SMILES molecular representations. To evaluate the impact of tokenizer selection, we introduce an n-gram language model as a low-cost proxy, and we validate this proxy by pre-training and fine-tuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers, Smirk and Smirk-GPE, that fully support the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, enabling applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
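
The closed-vocabulary limitation described above can be illustrated with a toy sketch. It is not the Smirk implementation: the regular expressions, the five-molecule corpus, and the function names `atom_tokens`, `glyph_tokens`, and `unknown_tokens` are all hypothetical, and both vocabularies are learned from the corpus purely for brevity. The sketch contrasts an atom-level tokenizer, which treats each bracketed atom as one indivisible token, with a glyph-level tokenizer that splits bracketed atoms into their constituent symbols.

```python
import re

# Toy atom-level pattern (an assumption, not a full OpenSMILES grammar):
# a bracketed atom is kept as ONE token, e.g. "[C@H]".
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Toy pattern for the contents of a bracketed atom: chirality marks,
# element symbols, then any other single character (digits, charges).
GLYPH_RE = re.compile(r"@@|[A-Z][a-z]?|.")

# Hypothetical training corpus used to build both vocabularies.
CORPUS = ["CCO", "CCN", "CC(=O)O", "C[NH3+]", "F[C@](Cl)(Br)I"]


def atom_tokens(smiles):
    """Closed-vocabulary style: bracketed atoms are indivisible tokens."""
    return ATOM_RE.findall(smiles)


def glyph_tokens(smiles):
    """Split each bracketed atom into '[', its glyphs, and ']'."""
    out = []
    for tok in atom_tokens(smiles):
        if tok.startswith("["):
            out += ["[", *GLYPH_RE.findall(tok[1:-1]), "]"]
        else:
            out.append(tok)
    return out


def unknown_tokens(smiles, corpus, tokenize):
    """Return the tokens of `smiles` that never occur in `corpus`."""
    vocab = {tok for s in corpus for tok in tokenize(s)}
    return [tok for tok in tokenize(smiles) if tok not in vocab]


alanine = "C[C@H](N)C(=O)O"  # chiral centre not present in CORPUS
print(unknown_tokens(alanine, CORPUS, atom_tokens))   # ['[C@H]']
print(unknown_tokens(alanine, CORPUS, glyph_tokens))  # []
```

In this sketch the atom-level vocabulary has seen `[C@]` and `[NH3+]` but not `[C@H]`, so the chiral carbon would map to an unknown token. The glyph-level tokenizer composes the same atom from `[`, `C`, `@`, `H`, and `]`, each of which appears in the corpus. A real open-vocabulary tokenizer would presumably derive its glyph set from the specification rather than from training data.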