Vision-Language-Action (VLA) models promise to leverage rich pre-trained representations to build general-purpose robots across diverse tasks and environments. Existing VLA models are typically fine-tuned from vision-language models (VLMs), but direct fine-tuning on robot data often degrades these representations and limits generalization. This paper presents a framework that adapts pre-trained features to robot manipulation while preserving them. The framework consists of three components: (i) a dual-encoder design that pairs a frozen vision encoder, which preserves pre-trained features, with a trainable encoder for task adaptation; (ii) a string-based action tokenizer that converts continuous actions into character sequences aligned with the model's pre-training domain; and (iii) a joint training strategy that combines a vision-language dataset with robot demonstrations, emphasizing spatial reasoning and context. Evaluations in simulation and on real robots show that the proposed method improves robustness to visual distractions, generalization to novel instructions and environments, and overall task success rate compared to baselines.
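
To make the second component concrete, the sketch below shows one plausible way a string-based action tokenizer could discretize continuous actions into character sequences that a language model can consume. The bin count, normalized action range, and space-separated integer encoding are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a string-based action tokenizer (assumed design, not the
# authors' implementation): discretize each action dimension into uniform bins
# and render the bin indices as a character sequence.
import numpy as np

NUM_BINS = 256                        # assumed discretization resolution
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range


def encode_action(action: np.ndarray) -> str:
    """Map a continuous action vector to a space-separated string of bin indices."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Scale each dimension into [0, NUM_BINS - 1] and round to an integer bin.
    bins = np.round(
        (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    ).astype(int)
    return " ".join(str(b) for b in bins)


def decode_action(text: str, dim: int) -> np.ndarray:
    """Invert the encoding: parse bin indices back to approximate continuous values."""
    bins = np.array([int(tok) for tok in text.split()][:dim], dtype=float)
    return ACTION_LOW + bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW)


# Example: a 7-DoF end-effector action round-trips through the string form.
action = np.array([0.1, -0.4, 0.25, 0.0, 0.9, -1.0, 1.0])
tokens = encode_action(action)        # e.g. "140 76 159 128 242 0 255"
recovered = decode_action(tokens, dim=7)
```

Representing actions as plain character strings of this kind keeps the output format close to the text the underlying VLM saw during pre-training, which is the motivation the abstract gives for this component.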