In this paper, we show that the Vision Transformer (ViT), despite its excellent accuracy in medical image classification, learns representations whose semantic meaning is unclear, owing to the model's size and the complexity of its self-attention mechanism. Using a projected gradient-based algorithm, we show that the ViT representation is semantically fragile and sensitive to subtle changes: images with imperceptible differences can have very different representations, while images that belong to semantically different classes can have nearly identical representations. This vulnerability undermines the reliability of the classification results, and we show that even a slight perturbation can decrease classification accuracy by more than 60%. This is the first study to systematically demonstrate the semantic insufficiency of ViT representations in medical image classification, and it raises important challenges for the application of ViT in safety-critical systems.
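To make the idea of a projected gradient-based representation attack concrete, the following is a minimal sketch, not the paper's exact algorithm: it assumes a timm-style PyTorch ViT whose `forward_features` method returns the pre-classifier representation, and all hyperparameters (`eps`, `alpha`, `steps`) and the model/input names are illustrative assumptions.

```python
# Minimal sketch of a projected gradient-based representation attack.
# Assumes a timm-style PyTorch ViT; this is illustrative, not the paper's exact method.
import torch
import timm


def representation_attack(model, x, eps=4 / 255, alpha=1 / 255, steps=40):
    """Perturb `x` within an L-infinity ball of radius `eps` so that the ViT's
    representation of the perturbed image moves as far as possible from the
    representation of the clean image, while the change stays imperceptible."""
    model.eval()
    with torch.no_grad():
        clean_feat = model.forward_features(x)            # reference representation
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        feat = model.forward_features(x_adv)
        loss = torch.norm(feat - clean_feat, p=2)         # push representations apart
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()           # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                 # keep a valid image
    return x_adv.detach()


# Hypothetical usage with a pretrained ViT and a stand-in image batch:
# vit = timm.create_model("vit_base_patch16_224", pretrained=True)
# x = torch.rand(1, 3, 224, 224)
# x_adv = representation_attack(vit, x)
```

The same loop can be inverted (minimizing the distance to the representation of an image from a different class) to produce the second failure mode described above, where semantically different images receive nearly identical representations.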