This paper presents the surprising finding that an autoregressive large language model (LLM) trained only on text tokens inherently develops the ability to understand images and audio, in effect learning to see and hear simply by reading. Whereas existing audio and visual LLMs fine-tune a text LLM to generate text outputs conditioned on image and audio embeddings, our architecture takes image patches, audio waveforms, or tokens as input and outputs embeddings or category labels in a classification pipeline. We demonstrate the generality of the text-pretrained weights by applying them to audio classification on the FSD-50K and GTZAN datasets and to image classification from image patches on CIFAR-10 and Fashion-MNIST. These results support the view that, rather than training a model from scratch for each task, a text LLM learns robust internal circuitry that can be reused by activating the connections needed for a given application.
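The sketch below is a minimal illustration (not the authors' code) of the pipeline the abstract describes: a frozen, text-pretrained transformer reused as the backbone of a classifier, with only a small input projection and a linear head trained on image patches or audio features. The choice of GPT-2 as the backbone, the mean-pooling readout, and all layer names here are illustrative assumptions.

```python
# Minimal sketch: reuse frozen text-LLM weights for non-text classification.
# GPT-2 stands in for the "text LLM"; only the projection and head are trained.
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumed stand-in backbone


class FrozenLLMClassifier(nn.Module):
    def __init__(self, patch_dim: int, num_classes: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        for p in self.backbone.parameters():        # keep text weights fixed
            p.requires_grad = False
        hidden = self.backbone.config.hidden_size   # 768 for gpt2
        self.proj = nn.Linear(patch_dim, hidden)    # patches/audio frames -> token space
        self.head = nn.Linear(hidden, num_classes)  # trainable classification head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, patch_dim) -- flattened image patches or audio frames
        tokens = self.proj(x)
        out = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(out.mean(dim=1))           # mean-pool over the sequence


# Usage: 4x4 patches of a 32x32x3 CIFAR-10 image -> 64 patches of dimension 48.
model = FrozenLLMClassifier(patch_dim=48, num_classes=10)
logits = model(torch.randn(2, 64, 48))
print(logits.shape)  # torch.Size([2, 10])
```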