This paper studies test-time computation that leverages external tools and other deep learning models to improve the performance of large language models (LLMs). Whereas existing methods for integrating non-text modality representations into LLMs require costly supervised training, our proposed In-Context Representation Learning (ICRL) adaptively exploits such representations through few-shot learning. Unlike conventional in-context learning, ICRL conditions on representations from foundation models (FMs) instead of text-label pairs, enabling multimodal inference without fine-tuning. We evaluate the feasibility of ICRL on several molecular tasks and investigate how FM representations are mapped into the LLM, which factors affect ICRL performance, and the mechanisms underlying its effectiveness. ICRL is the first training-free framework to integrate non-text modality representations into text-based LLMs, offering a promising direction for adaptive multimodal generalization.