Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Large Language Models Implicitly Learn to See and Hear Just By Reading

Created by
  • Haebom

Author

Prateek Verma, Mert Pilanci

Outline

This paper presents the surprising finding that training an autoregressive large language model (LLM) on text tokens alone develops an internal ability to understand images and audio, in effect learning to see and hear just by reading. Whereas existing audio and visual LLMs fine-tune a text LLM to generate text outputs conditioned on image or audio embeddings, the architecture here takes image patches, audio waveforms, or tokens as input and produces embeddings or category labels for a classification pipeline. The study demonstrates the generality of text-trained weights by using them to aid audio classification on the FSD-50K and GTZAN datasets, and image classification from image patches on CIFAR-10 and Fashion-MNIST. This supports the idea that, rather than training a model from scratch each time, a text LLM learns robust internal circuitry that can be reused by activating the connections needed for a variety of applications.
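To make the setup described above concrete, here is a minimal sketch of how text-trained LLM weights could serve as a frozen backbone for image-patch classification, assuming a GPT-2 backbone from Hugging Face transformers; the patch size, linear projection, mean pooling, and classifier head are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch: reuse frozen text-LLM transformer blocks as a feature
# extractor for image-patch classification (e.g., CIFAR-10).
# Projection, pooling, and head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumed backbone; the paper's LLM may differ


class TextLLMPatchClassifier(nn.Module):
    def __init__(self, num_classes=10, patch_dim=3 * 8 * 8):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")  # weights learned from text only
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the text-trained weights frozen
        hidden = self.backbone.config.hidden_size
        self.proj = nn.Linear(patch_dim, hidden)    # map raw patches into the LLM's embedding space
        self.head = nn.Linear(hidden, num_classes)  # lightweight classification head

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), e.g. flattened 8x8 RGB patches of a 32x32 image
        embeds = self.proj(patches)
        hidden_states = self.backbone(inputs_embeds=embeds).last_hidden_state
        pooled = hidden_states.mean(dim=1)  # average-pool over patch positions
        return self.head(pooled)


model = TextLLMPatchClassifier()
logits = model(torch.randn(4, 16, 3 * 8 * 8))  # 4 images, 16 patches each
print(logits.shape)  # torch.Size([4, 10])
```

In this sketch only the projection and head are trained, which is what lets the experiment attribute any classification ability to the circuitry already present in the text-only weights; an audio variant would swap the patch projection for a waveform or spectrogram front end.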

Takeaways, Limitations

Takeaways:
By demonstrating that text LLMs can develop image and audio comprehension on their own, the work opens new possibilities for building models that handle multiple modalities.
It suggests a multimodal processing approach that is more efficient than conventional fine-tuning of existing models.
It expands the potential of text LLMs and broadens their applicability across a wide range of applications.
Limitations:
More extensive experiments and analysis are needed to evaluate the generalization ability and performance of the proposed architecture.
Further research is needed on diverse datasets and application areas.
A more detailed explanation is needed of the mechanism by which images and audio are understood internally.