In this paper, we propose document-level Knowledge Modules (KMs) to address the challenge of dynamically integrating new or rapidly changing information into pre-trained large language models, especially when data is scarce or when dealing with personal and professional documents. KMs are lightweight components implemented as parameter-efficient LoRA modules, trained to store information about new documents and easily plugged into the model as needed. We point out the limitations of training KMs with standard next-token prediction and instead propose Deep Context Distillation, in which the KM is trained to match the hidden states and logits of a teacher model that sees the document in context. We show that this approach outperforms standard next-token prediction and pre-instruction tuning on two datasets, and also highlight the synergy between KMs and RAG.
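To make the distillation objective concrete, the sketch below illustrates one way such a loss could be written in PyTorch for HuggingFace-style causal LMs: the teacher conditions on the document followed by a text chunk, while the student (base model plus KM LoRA) sees only the chunk and is trained to match the teacher's logits and per-layer hidden states. The function name, the per-layer L1 term, and the weighting `alpha` are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def deep_context_distillation_loss(teacher, student, doc_ids, chunk_ids, alpha=1.0):
    """Distill a document-conditioned teacher into a document-free student.

    teacher, student: causal LMs returning .logits and .hidden_states
    doc_ids, chunk_ids: token id tensors of shape [batch, seq_len]
    alpha: weight of the hidden-state matching term (assumed hyperparameter)
    """
    n = chunk_ids.size(1)
    with torch.no_grad():
        # Teacher sees the document in context, followed by the chunk.
        t_out = teacher(
            input_ids=torch.cat([doc_ids, chunk_ids], dim=1),
            output_hidden_states=True,
        )
        t_logits = t_out.logits[:, -n:]                      # chunk positions only
        t_hidden = [h[:, -n:] for h in t_out.hidden_states]  # per-layer states

    # Student (base model + KM LoRA) sees only the chunk, without the document.
    s_out = student(input_ids=chunk_ids, output_hidden_states=True)
    s_logits = s_out.logits
    s_hidden = s_out.hidden_states

    # KL divergence between teacher and student next-token distributions.
    kl = F.kl_div(
        F.log_softmax(s_logits, dim=-1).flatten(0, 1),
        F.log_softmax(t_logits, dim=-1).flatten(0, 1),
        log_target=True,
        reduction="batchmean",
    )
    # L1 distance between hidden states, averaged over layers.
    hid = sum(F.l1_loss(s, t) for s, t in zip(s_hidden, t_hidden)) / len(s_hidden)
    return kl + alpha * hid
```

In practice only the LoRA parameters of the student would receive gradients, so the KM alone absorbs the document's content while the base weights stay frozen.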