This paper proposes a memory augmentation architecture and pre-training strategy that address modern language models' reliance on parameter scaling for performance gains. The architecture lets a small language model access a large hierarchical parametric memory bank that encodes world knowledge. During pre-training and inference, small memory blocks are retrieved from the bank based on context and added to the model. Long-tail world knowledge is thus stored in the memory parameters, while the small language model learns general knowledge and general reasoning capabilities. Experimentally, we show that adding an 18M-parameter memory, fetched from a 4.6B-parameter memory bank, to a 160M-parameter model yields performance comparable to that of a conventional model with more than twice as many parameters. We also investigate the optimal type and size of parametric memories for Transformer models, scaling memory banks up to 21B parameters. We find that the proposed hierarchical feed-forward memories perform robustly within the Transformer architecture, whether integrated during pre-training or added post hoc to a pre-trained model.
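To make the retrieval-and-add mechanism concrete, the following is a minimal sketch, not the paper's implementation, of how small feed-forward memory blocks might be fetched from a larger parametric memory bank by context and added to a Transformer feed-forward layer. All class names, parameter names, and the flat key-based retrieval are illustrative assumptions; the hierarchical organization of the bank described above is omitted for brevity.

```python
# Illustrative sketch (assumed design, not the paper's code): a bank of small
# feed-forward memory blocks, context-based retrieval of a few blocks, and a
# base FFN whose output is augmented by the retrieved blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardMemoryBlock(nn.Module):
    """A small two-layer feed-forward block acting as one parametric memory unit."""

    def __init__(self, d_model: int, d_mem: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_mem, bias=False)
        self.down = nn.Linear(d_mem, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class MemoryBank(nn.Module):
    """A bank of memory blocks plus one key vector per block used for retrieval."""

    def __init__(self, num_blocks: int, d_model: int, d_mem: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            FeedForwardMemoryBlock(d_model, d_mem) for _ in range(num_blocks)
        )
        self.keys = nn.Parameter(torch.randn(num_blocks, d_model))

    def retrieve(self, context: torch.Tensor, k: int) -> list[int]:
        # Score every block key against a pooled context embedding, keep top-k.
        scores = self.keys @ context  # shape: (num_blocks,)
        return torch.topk(scores, k).indices.tolist()


class MemoryAugmentedFFN(nn.Module):
    """Base feed-forward layer whose output is augmented by retrieved memories."""

    def __init__(self, d_model: int, d_ff: int, bank: MemoryBank):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.bank = bank

    def forward(self, x: torch.Tensor, block_ids: list[int]) -> torch.Tensor:
        out = self.base(x)
        for i in block_ids:  # add the contribution of each fetched memory block
            out = out + self.bank.blocks[i](x)
        return out


if __name__ == "__main__":
    d_model, d_ff, d_mem = 256, 1024, 64
    bank = MemoryBank(num_blocks=1024, d_model=d_model, d_mem=d_mem)
    ffn = MemoryAugmentedFFN(d_model, d_ff, bank)

    tokens = torch.randn(1, 32, d_model)   # (batch, seq, d_model)
    context = tokens.mean(dim=(0, 1))      # pooled context embedding
    ids = bank.retrieve(context, k=4)      # fetch a few relevant blocks
    print(ffn(tokens, ids).shape)          # torch.Size([1, 32, 256])
```

Only the few retrieved blocks are loaded alongside the small model at any time, which is what keeps the deployed parameter count far below the size of the full memory bank.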