This paper addresses the limited understanding of neural networks' learning difficulties and catastrophic forgetting (CF) in non-stationary environments. We systematically investigate how model size and the degree of feature learning affect continual learning. We reconcile conflicting findings from prior work by distinguishing between lazy and rich training regimes through a parameterization of the architecture, and we show that increasing model width is beneficial only insofar as it reduces the amount of feature learning, i.e., makes training lazier. Using the framework of dynamical mean-field theory, we study the infinite-width dynamics of models in the feature-learning regime and characterize CF, extending prior theoretical results that were limited to the lazy regime. We examine the interplay among feature learning, task non-stationarity, and forgetting, and find that strong feature learning is beneficial only when tasks are similar. We identify a transition mediated by task similarity, in which models effectively leave the lazy regime, where forgetting is low, and enter the rich regime, where forgetting is severe. Finally, we show that networks achieve their best performance at an intermediate level of feature learning that depends on task non-stationarity, and that this relationship holds across model sizes. This study provides an integrated account of the roles of scale and feature learning in continual learning.
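For readers unfamiliar with the lazy/rich dichotomy, a minimal sketch of one common output-rescaling parameterization from the lazy-training literature is given below; the richness parameter $\gamma$ and learning rate $\eta$ are illustrative conventions and may differ from the paper's own notation:
\[
  f_\gamma(\theta, x) \;=\; \frac{1}{\gamma}\bigl(f(\theta, x) - f(\theta_0, x)\bigr),
  \qquad
  \dot{\theta} \;=\; -\,\eta\,\gamma^{2}\,\nabla_\theta\,\mathcal{L}\bigl(f_\gamma(\theta,\cdot)\bigr).
\]
Under this scaling the predictor dynamics are comparable across $\gamma$ to leading order, while the weight displacement grows with $\gamma$: $\gamma \to 0$ recovers lazy (kernel) dynamics with nearly fixed features, whereas larger $\gamma$ drives the network into the rich, feature-learning regime.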