Large language models (LLMs) demonstrate impressive performance when combined with reinforcement learning (RL), but leveraging this potential requires an intermediate training step. An effective intermediate training step should identify a condensed set of useful actions and enable fast selection among them via online RL. This study presents the first theoretical results on how intermediate training shapes post-training: an effective intermediate step characterizes an action subspace that minimizes both the value approximation error introduced by pruning and the RL error incurred during subsequent planning. The results reveal two key determinants of the effectiveness of intermediate training: the efficiency of pruning, which shapes the prior of the initial RL policy, and its impact on RL convergence, which bounds the extent to which that policy can be improved through online interactions. Building on these findings, this study proposes Reasoning as Action Abstractions (RA3), a scalable intermediate training algorithm. RA3 derives a sequential variational lower bound and optimizes it by iteratively discovering temporally coherent latent structures via RL and then fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of RA3.
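To make the alternating scheme concrete, below is a minimal sketch of the outer loop implied by the description above. It is not the paper's implementation: the helpers `discover_latent_structures` and `finetune_on_bootstrapped_data` are hypothetical placeholders standing in for the RL discovery step and the supervised fine-tuning step, and only the alternation between them is illustrated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatentTrace:
    """A reasoning trace annotated with latent action segments (hypothetical structure)."""
    prompt: str
    latent_actions: List[str]
    completion: str
    reward: float


def discover_latent_structures(model, prompts: List[str]) -> List[LatentTrace]:
    """Placeholder for the RL step: sample traces and retain temporally coherent
    latent structures that improve the variational objective (details assumed)."""
    raise NotImplementedError


def finetune_on_bootstrapped_data(model, traces: List[LatentTrace]):
    """Placeholder for the fine-tuning step on the bootstrapped traces."""
    raise NotImplementedError


def ra3_intermediate_training(model, prompts: List[str], num_rounds: int = 3):
    """Sketch of the iterative optimization: alternate between discovering latent
    structures via RL and fine-tuning the model on the resulting bootstrapped data."""
    for _ in range(num_rounds):
        traces = discover_latent_structures(model, prompts)   # discovery (RL) step
        model = finetune_on_bootstrapped_data(model, traces)  # fine-tuning step
    return model
```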