Effectively integrating diverse sensory information is crucial for robotic manipulation. However, in existing feature-concatenation approaches, dominant sensory signals such as vision tend to overwhelm tactile signals that are essential for touch-dependent tasks. This paper proposes a method that decomposes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and adaptively combines their contributions with a router network. This design also allows new representations to be integrated incrementally. We show that our approach outperforms feature-concatenation baselines on simulated RLBench tasks and on real-world tasks including occluded-object grasping, in-hand spoon reorientation, and puzzle insertion. We further demonstrate robustness to physical disturbances and sensor damage, and an importance analysis shows that the policy adaptively shifts its reliance between sensory inputs.
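
The following is a minimal, hypothetical sketch (not the authors' implementation) of the core idea described above: one diffusion denoiser per sensory representation, with a router network producing per-modality weights that mix the experts' noise predictions at each denoising step. All class and function names (Denoiser, Router, combined_noise_prediction) and the simple MLP architectures are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Denoiser(nn.Module):
    """Per-representation noise predictor epsilon(a_noisy, t, z) for one modality."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, timestep, obs_feat):
        t = timestep.float().unsqueeze(-1)  # (B, 1)
        return self.net(torch.cat([noisy_action, t, obs_feat], dim=-1))


class Router(nn.Module):
    """Maps all observation features to softmax mixing weights over modalities."""

    def __init__(self, obs_dims, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(obs_dims), hidden), nn.ReLU(),
            nn.Linear(hidden, len(obs_dims)),
        )

    def forward(self, obs_feats):
        logits = self.net(torch.cat(obs_feats, dim=-1))
        return torch.softmax(logits, dim=-1)  # (B, num_modalities)


def combined_noise_prediction(denoisers, router, noisy_action, timestep, obs_feats):
    """One denoising step: weighted sum of each expert's noise prediction."""
    weights = router(obs_feats)  # (B, M)
    preds = torch.stack(
        [d(noisy_action, timestep, z) for d, z in zip(denoisers, obs_feats)],
        dim=1,
    )  # (B, M, action_dim)
    return (weights.unsqueeze(-1) * preds).sum(dim=1)
```

Under this kind of decomposition, adding a new representation would amount to appending another expert denoiser and expanding the router's output, without retraining a monolithic fused policy from scratch.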