This paper presents a reasoning approach for large language models (LLMs) that uses continuous rather than discrete tokens during the Chain-of-Thought (CoT) phase. The intuition is that a mixture of continuous tokens can simultaneously simulate a superposition of multiple reasoning paths, and prior theoretical work has shown that continuous tokens have significantly greater expressive power and can solve certain problems more efficiently. However, previous studies have either applied continuous tokens only at inference time to models pre-trained on discrete tokens, or have distilled continuous CoTs from reference discrete CoTs, whose computational cost limits the CoT to very few tokens. This work presents the first scalable method for learning continuous CoTs via reinforcement learning (RL), without distillation from reference discrete CoTs. We use "soft" tokens, i.e., mixtures of tokens together with noise on the input embeddings, to drive RL exploration; this keeps the computational overhead minimal and enables learning continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models of up to 8B parameters, training with continuous CoTs matches discrete-token CoTs at pass@1 and surpasses them at pass@32, indicating that continuous CoT training yields a greater diversity of CoTs. The best performance is obtained by training with continuous CoT tokens and running inference with discrete tokens, which means the "soft"-trained model can be deployed in the standard way. Finally, we show that RL training with continuous CoTs better preserves the base model's predictions on out-of-domain tasks, acting as a gentler intervention on the base model.
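
To make the "soft"-token idea concrete, the following is a minimal sketch, not the authors' implementation, of how one CoT step could feed a probability-weighted mixture of token embeddings, plus Gaussian noise for RL exploration, back as the next input embedding. The names `next_soft_token`, `embed_matrix`, `temperature`, and `noise_std` are illustrative assumptions, not terms from the paper.

```python
import torch
import torch.nn.functional as F

def next_soft_token(logits, embed_matrix, temperature=1.0, noise_std=0.0):
    """Form a continuous ("soft") input embedding for the next CoT step.

    Instead of sampling a single discrete token, mix all token embeddings
    weighted by the model's output distribution, then optionally add
    Gaussian noise to the mixed embedding to drive RL exploration.

    logits:       (vocab_size,) output logits at the current position
    embed_matrix: (vocab_size, hidden_dim) input embedding table
    """
    probs = F.softmax(logits / temperature, dim=-1)   # (vocab_size,)
    soft_embedding = probs @ embed_matrix             # (hidden_dim,)
    if noise_std > 0:
        soft_embedding = soft_embedding + noise_std * torch.randn_like(soft_embedding)
    return soft_embedding
```

At deployment, this mixing step can simply be dropped in favor of ordinary discrete decoding, consistent with the observation above that the best results come from training with continuous CoTs and running inference with discrete tokens.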