Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Soft Tokens, Hard Truths

Created by
  • Haebom

Author

Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier

Outline

This paper presents a reasoning approach for large language models (LLMs) that uses continuous tokens instead of discrete tokens during the Chain-of-Thought (CoT) phase. The intuition is that a mixture of continuous tokens can represent a superposition of several reasoning paths at once, and prior theoretical work has shown that continuous tokens are strictly more expressive and can solve certain problems more efficiently. Previous studies, however, either applied continuous tokens only at inference time on models pre-trained with discrete tokens, or distilled continuous CoTs from reference discrete CoTs, whose computational cost limits the CoT to very few tokens.

This study presents the first scalable method for learning continuous CoTs via reinforcement learning (RL) without distillation from a reference discrete CoT. By using "soft" tokens (mixtures of tokens combined with noise on the input embeddings) for RL exploration, the method keeps computational overhead minimal and makes it possible to learn continuous CoTs with hundreds of tokens.

On mathematical reasoning benchmarks with Llama and Qwen models (up to 8B), training with continuous CoTs matches discrete-token CoT training at pass@1 and surpasses it at pass@32, producing a greater diversity of CoTs. Best performance is obtained by training with continuous CoT tokens and then running inference with discrete tokens, which means the "soft" model can be deployed in the standard way. Finally, continuous CoT RL training better preserves the base model's predictions on out-of-domain tasks, exerting a gentler influence on the base model.
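The "soft" token idea can be pictured with a short sketch. The following is a minimal illustration only, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`; the function name `soft_cot_step` and the hyperparameters `temperature` and `noise_std` are placeholders and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def soft_cot_step(model, embedding_matrix, inputs_embeds,
                  temperature=1.0, noise_std=0.1):
    """One hypothetical 'soft token' CoT step: mix token embeddings by the
    model's output distribution and perturb with Gaussian noise.
    `temperature` and `noise_std` are assumed hyperparameters."""
    # Forward pass on continuous input embeddings instead of token ids.
    outputs = model(inputs_embeds=inputs_embeds)
    logits = outputs.logits[:, -1, :]                # next-token logits
    probs = F.softmax(logits / temperature, dim=-1)  # mixture weights

    # Expected embedding under the model's distribution: a "token mixture".
    soft_token = probs @ embedding_matrix            # (batch, hidden)

    # Input-embedding noise used for RL exploration
    # (the exact noise scheme in the paper may differ).
    soft_token = soft_token + noise_std * torch.randn_like(soft_token)

    # Append the soft token as the next continuous input embedding.
    return torch.cat([inputs_embeds, soft_token.unsqueeze(1)], dim=1)
```

In this sketch, `embedding_matrix` would be the model's input embedding table (e.g., `model.get_input_embeddings().weight`), and the noise term is what provides the exploration signal during RL training of the CoT phase.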

Takeaways, Limitations

Takeaways:
• A scalable method for efficiently learning continuous CoTs with reinforcement learning, without distillation from discrete CoTs.
• Continuous CoTs with hundreds of tokens can be trained with minimal computational overhead.
• On mathematical reasoning benchmarks, continuous CoT training improves performance and diversity over discrete-token CoT, particularly at pass@32 (see the pass@k sketch after this list).
• Training with continuous CoTs and then inferring with discrete tokens gives the best performance, allowing standard deployment.
• The base model's predictions on out-of-domain tasks are better preserved.
Limitations:
• Results are reported only for mathematical reasoning benchmarks; generalization to other task types requires further study.
• Experiments cover models only up to 8B parameters; scalability to larger models remains to be verified.
• The definition of "soft" tokens, the noise-injection scheme, and hyperparameter tuning are not described in detail.
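Since the headline results compare pass@1 and pass@32, it may help to recall how pass@k is typically estimated. The sketch below uses the standard unbiased estimator of Chen et al. (2021); it is a general evaluation formula, not a procedure specific to this paper, and the sample counts in the comment are purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers only: with n=64 samples and c=8 correct,
# pass@1 = 0.125 while pass@32 is close to 1, which is why a method that
# generates more diverse CoTs can shine at pass@32 even when pass@1 is similar.
```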