Previous LLM safety research has focused mainly on instilling safe behavior during training, but recent studies have shown that these methods are vulnerable to a wide range of jailbreak attacks. Meanwhile, inference-time scaling has greatly improved LLM reasoning performance, yet its potential for safety assurance has not been studied. This work pioneers inference scaling for robust and effective LLM safety against emerging threats. It shows that existing inference scaling techniques, despite their success on reasoning tasks, perform poorly in the safety setting, even worse than basic approaches such as best-of-N sampling. This inefficiency stems from a newly identified challenge, the exploration-efficiency dilemma, caused by the high computational overhead of frequent process reward model (PRM) evaluations. To overcome this dilemma, the study proposes SAFFRON, a novel inference scaling paradigm designed specifically for safety assurance. Its core is a multi-branch reward model (MRM) that significantly reduces the number of required reward model evaluations. To realize this paradigm, the authors further propose (i) a partial-supervision training objective for the MRM, (ii) conservative search constraints to prevent out-of-distribution exploration, and (iii) a Trie-based key-value caching strategy that enables cache sharing across sequences during tree traversal. Extensive experiments verify the effectiveness of the method. In addition, the authors release a token-level safety reward dataset (Safety4M) and a trained multi-branch reward model (Saffron-1) to accelerate future LLM safety research. The code, model, and data are available at https://github.com/q-rz/saffron, and the project homepage is https://q-rz.github.io/p/saffron.
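To make the efficiency argument concrete, the minimal Python sketch below contrasts PRM-guided search, which spends one reward-model evaluation per candidate continuation, with a multi-branch reward model that scores every next-token branch in a single forward pass. All names (prm_score, mrm, top_k, candidates, prefix) are illustrative assumptions, not the paper's actual API.

```python
# Illustrative sketch only: contrasts the cost profile of a process reward model (PRM)
# with a multi-branch reward model (MRM). Names and shapes are assumptions for clarity.
import torch

def prm_guided_step(candidates, prm_score):
    # PRM-style selection: one reward-model evaluation PER candidate continuation,
    # so exploring k branches costs k forward passes of the reward model.
    scores = torch.tensor([prm_score(seq) for seq in candidates])
    return candidates[int(scores.argmax())]

def mrm_guided_step(prefix, mrm, top_k=8):
    # MRM-style expansion: ONE evaluation returns a reward for every token in the
    # vocabulary, so all next-token branches are scored with a single forward pass.
    branch_rewards = mrm(prefix)                       # shape: (vocab_size,)
    safe_tokens = torch.topk(branch_rewards, top_k).indices
    return [prefix + [int(t)] for t in safe_tokens]    # expand the search tree cheaply
```

Under this reading, the paper's conservative search constraints would further restrict which of the expanded tokens may be kept, so the search does not wander into sequences the reward model never saw during training.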
◦ Overcomes the limitations of existing LLM safety research and proposes a new way to improve safety through inference-time scaling.
◦ Identifies a new challenge, the exploration-efficiency dilemma, and proposes SAFFRON as a solution.
◦ Introduces novel techniques including the multi-branch reward model (MRM), a partial-supervision training objective, conservative search constraints, and a Trie-based key-value caching strategy (see the sketch after this list).
◦ Supports follow-up research through the release of the Saffron-1 model and the Safety4M dataset.
• Limitations:
◦ SAFFRON's effectiveness is demonstrated on specific datasets and models; its generalization to other settings requires further study.
◦ The design and training of the multi-branch reward model are complex, which may make it difficult to implement and adopt.
◦ The diversity of jailbreak attacks may not be fully covered, leaving possible vulnerability to new types of attacks.
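For the Trie-based key-value caching strategy mentioned above, the sketch below illustrates the general idea of prefix sharing during tree traversal. It is a simplified data structure under assumed names (TrieKVCache, lookup, insert), not the released implementation.

```python
# Illustrative sketch (not the paper's code): sequences that share a token prefix reuse
# the cached key/value states stored along the Trie path, so sibling branches explored
# during tree search never recompute the shared prefix.
class TrieKVCache:
    def __init__(self):
        self.children = {}   # token id -> child node
        self.kv = None       # cached key/value state for the prefix ending at this node

    def lookup(self, tokens):
        """Return the cached states along the longest stored prefix and its length."""
        node, states = self, []
        for depth, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None or child.kv is None:
                return states, depth          # only the first `depth` tokens were cached
            states.append(child.kv)
            node = child
        return states, len(tokens)

    def insert(self, tokens, kv_states):
        """Store one state per token so later sequences with the same prefix reuse them."""
        node = self
        for tok, kv in zip(tokens, kv_states):
            node = node.children.setdefault(tok, TrieKVCache())
            node.kv = kv
```

In a tree search, the model would then only be run on the suffix that lookup did not cover, which is where the cache sharing across sibling sequences comes from.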