Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Active Attacks: Red-teaming LLMs via Adaptive Environments

Created by
  • Haebom

Author

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim

Outline

This paper addresses the problem of generating diverse attack prompts that elicit harmful behaviors, for use in safety fine-tuning large language models (LLMs). Rather than engineering prompts by hand, an attacker LLM is trained with reinforcement learning (RL), using a toxicity classifier as the reward, to generate such prompts automatically. Inspired by the active learning paradigm, which encourages adaptive exploration, the paper introduces "Active Attacks," a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves: the victim is periodically safety fine-tuned on the attacks discovered so far, forcing the attacker to uncover new vulnerabilities. Active Attacks is a simple plug-and-play module that integrates seamlessly with existing RL objectives. It outperforms prior RL-based methods (including GFlowNets, PPO, and REINFORCE), improving the cross-attack success rate from 0.07% to 31.28% relative to the previous state of the art, GFlowNets, at only about a 6% increase in computational cost.
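The alternating loop described above can be sketched as a toy simulation. Everything here (the `toxicity_score` stand-in, the set of attack "modes", the patching logic) is a hypothetical illustration of the idea, not the paper's actual models, reward, or training code:

```python
import random

def toxicity_score(response: str) -> float:
    # Hypothetical reward stand-in for the paper's toxicity classifier:
    # fraction of "harmful" tokens in the victim's response.
    tokens = response.split()
    return sum(t == "harmful" for t in tokens) / max(len(tokens), 1)

def active_attacks(rounds: int = 3, rl_steps: int = 5, seed: int = 0):
    rng = random.Random(seed)
    patched = set()    # attack modes the victim has been fine-tuned against
    discovered = []    # successful attacks collected across rounds
    modes = [f"mode{i}" for i in range(10)]  # hypothetical local attack modes

    for _ in range(rounds):
        # (1) RL phase: the attacker searches for prompts the *current*
        # victim still fails on; already-patched modes yield zero reward,
        # pushing exploration toward new vulnerabilities.
        for _ in range(rl_steps):
            prompt = rng.choice(modes)
            response = "safe" if prompt in patched else "harmful"
            if toxicity_score(response) > 0.5:
                discovered.append(prompt)
        # (2) Safety fine-tuning phase: the victim is patched on all
        # attacks found so far, so the next round must find new modes.
        patched.update(discovered)
    return discovered, patched

attacks, patched = active_attacks()
```

The key design point mirrored here is that the reward landscape shifts each round: patching the victim zeroes out rewards on known modes, which is what produces the easy-to-hard exploration curriculum and the coverage of multiple local attack modes.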

Takeaways, Limitations

Takeaways:
Automatically generates diverse attack prompts that can be used for safety fine-tuning of LLMs.
Shows superior performance compared to existing RL-based methods (more than a 400x improvement over GFlowNets).
Active Attacks is a simple plug-and-play module that can be easily integrated into existing RL objectives.
Periodic safety fine-tuning of the victim encourages the attacker to continuously search for new vulnerabilities.
Yields a progressive exploration curriculum that moves from easy to difficult attack modes.
Discovers various local attack modes step by step and combines them to cover a wide multi-mode distribution.
Limitations:
The paper does not explicitly discuss its Limitations.