Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Augmented Adversarial Trigger Learning

Posted by
  • Haebom

Author

Zhe Wang, Yanjun Qi

Outline

To overcome the limitations of existing adversarial trigger learning, this paper proposes Adversarial Trigger Learning with Augmented objectives (ATLA). ATLA replaces the standard negative log-likelihood loss with a weighted loss formulation that steers learned adversarial triggers toward response-format tokens. As a result, an adversarial trigger can be learned from just a single query-response pair while still generalizing well to other similar queries. Trigger optimization is further strengthened by an auxiliary loss that suppresses evasive responses. Experimental results show that ATLA outperforms existing state-of-the-art techniques, achieving a near-100% success rate while requiring 80% fewer queries. The learned adversarial triggers also transfer well to new queries and new LLMs. The source code is publicly available.
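
As a rough illustration (not the authors' exact formulation), the sketch below shows what such an augmented objective might look like in PyTorch: a weighted negative log-likelihood that up-weights response-format tokens, plus an auxiliary term that penalizes refusal tokens to suppress evasive answers. The function name, the weighting scheme, and the refusal-token handling are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def augmented_trigger_loss(logits, target_ids, response_format_mask,
                           refusal_ids, format_weight=2.0, aux_weight=0.5):
    """Sketch of an ATLA-style objective; names and weights are illustrative.

    logits:               (seq_len, vocab) next-token logits over the target response
    target_ids:           (seq_len,) desired affirmative response tokens
    response_format_mask: (seq_len,) bool, True for response-format tokens
                          (e.g. "Sure, here is ...") that should be up-weighted
    refusal_ids:          token ids that start evasive replies (e.g. "Sorry", "cannot")
    """
    # Weighted negative log-likelihood: response-format tokens count more.
    nll = F.cross_entropy(logits, target_ids, reduction="none")
    weights = torch.where(response_format_mask,
                          torch.full_like(nll, format_weight),
                          torch.ones_like(nll))
    weighted_nll = (weights * nll).sum() / weights.sum()

    # Auxiliary term: push down the probability mass assigned to refusal
    # tokens at the first response position, discouraging evasive answers.
    first_token_probs = logits[0].log_softmax(dim=-1).exp()
    refusal_penalty = first_token_probs[refusal_ids].sum()

    return weighted_nll + aux_weight * refusal_penalty
```

In practice this loss would be minimized over the trigger tokens with a discrete search procedure (e.g. a greedy coordinate-gradient style update), which is outside the scope of this sketch.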

Takeaways, Limitations

Takeaways:
  • Adversarial triggers can be learned from a single question-answer pair.
  • Higher success rate and efficiency than existing methods (80% fewer queries).
  • Learned triggers generalize well and transfer to new queries and LLMs.
  • Effective for exploiting LLM vulnerabilities and extracting system prompts.
  • Reproducibility is ensured through publicly released source code.
Limitations:
  • Generalization to specific LLMs or query types may require further study.
  • The design and weighting of the auxiliary loss function need further optimization research.
  • ATLA's robustness against new defense techniques remains to be evaluated.