
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Created by
  • Haebom

Authors

Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen

💡 Overview

This paper argues that existing attacks on LLMs operate mainly at the prompt or embedding level and therefore overlook vulnerabilities in the deeper model structure. To address this gap, the authors propose a new framework called the Safety Attention Head Attack (SAHA). SAHA probes vulnerabilities in deep attention heads and maximizes attack effectiveness through an "Ablation-Impact Ranking" strategy and a "Layer-Wise Perturbation" technique, achieving a 14% higher attack success rate (ASR) than existing methods.

🔑 Implications and Limitations

  • The safety of publicly released LLMs is more fragile than it appears: vulnerabilities in the deeper model structure can break it easily, so defenses against them are needed.
  • Attacks on deep attention heads point to a new direction for strengthening LLM security and are effective at detecting latent security vulnerabilities.
  • The proposed SAHA method focuses on attacks at the attention-head level; further research is needed to explore vulnerabilities in the broader LLM architecture and to develop defense strategies against other types of attacks.