Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Posted by
  • Haebom

Authors

Xiao Li, Wei Zhang, Zhuhong Li, Qiongxiu Li, Shei PernChua, BingZe Lee, Jinghao Cui, Yifan Huang, Xiaolin Hu

💡 Overview

This paper proposes Faster-GCG to address the low sample efficiency of GCG (Greedy Coordinate Gradient), an automated jailbreak attack against aligned large language models (LLMs). Faster-GCG targets three weaknesses of GCG: inaccurate gradient estimation, inefficient sampling, and repeated evaluation of the same suffixes. Addressing them improves sample efficiency by up to 8x and cuts computation time by a factor of 7. With these changes, the method reaches an average jailbreak success rate of 78.1% across five LLMs and 88.7% on the Qwen3.5-4B model, outperforming state-of-the-art white-box jailbreak methods.

🔑 Implications and Limitations

• Faster-GCG substantially improves the sample efficiency of the original GCG attack, which makes LLM jailbreak attacks more feasible in real-world settings.
• The proposed techniques (distance-based regularization, temperature-controlled sampling, and marking of visited suffixes) offer an effective new approach to discrete optimization problems.
• The study again underscores the need for effective defenses against jailbreak attacks that threaten LLM safety.
• Faster-GCG's success rate is still below 100%, and its performance may vary with the target model and the attack scenario. Further validation across more models and attack settings, along with research on defense techniques, is needed.