
Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

Created by
  • Haebom

Author

J Alex Corll

💡 Overview

This paper proposes "Peak + Accumulation", a new proxy-level scoring formula for detecting multi-turn prompt injection attacks against LLMs. The formula combines per-turn risk scores into a conversation-level risk score without calling an LLM, and it addresses a fundamental flaw in the conventional weighted-average approach. The proposed method achieves 90.8% recall at a 1.20% false positive rate, which the authors present as evidence that it detects these attacks effectively.
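The summary does not reproduce the formula itself, so the following is only a minimal sketch of the general idea, not the paper's method. It contrasts a plain weighted average, which does not grow with the number of suspicious turns, with a hypothetical aggregator that takes the peak turn score and adds a capped bonus for each additional suspicious turn. The function names and the parameters `flag_threshold`, `step`, and `cap` are assumptions made for illustration.

```python
def weighted_average(turn_scores, weights=None):
    """Baseline aggregator: weighted mean of per-turn risk scores in [0, 1]."""
    if not turn_scores:
        return 0.0
    if weights is None:
        weights = [1.0] * len(turn_scores)
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)


def peak_plus_accumulation_sketch(turn_scores, flag_threshold=0.3,
                                  step=0.05, cap=0.4):
    """Hypothetical aggregator (NOT the paper's formula).

    Starts from the highest per-turn score (the peak) and adds a bonus
    for every suspicious turn beyond the first (the accumulation).
    flag_threshold, step, and cap are illustrative assumptions.
    """
    if not turn_scores:
        return 0.0
    peak = max(turn_scores)
    flagged = sum(1 for s in turn_scores if s >= flag_threshold)
    accumulation = min(cap, step * max(0, flagged - 1))
    return min(1.0, peak + accumulation)


single_turn = [0.5]        # one suspicious turn
persistent = [0.5] * 20    # twenty equally suspicious turns

# The average cannot tell the two conversations apart.
print(weighted_average(single_turn), weighted_average(persistent))
# The sketch scores the persistent conversation higher.
print(peak_plus_accumulation_sketch(single_turn),
      peak_plus_accumulation_sketch(persistent))
```

Both functions are pure arithmetic over scores that a proxy would already have, which is consistent with the summary's point that no additional LLM call is needed at aggregation time.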

🔑 Implications and Limitations

• Improved multi-turn attack detection: Unlike prior work that focused on single-turn detection, the paper presents a practical method for detecting prompt injection attacks that unfold covertly over multiple turns.
• Efficient scoring without LLM calls: Detection requires no additional LLM calls, which reduces cost and latency.
• Extended security risk modeling: The paper combines concepts such as change-point detection, Bayesian inference, and security risk-based alerting into a new security detection methodology.
• Importance of parameter sensitivity analysis: A sensitivity analysis of the formula's key parameter (rho) found that detection performance improves sharply at a particular threshold, which suggests that parameter tuning matters when the method is deployed in real systems.
• Open-source release: The algorithm, pattern library, and evaluation tools are released publicly to support the research community.
• Need for continued research on new threats: Although the proposed method is effective, prompt injection attacks keep evolving, so detection research must continue to cover new attack techniques.