
Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Posted by
  • Haebom

Authors

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, Hua Dai, Fu Xiao

💡 Overview

This paper addresses the vulnerability of safety alignment in large language models (LLMs) to harmful fine-tuning (HFT). Observing that existing defenses are circumvented because attackers can exploit the redundancy of the high-dimensional parameter space, the authors propose a new defense, Safety Bottleneck Regularization (SBR), which anchors the final hidden state of harmful queries to the hidden state produced by the safety-aligned model. Experiments show that even a single safety anchor markedly lowers the harmful score while preserving performance on benign tasks.
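The anchoring idea can be illustrated with a small sketch. The summary does not give the exact form of the SBR objective, so everything below is an assumption made for illustration: the squared Euclidean distance, the additive combination with the task loss, and the names `anchor_penalty`, `sbr_style_loss`, and `weight` are all hypothetical. Hidden states are represented as plain lists of floats so the example runs without any framework.

```python
from typing import Sequence

Vector = Sequence[float]


def anchor_penalty(current: Vector, anchor: Vector) -> float:
    """Squared Euclidean distance between two hidden-state vectors.

    Assumption: the summary only says the hidden state is "anchored";
    squared L2 distance is one plausible way to measure the drift.
    """
    if len(current) != len(anchor):
        raise ValueError("hidden states must have the same dimension")
    return sum((c - a) ** 2 for c, a in zip(current, anchor))


def sbr_style_loss(task_loss: float,
                   current_hidden: Vector,
                   anchor_hidden: Vector,
                   weight: float = 1.0) -> float:
    """Fine-tuning loss plus a weighted penalty for drifting from the anchor.

    `weight` is a hypothetical trade-off coefficient, not a value from the paper.
    """
    return task_loss + weight * anchor_penalty(current_hidden, anchor_hidden)


# Toy values: the final hidden state of the aligned model on one harmful
# query (the "safety anchor") and the state after some fine-tuning drift.
anchor = [0.5, -1.0, 2.0]
drifted = [1.5, -1.0, 0.0]

print(anchor_penalty(anchor, anchor))                     # 0.0 (no drift, no penalty)
print(sbr_style_loss(0.25, drifted, anchor, weight=0.5))  # 2.75
```

In a real setup the anchor would presumably be computed once from a frozen copy of the aligned model and the penalty back-propagated through the model being fine-tuned. The point of the sketch is only that the constraint acts on a single low-dimensional output-side vector rather than on the parameters themselves.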

🔑 Implications and Limitations

• Introduces a new line of defense for LLM safety alignment that exploits a geometric bottleneck rather than the parameter space.
• Provides strong protection against HFT attacks with only a single safety anchor while minimizing degradation of the LLM's performance on benign tasks, which demonstrates its practicality.
• Further research is needed on the optimal number of safety anchors, on how well the defense generalizes across different HFT attack scenarios, and on the potential impact of SBR on the LLM's overall capabilities.