From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

Posted by
  • Haebom

Authors

Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernandez Fisac, Andrea Bajcsy

💡 Overview

Existing AI guardrails focus on blocking harmful content, but they do little to prevent downstream hazards, such as financial or physical harm, that arise from an AI system's complex interactions in real-world settings. This work frames AI safety as a sequential decision problem and applies safety-critical control theory in the latent representation space of the AI model. The result is a control-theoretic guardrail that detects risky outputs in real time and proactively corrects them into safe ones.
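To make the detect-then-correct idea concrete, here is a minimal sketch of a safety filter on a toy one-dimensional system. It is not the paper's implementation: the paper works in a model's latent space, whereas this toy assumes hand-written dynamics (`step`), a hand-written safety score (`safety_value`), a fixed task policy, and a small discrete action set. All names and numbers are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence


def step(x: float, a: float) -> float:
    """Toy dynamics (assumption): the state moves by the chosen action."""
    return x + a


def safety_value(x: float) -> float:
    """Toy safety score (assumption): positive while |x| < 1, i.e. before failure."""
    return 1.0 - abs(x)


def task_policy(x: float) -> float:
    """Toy task policy (assumption): always pushes right, toward the failure region."""
    return 0.25


def safety_filter(
    state: float,
    proposed: float,
    candidates: Sequence[float],
    dynamics: Callable[[float, float], float],
    value: Callable[[float], float],
    margin: float,
) -> float:
    """Keep the proposed action if its predicted next state is safe enough;
    otherwise return the safe candidate closest to the proposal."""

    def is_safe(a: float) -> bool:
        return value(dynamics(state, a)) > margin

    if is_safe(proposed):
        return proposed  # detection step: no intervention needed
    safe = [a for a in candidates if is_safe(a)]
    if safe:
        # correction step: smallest change to the proposal that stays safe
        return min(safe, key=lambda a: abs(a - proposed))
    # no candidate clears the margin: fall back to the safest available action
    return max(candidates, key=lambda a: value(dynamics(state, a)))


def rollout(x0: float, n_steps: int, use_filter: bool = True) -> List[float]:
    """Simulate the toy system, optionally passing each action through the filter."""
    x, xs = x0, [x0]
    for _ in range(n_steps):
        a = task_policy(x)
        if use_filter:
            a = safety_filter(x, a, [-0.25, 0.0, 0.25], step, safety_value, margin=0.25)
        x = step(x, a)
        xs.append(x)
    return xs
```

With the filter on, `rollout(0.0, 4)` yields `[0.0, 0.25, 0.5, 0.5, 0.5]`: the task action is accepted until it would bring the state too close to failure, then it is replaced by the nearest safe alternative. With `use_filter=False` the same policy drives the state past the failure boundary at `|x| = 1`.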

🔑 Implications and Limitations

• A dynamic shift in AI safety: By focusing on an AI system's ongoing interactions and their consequences, the work argues for dynamic, predictive guardrails that go beyond static content blocking.
• General applicability: The guardrail is model-agnostic. It is not tied to a specific AI model and can be applied across a variety of models, which supports scalability.
• Training and validation: Simulation experiments in realistic settings show that the proposed control-theoretic guardrail effectively prevents potentially catastrophic outcomes while maintaining task performance.
• Future work: Improving real-world applicability will require further research and validation across diverse scenarios in complex, unpredictable environments.