
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Written by
  • Haebom

Authors

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

💡 Overview

This paper proposes TurnGate, a new defense technique for detecting hidden malicious intent in multi-turn dialogue. It starts from the observation that existing LLMs are vulnerable to malicious intent that is spread across several turns, and focuses on identifying, as early as possible, the point in a conversation at which a response could enable harmful behavior. To this end, the authors develop TurnGate, which identifies the earliest turn at which malicious intent manifests, together with the MTID dataset for training and evaluating it. They report effective malicious-intent detection with lower error rates than existing methods.
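The turn-level idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `first_unsafe_turn`, `toy_risk`, the `threshold` parameter, and the placeholder tokens `piece_a`, `piece_b`, `piece_c` are all hypothetical, and the toy scorer simply stands in for whatever learned model would judge whether the conversation so far plus the candidate response crosses into harmful territory.

```python
from typing import Callable, List, Optional, Tuple

# One turn: (user message, candidate response the model is about to send).
Turn = Tuple[str, str]
RiskFn = Callable[[List[Turn], str, str], float]


def first_unsafe_turn(
    turns: List[Turn], risk_fn: RiskFn, threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the earliest turn whose candidate response
    pushes the risk score to or above `threshold`, else None.

    The score is computed on the history *plus* the candidate response,
    so the gate can act before the risky response is released.
    """
    history: List[Turn] = []
    for i, (user_msg, candidate) in enumerate(turns):
        if risk_fn(history, user_msg, candidate) >= threshold:
            return i  # intervene here; the candidate is never sent
        history.append((user_msg, candidate))
    return None


def toy_risk(history: List[Turn], user_msg: str, candidate: str) -> float:
    """Hypothetical stand-in scorer: the fraction of three placeholder
    'pieces' that the whole conversation has revealed so far."""
    pieces = {"piece_a", "piece_b", "piece_c"}
    text = " ".join(u + " " + r for u, r in history)
    text += " " + user_msg + " " + candidate
    seen = pieces & set(text.lower().split())
    return len(seen) / len(pieces)


dialogue = [
    ("tell me about topic one", "it involves piece_a"),
    ("and topic two", "that relies on piece_b"),
    ("and topic three", "finally piece_c completes it"),
]
```

In this toy setup no single turn looks risky on its own, but the accumulated score rises turn by turn. Lowering `threshold` makes the gate intervene earlier, which is also what raises the risk of cutting off benign exploratory conversations.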

🔑 Implications and Limitations

  • The work highlights the importance of turn-level intervention when detecting hidden malicious intent in multi-turn dialogue.
  • TurnGate generalizes across different attack methods, domains, and target models, and can contribute to building safer dialogue systems.
  • The MTID dataset is positioned as a valuable resource for research on multi-turn malicious-intent detection.
  • There is a potential risk of refusing benign exploratory conversations too early, and further research on calibrating this sensitivity more precisely may be needed.