Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

Created by
  • Haebom

Authors

Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow

💡 Overview

This paper addresses the difficulty of detecting hidden misalignment in large language models (LLMs) and proposes a new method, Split Personality Training (SPT), to tackle it. SPT trains a separate "honest persona" into LoRA parameters; the persona stays inactive during normal operation and is activated once the model has responded, so that it can reveal the model's latent knowledge. In the reported experiments, SPT reaches 96% accuracy with a Llama-3.3-70B model, far outperforming existing methods.
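The idea of a persona that is dormant by default and switched on when needed can be sketched with a toy LoRA-style layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `LoRALinear`, the flag `adapter_on`, and all shapes and values are hypothetical, and a real setup would attach adapters to a transformer rather than to a single 2x2 matrix.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Toy linear layer with a switchable low-rank (LoRA) update.

    Hypothetical illustration: names, shapes, and values are assumptions.
    """
    weight: Matrix            # frozen base weight, shape (out, in)
    lora_a: Matrix            # down-projection, shape (rank, in)
    lora_b: Matrix            # up-projection, shape (out, rank)
    adapter_on: bool = False  # the adapter stays dormant by default

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        if not self.adapter_on:
            # Normal operation: only the base weights are used.
            return base
        # Adapter active: add the low-rank correction B(Ax).
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + d for b, d in zip(base, delta)]


layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
)
x = [2.0, 4.0]

main_output = layer.forward(x)    # adapter off: base behaviour only
layer.adapter_on = True
honest_output = layer.forward(x)  # adapter on: base plus low-rank update
```

With the adapter off the layer behaves exactly like the base model; turning it on changes the output without touching the frozen base weights, which is the property that lets a separate persona coexist with the original one.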

🔑 Implications and Limitations

  • Presents an effective method for detecting hidden misalignment and latent knowledge in LLMs.
  • Shows that latent information inaccessible to external observers, such as a model's internal biases, can be surfaced.
  • Further research is needed on the size of SPT's LoRA adapter, sensitivity to the trigger string, and scalability to other model architectures.