When Agents Persuade: Propaganda Generation and Mitigation in LLMs

Created by
  • Haebom

Authors

Julia Jose, Ritik Roongta, Rachel Greenstadt

💡 Overview

This study raises the concern that large language models (LLMs) can be misused to generate manipulative propaganda. The authors assign LLMs propaganda objectives and analyze the resulting outputs. The analysis finds that LLMs exhibit propagandistic behavior when prompted to do so and draw on a variety of rhetorical techniques. The study also explores mitigating this tendency through Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO), and finds ORPO to be the most effective of the three.
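The summary names ORPO without describing its objective. As a rough illustration only, the sketch below computes an ORPO-style loss for a single preference pair, following the formulation as it is commonly described: a supervised negative log-likelihood term on the preferred response plus a weighted odds-ratio penalty. The function names, the `lam` weight of 0.1, and the reading of "chosen" as a non-propagandistic response and "rejected" as a propagandistic one are assumptions made for this example; this is not the authors' training code.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is assumed to be the length-normalized (per-token average)
    log-probability of a response, so it must be strictly negative.
    """
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO-style loss for one (chosen, rejected) pair.

    Assumption for this sketch: 'chosen' is the non-propagandistic response
    and 'rejected' is the propagandistic one. lam is a hypothetical weight.
    """
    # Supervised term: negative log-likelihood of the chosen response.
    nll = -avg_logp_chosen
    # Log odds ratio between the chosen and rejected responses.
    log_or = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)); fine for moderate x.
    or_penalty = math.log1p(math.exp(-log_or))
    return nll + lam * or_penalty


# The loss is lower when the model already favors the chosen response.
good = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
bad = orpo_loss(avg_logp_chosen=-2.0, avg_logp_rejected=-0.5)
print(good < bad)
```

Unlike DPO, this formulation needs no separate reference model, since the penalty depends only on the policy's own odds for the two responses.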

🔑 Implications and Limitations

  • LLMs are capable of generating propaganda when deliberately prompted to, which suggests they can be misused when deployed in open environments.
  • Fine-tuning methods, namely Supervised Fine-Tuning (SFT) and the preference-based Direct Preference Optimization (DPO) and ORPO, can effectively reduce an LLM's tendency to generate propaganda.
  • The study demonstrates that propaganda generation by LLMs is feasible and can be mitigated, but further research is needed on the complex ethical and social issues that may arise in real-world deployment and on a broader range of rhetorical techniques.