Sign In

Ask don't tell: Reducing sycophancy in large language models

Created by
  • Haebom
Category
Empty

์ €์ž

Magda Dubois, Cozmin Ududec, Christopher Summerfield, Lennart Luettgau

๐Ÿ’ก ๊ฐœ์š”

๋ณธ ์—ฐ๊ตฌ๋Š” ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLM)์ด ์‚ฌ์šฉ์ž์˜ ์˜๊ฒฌ์— ๋™์กฐํ•˜๋Š” ํ˜„์ƒ(sycophancy)์„ ์ค„์ด๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์ง„์€ ์ž…๋ ฅ๋ฌธ์˜ ํ˜•์‹(์งˆ๋ฌธ vs. ๋น„์งˆ๋ฌธ), ์ธ์‹์  ํ™•์‹ค์„ฑ, ๊ด€์ (๋‚˜ vs. ์‚ฌ์šฉ์ž) ๋“ฑ์˜ ์š”์ธ์ด sycophancy์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ์‹คํ—˜์ ์œผ๋กœ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋น„์งˆ๋ฌธ ํ˜•ํƒœ์˜ ์ž…๋ ฅ์ด sycophancy๋ฅผ ํฌ๊ฒŒ ์ฆ๊ฐ€์‹œํ‚ค๋ฉฐ, ์ธ์‹์  ํ™•์‹ค์„ฑ๊ณผ '๋‚˜' ๊ด€์  ํ”„๋ ˆ์ด๋ฐ์ด ์ด๋ฅผ ์ฆํญ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ”‘ ์‹œ์‚ฌ์  ๋ฐ ํ•œ๊ณ„

โ€ข
LLM์˜ sycophancy๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ์งˆ๋ฌธ ํ˜•์‹์„ ์‚ฌ์šฉํ•˜๊ณ , '๋‚˜' ๊ด€์ ๋ณด๋‹ค๋Š” ์‚ฌ์šฉ์ž ๊ด€์ ์—์„œ ๋‹ต๋ณ€ํ•˜๋„๋ก ์œ ๋„ํ•˜๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.
โ€ข
๋น„์งˆ๋ฌธ ํ˜•ํƒœ์˜ ์ž…๋ ฅ์„ ์งˆ๋ฌธ์œผ๋กœ ์ „ํ™˜ํ•˜๋„๋ก ํ•˜๋Š” ๊ฐ„๋‹จํ•œ ์ง€์‹œ๋งŒ์œผ๋กœ๋„ sycophancy๋ฅผ ์œ ์˜๋ฏธํ•˜๊ฒŒ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋‹จ์ˆœํžˆ 'sycophanticํ•˜์ง€ ๋ง๋ผ'๋Š” ์ง€์‹œ๋ณด๋‹ค ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.
โ€ข
๋ณธ ์—ฐ๊ตฌ๋Š” ๊ฐœ๋ฐœ์ž์™€ ์‚ฌ์šฉ์ž ๋ชจ๋‘ ์‰ฝ๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์‹ค์šฉ์ ์ธ ์ž…๋ ฅ ์ˆ˜์ค€์˜ ์™„ํ™” ์ „๋žต์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
โ€ข
๋‹ค์–‘ํ•œ LLM ๋ชจ๋ธ๊ณผ ๋ณต์žกํ•œ ๋Œ€ํ™” ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ์˜ sycophancy ์™„ํ™” ํšจ๊ณผ์— ๋Œ€ํ•œ ์ถ”๊ฐ€์ ์ธ ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
๐Ÿ‘