Sign In

Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

Created by
  • Haebom
Category
Empty

์ €์ž

Thomas Jiralerspong, Trenton Bricken

๐Ÿ’ก ๊ฐœ์š”

๋ณธ ์—ฐ๊ตฌ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ LLM ์•„ํ‚คํ…์ฒ˜ ๊ฐ„์˜ ๋ชจ๋ธ ์ฐจ์ด๋ฅผ ์‹๋ณ„ํ•˜๋Š” ๋ชจ๋ธ ๋””ํ•‘(model diffing) ๊ธฐ์ˆ ์„ ์ตœ์ดˆ๋กœ ์ ์šฉํ•˜๊ณ , ์ด๋ฅผ ์œ„ํ•ด Dedicated Feature Crosscoders (DFCs)๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. DFCs๋Š” ํŠน์ • ๋ชจ๋ธ์— ๊ณ ์œ ํ•œ ํŠน์ง•์„ ๋ถ„๋ฆฌํ•˜๋Š” ๋ฐ ํšจ๊ณผ์ ์ด๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด ๋น„์ง€๋„ ํ•™์Šต ๋ฐฉ์‹์œผ๋กœ Qwen, Deepseek, Llama, GPT ๋“ฑ ๋‹ค์–‘ํ•œ LLM ๊ฐ„์˜ ์ค‘๊ตญ ๊ณต์‚ฐ๋‹น ๊ด€๋ จ์„ฑ, ๋ฏธ๊ตญ ์ค‘์‹ฌ์ฃผ์˜, ์ €์ž‘๊ถŒ ๊ฑฐ๋ถ€ ๋ฉ”์ปค๋‹ˆ์ฆ˜๊ณผ ๊ฐ™์€ ์˜๋ฏธ ์žˆ๋Š” ํ–‰๋™ ์ฐจ์ด๋ฅผ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ”‘ ์‹œ์‚ฌ์  ๋ฐ ํ•œ๊ณ„

โ€ข
์„œ๋กœ ๋‹ค๋ฅธ LLM ์•„ํ‚คํ…์ฒ˜ ๊ฐ„์˜ ํ–‰๋™ ์ฐจ์ด๋ฅผ ๋น„์ง€๋„ ๋ฐฉ์‹์œผ๋กœ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด ๊ธธ์„ ์—ด์—ˆ์Šต๋‹ˆ๋‹ค.
โ€ข
DFCs๋Š” ๋ชจ๋ธ ๋””ํ•‘ ์—ฐ๊ตฌ์˜ ์ ์šฉ ๋ฒ”์œ„๋ฅผ ๋„“ํžˆ๊ณ , LLM์˜ ์•ˆ์ „ ๋ฐ ํŽธํ–ฅ์„ฑ ๋ฌธ์ œ๋ฅผ ๋” ํšจ๊ณผ์ ์œผ๋กœ ์ดํ•ดํ•˜๋Š” ๋ฐ ๊ธฐ์—ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
โ€ข
๋ฐœ๊ฒฌ๋œ ํŠน์ง•๋“ค์ด ์‹ค์ œ ๋ชจ๋ธ์˜ ๋ชจ๋“  ํ–‰๋™์„ ๋Œ€ํ‘œํ•˜๋Š”์ง€๋Š” ์ถ”๊ฐ€์ ์ธ ๊ฒ€์ฆ์ด ํ•„์š”ํ•˜๋ฉฐ, DFCs์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์‹ฌ์ธต์ ์ธ ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
๐Ÿ‘