
Feature Starvation as Geometric Instability in Sparse Autoencoders

Posted by
  • Haebom

Authors

Faris Chaudhry, Keisuke Yano, Anthea Monod

💡 Overview

This paper identifies "feature starvation" in sparse autoencoders (SAEs) as a fundamental pathology caused by geometric instability in the optimization. Existing $\ell_1$-regularized SAEs rely on inefficient techniques to work around the problem. To improve on this, the authors propose AEN-SAE, which combines $\ell_2$ regularization with adaptive $\ell_1$ reweighting. In theory, AEN-SAE guarantees a stable sparse coding map; in experiments, it resolves feature starvation without auxiliary techniques and achieves strong reconstruction performance.
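To make the combination of $\ell_2$ regularization and adaptive $\ell_1$ reweighting concrete, here is a minimal NumPy sketch of a generic adaptive elastic net penalty and its closed-form proximal step. It is not taken from the paper: the function names, the weight rule `1 / (|z_ref| + eps) ** gamma`, and the parameters `lam1`, `lam2`, `gamma`, and `eps` are all illustrative assumptions, and the actual AEN-SAE objective may differ.

```python
import numpy as np


def adaptive_weights(z_ref, gamma=1.0, eps=1e-6):
    """Hypothetical reweighting rule: latents with small reference
    activations receive a larger l1 weight."""
    return 1.0 / (np.abs(z_ref) + eps) ** gamma


def adaptive_elastic_net_penalty(z, w, lam1, lam2):
    """lam1 * sum_j w_j * |z_j| + (lam2 / 2) * ||z||_2^2."""
    return lam1 * np.sum(w * np.abs(z)) + 0.5 * lam2 * np.sum(z ** 2)


def prox_adaptive_elastic_net(v, w, lam1, lam2):
    """Closed-form minimizer over z of
    0.5 * ||z - v||^2 + adaptive_elastic_net_penalty(z, w, lam1, lam2).
    Weighted soft-thresholding followed by a uniform shrink."""
    return np.sign(v) * np.maximum(np.abs(v) - lam1 * w, 0.0) / (1.0 + lam2)


v = np.array([2.0, -0.5, 0.1])
w = np.ones_like(v)
codes = prox_adaptive_elastic_net(v, w, lam1=0.5, lam2=1.0)
```

The division by `1 + lam2` comes from the $\ell_2$ term, which makes this per-sample problem strongly convex, so its minimizer is unique. The per-latent weights `w` let the $\ell_1$ threshold differ across latents instead of applying one global value.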

🔑 Implications and Limitations

• The paper shows theoretically that feature starvation in $\ell_1$-regularized SAEs stems not from a mere lack of data diversity but from a fundamental geometric instability in the optimization process.
• The proposed AEN-SAE resolves feature starvation through a fully differentiable architecture and improves on existing approaches in both efficiency and performance.
• Further validation is needed of AEN-SAE's scalability and performance on complex LLM architectures and a wider range of tasks.
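One simple way to check for feature starvation in practice is to count latents that never activate on a sample of data. The sketch below assumes that starvation shows up as such "dead" latents; the function name `dead_latent_fraction` and the threshold `tol` are hypothetical and not taken from the paper.

```python
import numpy as np


def dead_latent_fraction(codes, tol=0.0):
    """Fraction of latents whose activation magnitude never exceeds tol.

    codes: array of shape (n_samples, n_latents) holding SAE activations.
    """
    active = (np.abs(codes) > tol).any(axis=0)  # True if a latent ever fires
    return 1.0 - active.mean()


# Toy activations: latents 0 and 2 never fire across the two samples.
toy_codes = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0, 3.0]])
fraction = dead_latent_fraction(toy_codes)
```

On real data the result depends on how many samples are checked, since a rarely active latent can look dead in a small batch.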