Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

Created by
  • Haebom

Authors

Ruinan Jin, Yingbin Liang, Shaofeng Zou

💡 Overview

This paper gives a theoretical explanation for why the Adam optimizer empirically converges faster than SGD. By analyzing the second-moment normalization built into Adam, the authors prove that Adam achieves stronger high-probability convergence guarantees than SGD. Specifically, Adam's bound depends on the confidence parameter $\delta$ as $\delta^{-1/2}$, whereas SGD requires a dependence of at least $\delta^{-1}$.
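To get a feel for the gap between the two dependencies, the short sketch below evaluates only the $\delta$-dependent factors $\delta^{-1/2}$ and $\delta^{-1}$ at a few confidence levels. The function names are hypothetical, and all other constants and the iteration-count terms of the actual bounds are omitted, so the numbers show relative scaling only, not the paper's bounds.

```python
def adam_tail_factor(delta: float) -> float:
    """delta^{-1/2}: the delta-dependence the summary attributes to Adam."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return delta ** -0.5


def sgd_tail_factor(delta: float) -> float:
    """delta^{-1}: the minimum delta-dependence the summary attributes to SGD."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return delta ** -1.0


if __name__ == "__main__":
    # Smaller delta means a higher required confidence level 1 - delta.
    for delta in (0.1, 0.01, 0.001):
        a = adam_tail_factor(delta)
        s = sgd_tail_factor(delta)
        print(f"delta={delta:<6} Adam factor={a:8.2f}  SGD factor={s:8.2f}  ratio={s / a:6.2f}")
```

The ratio between the two factors is itself $\delta^{-1/2}$, so the gap widens as the required confidence increases: at $\delta = 0.01$ the SGD factor is 100 while the Adam factor is 10.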

🔑 Implications and Limitations

• Provides a theoretical proof that Adam's distinctive second-moment normalization yields stronger high-probability convergence guarantees than SGD.
• Shows that Adam has a theoretical advantage over SGD in its dependence on $\delta$, which helps explain the empirical performance gap.
• The analysis is carried out under the classical bounded variance model, so extending it to more complex or realistic data distributions requires further research.