
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

Posted by
  • Haebom

Authors

Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou

💡 Overview

This paper proposes an analytical provisioning framework for optimizing the performance of Attention-FFN Disaggregation (AFD), an architecture that separates Attention and FFN computation in LLM serving. It targets the performance bottlenecks that arise under dynamic workloads from KV cache growth, variation in request length, and synchronization among Attention workers. The framework determines the optimal A/F ratio from a single statistic $\theta$, and simulation results are used to validate it.

🔑 Implications and Limitations

• Provides a theoretical foundation for optimizing the performance of Attention-FFN disaggregated architectures in LLM serving.
• Presents an analytical provisioning method that accounts for dynamic workloads and synchronization overhead, which can improve resource efficiency.
• Currently focuses on the $r$A--$1$F topology; extending the analysis to more complex network topologies remains future work.