
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

Written by
  • Haebom

Authors

Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang

💡 Overview

This paper addresses the fundamental tension between latency service-level objectives (SLOs) and GPU memory constraints in large language model (LLM) inference. The proposed system, SuperInfer, is optimized for Superchip architectures in which the GPU and CPU are tightly coupled through NVLink-C2C. It introduces two components: RotaSched, a proactive, SLO-aware rotary scheduler, and DuplexKV, which supports full-duplex transfers over NVLink-C2C. Together they keep the system responsive at high request rates and substantially improve SLO attainment.
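The summary does not spell out RotaSched's actual policy, so the following is only a toy sketch of what "SLO-aware rotation" could look like, not the authors' algorithm. It assumes a fixed number of GPU slots and ranks requests by the time left before their TTFT deadline; requests that do not fit are treated as offloaded to CPU memory. The `Request` fields, the `rotate` function, and the slack-based ordering are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float               # seconds; when the request arrived
    ttft_slo: float              # seconds allowed until the first token (assumed field)
    first_token_done: bool = False

    def slack(self, now: float) -> float:
        """Time left before the TTFT deadline; negative means already late."""
        return (self.arrival + self.ttft_slo) - now


def rotate(requests, gpu_slots, now):
    """Split requests into (on_gpu, offloaded) for the next scheduling round.

    Assumption: requests still waiting for their first token are ranked ahead
    of requests that are already decoding, and ties within each group are
    broken by smallest slack. A real scheduler would also have to account
    for TBT targets and KV-cache transfer cost, which this sketch ignores.
    """
    ranked = sorted(requests, key=lambda r: (r.first_token_done, r.slack(now)))
    return ranked[:gpu_slots], ranked[gpu_slots:]


# Example round with two GPU slots at t = 0.6 s.
reqs = [
    Request("a", arrival=0.0, ttft_slo=1.0),
    Request("b", arrival=0.5, ttft_slo=0.2),
    Request("c", arrival=0.0, ttft_slo=5.0, first_token_done=True),
]
on_gpu, offloaded = rotate(reqs, gpu_slots=2, now=0.6)
print([r.rid for r in on_gpu], [r.rid for r in offloaded])  # ['b', 'a'] ['c']
```

In this round, request `b` has the least slack and is scheduled first, while the already-decoding request `c` is rotated out. In a design like the one the summary describes, that rotation is where a full-duplex link would matter, since KV data for outgoing and incoming requests could move in both directions at once.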

🔑 Implications and Limitations

• Fully exploiting high-performance hardware architectures such as Superchips requires co-designing SLO-aware scheduling and memory management.
• With RotaSched and DuplexKV, the paper demonstrates that an LLM inference system can meet strict TTFT SLOs while maintaining TBT and throughput performance.
• The proposed approach is optimized for a specific Superchip architecture, so further research is needed on how well it generalizes to other hardware environments.