
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment

Created by
  • Haebom

Authors

Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

💡 Overview

This paper proposes a design methodology for on-device large language models (OD-LLMs) optimized for resource-constrained mobile environments, with the goal of enabling real-time AI experiences. By combining hardware-in-the-loop architecture search with attention skipping in place of the attention mechanism, the authors develop models that achieve low latency and high quality at the same time. The result is the MobileLLM-Flash model family, which can be deployed at industry scale, is compatible with standard mobile runtimes, and runs inference up to 1.8x faster on mobile CPUs than existing models.
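To make the idea of latency-guided design with attention skipping more concrete, here is a minimal sketch, not the paper's actual procedure. It assumes a toy model where each layer either runs attention or skips it, and it exhaustively picks the skip pattern with the best quality score that fits a latency budget. The per-layer costs, `estimate_latency_ms`, `quality_proxy`, and `search_skip_pattern` are all hypothetical names and numbers introduced for illustration; a real hardware-in-the-loop setup would replace the latency estimate with measurements on the target device and the quality proxy with an actual evaluation.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical per-layer costs in milliseconds. A real hardware-in-the-loop
# setup would time candidates on the target device instead of assuming these.
ATTN_MS = 1.0
FFN_MS = 2.0

SkipPattern = Tuple[bool, ...]  # True at index i means layer i skips attention


def estimate_latency_ms(skip: Sequence[bool]) -> float:
    """Stand-in for an on-device measurement of one forward pass."""
    return sum(FFN_MS + (0.0 if s else ATTN_MS) for s in skip)


def quality_proxy(skip: Sequence[bool]) -> float:
    """Toy score (an assumption): each skip costs quality, adjacent skips cost more."""
    adjacent = sum(1 for a, b in zip(skip, skip[1:]) if a and b)
    return -float(sum(skip)) - 0.5 * adjacent


def search_skip_pattern(
    num_layers: int,
    budget_ms: float,
    latency_fn: Callable[[Sequence[bool]], float],
    quality_fn: Callable[[Sequence[bool]], float],
) -> Optional[Tuple[float, SkipPattern]]:
    """Return (score, pattern) for the best pattern within budget, or None."""
    best: Optional[Tuple[float, SkipPattern]] = None
    for k in range(num_layers + 1):
        for skipped in combinations(range(num_layers), k):
            pattern = tuple(i in skipped for i in range(num_layers))
            if latency_fn(pattern) > budget_ms:
                continue  # violates the latency budget, discard
            score = quality_fn(pattern)
            if best is None or score > best[0]:
                best = (score, pattern)
    return best
```

With these toy numbers, calling `search_skip_pattern(4, 10.0, estimate_latency_ms, quality_proxy)` selects a pattern that skips attention in two non-adjacent layers, since that is the cheapest way to get from 12 ms down to the 10 ms budget under the assumed scoring. Exhaustive enumeration is used only because the example is tiny; it does not scale to realistic search spaces.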

🔑 Implications and Limitations

• Presents a practical methodology for designing OD-LLMs that deliver real-time performance in mobile environments.
• Reuses the weights of pretrained models and applies an efficient search process, reducing development cost while maintaining high accuracy.
• Provides actionable principles for OD-LLM design, contributing to future research and development.
• The reported performance characteristics are tied to the specific hardware and runtimes used in the study and may differ elsewhere; generalization to a broader range of hardware and runtimes still needs to be validated.