
Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

Created by
  • Haebom

Authors

Nenad Banfic, David Fan, Kunal Vaishnavi, Sam Kemp, Sunghoon Choi, Rui Ren, Sayan Shaw, Meng Tang

💡 Overview

This paper aims to develop a model for high-quality on-device streaming automatic speech recognition (ASR) that runs on a CPU without GPU acceleration. After comparing a range of recent ASR architectures, the authors find that NVIDIA Nemotron Speech Streaming is the best fit for real-time English streaming on low-spec hardware. By applying ONNX Runtime-based optimization techniques, they reduce the model size from 2.47 GB to 0.67 GB while keeping the word error rate (WER) within 1% of the original model.
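To put the reported size reduction in perspective, the short sketch below works out the compression ratio and the percentage saved from the two figures quoted above (2.47 GB and 0.67 GB). The function name `compression_stats` is a hypothetical helper introduced only for this illustration; it is not part of the paper or of ONNX Runtime.

```python
# Model sizes as reported in the summary (gigabytes).
ORIGINAL_GB = 2.47
QUANTIZED_GB = 0.67


def compression_stats(original: float, compressed: float) -> tuple[float, float]:
    """Return (compression ratio, percent size reduction) for two sizes.

    Hypothetical helper for illustration; both sizes must be positive.
    """
    if original <= 0 or compressed <= 0:
        raise ValueError("sizes must be positive")
    ratio = original / compressed
    reduction_pct = (1 - compressed / original) * 100
    return ratio, reduction_pct


ratio, reduction_pct = compression_stats(ORIGINAL_GB, QUANTIZED_GB)
print(f"{ratio:.2f}x smaller, {reduction_pct:.1f}% reduction")
# prints "3.69x smaller, 72.9% reduction"
```

In other words, the optimized model is roughly 3.7 times smaller than the original, a reduction of about 73% in size.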

🔑 Implications and Limitations

• Demonstrates that real-time, high-quality ASR is achievable without a GPU, even on low-spec edge devices.
• Presents a systematic comparison of a range of ASR models and optimization techniques, which helps guide future research directions.
• The proposed int4 k-quant model achieves a low average streaming WER of 8.20% and an algorithmic latency of 0.56 seconds, setting a new reference point for accuracy and efficiency.
• The study focuses on English streaming ASR; performance in other languages or in complex acoustic environments may require further validation.
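Because the results above are stated in terms of WER, a minimal sketch of how word error rate is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. This is the standard textbook definition, not the paper's evaluation pipeline; the function name `word_error_rate` and the example sentences are assumptions made for illustration, and real evaluations usually normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch: tokenizes on whitespace and applies no text
    normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {example:.2%}")
# prints "WER = 33.33%"
```

Under this definition, a WER of 8.20% corresponds to roughly 8 word-level errors per 100 reference words.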