Moonshot AI unveils Mooncake, the LLM serving platform behind Kimi
Haebom
China's leading AI model developers include Zhipu, Moonshot, MiniMax, and Baichuan. Among them, Moonshot has proven its performance and standing, with a corporate valuation exceeding 33 trillion won. Moonshot AI has never disclosed details of its own language model, such as its parameter count, but it recently released a service called Kimi, which, like ChatGPT, offers LLM-based chat and search at scale.
Moonshot is a company whose products I use often and follow closely, and this time it published an interesting paper. The paper discloses Mooncake, the platform that underpins its LLM services. The live service it actually supports is, of course, Kimi, and Moonshot now appears to be adopting a strategy of selling the platform in a B2B format as well.
The core of Mooncake lies in its KVCache-centric distributed architecture. This architecture separates the prefill and decoding stages and implements a distributed cache utilizing CPU, DRAM, and SSD resources. In particular, the prefill node pool is optimized to handle long contexts effectively: nodes can be divided for parallel processing, and VRAM usage is minimized through a layer-by-layer prefill technique.
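To make the idea concrete, here is a minimal sketch of the disaggregated design, assuming a simple tiered DRAM/SSD cache keyed by prefix hashes. All names, data structures, and the eviction policy are illustrative assumptions on my part, not Moonshot's actual implementation.

from __future__ import annotations
from dataclasses import dataclass

# Conceptual sketch (not Moonshot's code) of the KVCache-centric
# disaggregated design: prefill and decode run on separate node pools,
# and KV cache blocks live in a tiered store (DRAM -> SSD).

@dataclass
class KVBlock:
    prefix_hash: str   # identifies the token prefix this block encodes
    data: bytes        # serialized key/value tensors (placeholder)

class TieredKVCache:
    """Distributed cache over DRAM and SSD; names are illustrative."""
    def __init__(self, dram_capacity: int):
        self.dram: dict[str, KVBlock] = {}
        self.ssd: dict[str, KVBlock] = {}   # stand-in for an SSD-backed store
        self.dram_capacity = dram_capacity

    def put(self, block: KVBlock) -> None:
        if len(self.dram) >= self.dram_capacity:
            # Demote an arbitrary block from DRAM to SSD (a real system
            # would use LRU or a reuse-aware policy).
            victim_key, victim = self.dram.popitem()
            self.ssd[victim_key] = victim
        self.dram[block.prefix_hash] = block

    def get(self, prefix_hash: str) -> KVBlock | None:
        if prefix_hash in self.dram:
            return self.dram[prefix_hash]
        if prefix_hash in self.ssd:          # promote back to DRAM on hit
            block = self.ssd.pop(prefix_hash)
            self.put(block)
            return block
        return None

def prefill_node(prompt_tokens: list[str], cache: TieredKVCache) -> KVBlock:
    """Prefill pool: computes the KV cache for a prompt, reusing cached prefixes."""
    prefix_hash = "|".join(prompt_tokens)
    cached = cache.get(prefix_hash)
    if cached is not None:
        return cached                        # cache hit: skip recomputation
    block = KVBlock(prefix_hash, data=b"...")  # stand-in for real attention KV
    cache.put(block)
    return block

def decode_node(block: KVBlock, max_new_tokens: int) -> list[str]:
    """Decode pool: generates tokens from the transferred KV cache."""
    return [f"tok{i}" for i in range(max_new_tokens)]  # placeholder generation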
Another strength of Mooncake is its KVCache-centric scheduling algorithm. The algorithm maximizes cache reuse and tunes batch sizes to raise the model's FLOP utilization, and it improves overall system efficiency by balancing cache hit rate against instance load, as sketched below.
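The following is a hedged sketch of what such a cache-aware placement decision could look like: route each request to the instance that maximizes reusable prefix length while penalizing instances that are already loaded. The scoring function and weight are my assumptions, not values from the paper.

from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    cached_prefixes: set[str]   # hashes of prefixes this instance holds
    queued_tokens: int          # proxy for current load

def prefix_hashes(tokens: list[str]) -> list[str]:
    """All prefix hashes of the request, longest first."""
    return ["|".join(tokens[:i]) for i in range(len(tokens), 0, -1)]

def best_instance(tokens: list[str], instances: list[Instance],
                  load_weight: float = 0.01) -> Instance:
    """Pick the instance balancing cache reuse against queued work."""
    def score(inst: Instance) -> float:
        hit_len = 0
        for i, h in enumerate(prefix_hashes(tokens)):
            if h in inst.cached_prefixes:
                hit_len = len(tokens) - i   # length of longest cached prefix
                break
        return hit_len - load_weight * inst.queued_tokens
    return max(instances, key=score)

A pure cache-greedy policy would pile requests onto whichever instance happens to hold popular prefixes; the load penalty is what keeps hit rate and instance load in balance.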
Mooncake's response to overload situations is also noteworthy. It introduces a prediction-based early-rejection policy that minimizes wasted resources through system-level load prediction. This strategy contributes significantly to keeping the system stable even under rapid load increases.
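A minimal sketch of the early-rejection idea, under assumptions of my own (the load model, thresholds, and numbers are illustrative): before admitting a request, estimate whether the latency budget can still be met; if not, reject immediately rather than spend prefill work on a request that would be dropped later anyway.

def predicted_decode_load(active_requests: int, avg_remaining_tokens: float,
                          tokens_per_second: float) -> float:
    """Rough estimate of seconds of decode work already committed."""
    return active_requests * avg_remaining_tokens / tokens_per_second

def admit(request_tokens: int, active_requests: int,
          avg_remaining_tokens: float = 128.0,
          tokens_per_second: float = 2000.0,
          slo_seconds: float = 10.0) -> bool:
    backlog = predicted_decode_load(active_requests, avg_remaining_tokens,
                                    tokens_per_second)
    # Early rejection: if the predicted backlog already exceeds the SLO
    # budget, refuse the request at admission time.
    return backlog + request_tokens / tokens_per_second <= slo_seconds

# Example: with 200 active requests the predicted backlog (~12.8 s)
# exceeds the assumed 10 s budget, so new requests are rejected early.
print(admit(request_tokens=1000, active_requests=200))  # False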
Experimental results show that Mooncake increases throughput by up to 525% compared to existing methods and can handle 75% more requests under real workloads. Its effectiveness has been demonstrated on datasets such as ArXiv Summarization and L-Eval. What is unfortunate is that the comparison was made against LLaMA-2 70B. Why? While wondering about this, it also occurred to me that the paper itself was published belatedly. (It is common to deliberately delay technology disclosures.)
The development of Mooncake is a meaningful technological advance that substantially improves the efficiency and performance of LLM services. It is significant in that it effectively solves practical problems such as processing long contexts and responding to overload. There is also room for further development, for example by utilizing heterogeneous accelerators or improving KVCache compression techniques.
In conclusion, Mooncake addresses the major problems of LLM serving through an innovative KVCache-centric architecture. The separation of the prefill and decoding stages, efficient cache management, and an intelligent overload-response strategy are important technological advances that can greatly improve the scalability and efficiency of LLM services.
The distributed cache implementation across CPU, DRAM, and SSD resources is fascinating in itself, as is the fact that the service-level objectives (SLOs) were met through it. If you use Kimi, it is actually really good, especially for finding articles or information related to China. I would describe it as feeling like a combination of ChatGPT and Perplexity.
    Haebom
    Since many people are curious about Kimi's level, to give a rough idea: it is at least comparable to Bing Chat.