Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model

Created by
  • Haebom

Author

Deepak Kumar, Divakar Yadav, Yash Patel

Outline

This paper presents a comparative evaluation of the Mixture-of-Experts (MoE) based GPT-OSS-20B model (20.9 billion total parameters, about 3.6 billion active) against the dense models Qwen3-32B and Yi-34B on a single GPU (H100, bf16). The metrics are time to first token (TTFT), time per output token (TPOT), end-to-end latency percentiles, peak VRAM usage (including the past key-value cache), and energy consumption, all collected with a consistent nvidia-smi-based sampler.

Under a 2048-token context with 64 decoded tokens, GPT-OSS-20B delivers higher decoding throughput and better per-token energy efficiency than Qwen3-32B and Yi-34B, while significantly reducing peak VRAM usage and energy per 1,000 generated tokens. Activating only 17.3% of its parameters (3.6 billion of 20.9 billion), it achieves roughly 31.8% higher decoding throughput and 25.8% lower energy per 1,000 generated tokens than Qwen3-32B under the 2048/64 setting, while also cutting peak VRAM usage by 31.7%. Its TTFT is higher, however, due to MoE routing overhead.

When results are normalized by the number of active parameters, GPT-OSS-20B shows markedly higher per-active-parameter efficiency (APE), underscoring the deployment benefits of the MoE architecture. The study focuses on deployment characteristics and does not evaluate accuracy. The code and consolidated results are released publicly to support reproducibility and extension.
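The measurement pipeline is only described at a high level above. Purely as an illustration, the sketch below shows one way an nvidia-smi-based sampler could record power and memory during a generation run and derive throughput, peak VRAM, energy per 1,000 generated tokens, and a per-active-parameter efficiency figure. The helper names (sample_gpu, run_benchmark, ape) and the generate_fn callback are hypothetical and not taken from the paper's released code.

```python
# Illustrative sketch only; not the paper's released benchmark code.
import subprocess
import threading
import time

def sample_gpu(samples, stop, interval=0.1):
    """Poll nvidia-smi for power draw (W) and used memory (MiB) until stopped."""
    while not stop.is_set():
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=power.draw,memory.used",
             "--format=csv,noheader,nounits"]
        ).decode()
        power_w, mem_mib = (float(x) for x in out.strip().split(","))
        samples.append((time.time(), power_w, mem_mib))
        time.sleep(interval)

def run_benchmark(generate_fn, prompt, max_new_tokens=64):
    """Run one generation while sampling the GPU; generate_fn is a hypothetical
    callback that performs decoding and returns the number of generated tokens."""
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_gpu, args=(samples, stop))
    sampler.start()

    t0 = time.time()
    n_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.time() - t0

    stop.set()
    sampler.join()

    # Trapezoidal integration of power samples -> energy in joules.
    energy_j = sum(
        0.5 * (p1 + p2) * (t2 - t1)
        for (t1, p1, _), (t2, p2, _) in zip(samples, samples[1:])
    )
    peak_vram_mib = max(m for _, _, m in samples)
    throughput = n_tokens / elapsed              # tokens/s (prefill + decode)
    energy_per_1k = energy_j / n_tokens * 1000   # J per 1,000 generated tokens
    return throughput, peak_vram_mib, energy_per_1k

def ape(throughput_tok_s, active_params_b):
    """Per-active-parameter efficiency: tokens/s per billion active parameters
    (e.g. ~3.6B active for GPT-OSS-20B vs. 32B for a dense 32B model)."""
    return throughput_tok_s / active_params_b
```

Normalizing by active rather than total parameters is what makes the MoE comparison favorable here: GPT-OSS-20B's throughput is divided by its roughly 3.6 billion active parameters, while a dense model's throughput is divided by its full parameter count.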

Takeaways, Limitations

Takeaways:
  • The MoE-based GPT-OSS-20B delivers higher decoding throughput and energy efficiency than comparable dense models.
  • This suggests that inference performance on par with or better than dense models is achievable while activating only a fraction of the parameters.
  • It provides meaningful empirical evidence on the deployment efficiency of the MoE architecture.
  • The released code and results enable reproducibility and follow-up research.
Limitations:
  • Because accuracy is not evaluated, the model's overall quality cannot be judged from this study alone.
  • The evaluation uses a single GPU, so behavior when scaling to multi-GPU deployments remains unknown.
  • MoE routing overhead shows up in the TTFT metric, and further investigation of this cost is needed.