Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Estimating Worst-Case Frontier Risks of Open-Weight LLMs

Created by
  • Haebom

Author

Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch

Outline

This paper investigates the worst-case risks of releasing the open-weight gpt-oss model. To elicit gpt-oss's maximum capabilities in the biological and cybersecurity domains, the authors apply malicious fine-tuning (MFT). To maximize biological risk, they curate threat-related tasks and train gpt-oss with reinforcement learning in a web-browsing environment; to maximize cybersecurity risk, they train gpt-oss in an agentic coding environment to solve Capture-The-Flag (CTF) challenges. The MFT model is then compared against other open- and closed-weight large language models. Against closed models, MFT gpt-oss underperforms OpenAI o3, a model that itself falls below the Preparedness High capability level, on both biological and cybersecurity risk. Against open models, gpt-oss marginally raises biological capabilities but does not significantly advance the frontier. These results informed the model's release decision, and the authors hope the MFT approach will serve as useful guidance for assessing the risks of future open-weight model releases.

Takeaways, Limitations

Takeaways: The paper presents a novel approach to assessing the potential risks of open-weight large language models via malicious fine-tuning (MFT). MFT enables more accurate estimates of real-world risk levels and can inform safe release strategies. The results provide useful input for decisions about releasing open-weight models.
Limitations: Risk levels estimated via MFT may not fully reflect real-world risk. Because of the limited range of tasks and environments used in the evaluation, risk in real-world settings may be under- or overestimated. Further research covering more diverse and realistic scenarios is needed.