Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Created by
  • Haebom

Authors

Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao

Outline

To address the lack of high-quality pre-training data in the cybersecurity field, we present a comprehensive dataset covering the key training stages: pre-training, instruction fine-tuning, and reasoning distillation. Extensive analytical studies demonstrate the dataset's effectiveness on public cybersecurity benchmarks: continual pre-training on the dataset yields a 15.9% improvement in aggregate scores, and reasoning distillation yields a 15.8% improvement on a security certification benchmark (CISSP). To encourage research, we release the entire dataset and the trained cybersecurity LLM under the Open Data Commons Attribution License (ODC-BY) and the MIT License.

Takeaways, Limitations

Provides high-quality open-source datasets for cybersecurity LLM research.
Demonstrates improved cybersecurity benchmark performance through continual pre-training and reasoning distillation.
Improves research accessibility by making all datasets and model weights public.
Specific limitations of the paper are not provided.