Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Semantic Preprocessing for LLM-based Malware Analysis

Created by
  • Haebom

Author

Benjamin Marais, Tony Quertier, Grégoire Barrue

Outline

This paper argues that existing AI-based approaches to malware analysis focus on data representation (images, sequences) and neglect the expert's perspective. To address this, the authors propose a preprocessing method built around expert knowledge that improves the semantic analysis of malware and the interpretability of results. The method generates a JSON report for each Portable Executable (PE) file. The report gathers features extracted by static and dynamic analysis and enriches them with knowledge from packer signature detection, MITRE ATT&CK, and the Malware Behavior Catalog (MBC). The goal is a semantic representation of the binary that malware analysts can read and that makes AI models for malware analysis more explainable. Using this preprocessing, the authors trained a large language model for malware classification and report a weighted average F1-score of 0.94 on a complex dataset representative of market reality.
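To make the idea of a per-file JSON report more concrete, here is a minimal sketch in Python. It is not the authors' schema: the class name `PEReport`, every field name, and all sample values are assumptions chosen for illustration. The summary only states that the report combines static and dynamic features with packer signatures, MITRE ATT&CK, and MBC knowledge.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PEReport:
    """Hypothetical report layout; field names are illustrative only."""
    sha256: str
    static_features: dict = field(default_factory=dict)    # e.g. imports, sections
    dynamic_features: dict = field(default_factory=dict)   # e.g. sandbox observations
    packer_signatures: list = field(default_factory=list)  # detected packer names
    attack_techniques: list = field(default_factory=list)  # MITRE ATT&CK entries
    mbc_behaviors: list = field(default_factory=list)      # Malware Behavior Catalog entries

    def to_json(self) -> str:
        # Sorted keys keep the text stable when it is later fed to a language model.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values, not taken from the paper or from a real sample.
report = PEReport(
    sha256="0" * 64,
    static_features={"imports": ["kernel32.dll!VirtualAllocEx"], "num_sections": 4},
    dynamic_features={"processes_created": 1},
    packer_signatures=["UPX"],
    attack_techniques=[{"id": "T1055", "name": "Process Injection"}],
    mbc_behaviors=[{"name": "Debugger Detection"}],
)

report_text = report.to_json()
print(report_text)
```

A flat, human-readable layout like this is one way to keep the model input inspectable by an analyst; the actual structure used in the paper may differ.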

Takeaways, Limitations

Takeaways:
A novel preprocessing methodology that leverages expert knowledge: JSON reports for PE files give malware analysis a richer semantic representation.
Better explainability: integrating expert knowledge makes the AI model's results easier to interpret.
High performance: a weighted average F1-score of 0.94 on a complex dataset.
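For readers unfamiliar with the metric, a weighted average F1-score is the per-class F1 averaged with each class weighted by its number of true samples. The sketch below computes it in plain Python; the function name `weighted_f1` and the toy labels are assumptions for illustration and have no connection to the paper's dataset or evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels (hypothetical class names).
y_true = ["trojan", "trojan", "trojan", "benign"]
y_pred = ["trojan", "trojan", "benign", "benign"]
result = weighted_f1(y_true, y_pred)
print(round(result, 4))  # 0.7667
```

Here the "trojan" class has F1 = 0.8 with support 3 and "benign" has F1 = 2/3 with support 1, so the weighted average is 0.75 × 0.8 + 0.25 × 2/3 ≈ 0.7667. Weighting by support means frequent classes dominate the score, which is worth keeping in mind when reading a single headline figure for an imbalanced dataset.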
Limitations:
The information provided is not detailed enough to understand how the preprocessing method is actually implemented.
Limited extensibility to malware formats other than PE (e.g., shellcode, scripts).
Limited generalizability, since the results were obtained on a specific dataset.