PyTorch implementation of Malware Detection by Eating a Whole EXE, Learning the PE Header, Malware Detection with Minimal Domain Knowledge, and other derived models for malware detection.
All model checkpoints are available at assets/checkpoints.
- Clone this repository via
$ git clone https://github.com/jaketae/deep-malware-detection.git
$ cd deep-malware-detection
- Create a Python virtual environment and install dependencies.
$ python -m venv venv
$ source venv/bin/activate
$ pip install -U pip wheel # update pip
$ pip install -r requirements.txt
- Prepare PE files.
src/bin provides scrapers to download malware. For instance, to download files from Dasmalwerk, run
$ python -m src.bin.dasmalwerk
By default, this downloads the files to the raw folder in the repository's root directory.
- Train the model.
$ cd src/deep_malware_detection
$ python train.py --benign_dir=YOUR_PATH_TO_BENIGN --malware_dir=YOUR_PATH_TO_MALWARE
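As a rough illustration of what the two directory arguments imply, the sketch below gathers files from a benign directory and a malware directory and pairs each path with a binary label. It is not taken from train.py; the function name `list_labeled_files` and the label convention (0 = benign, 1 = malware) are assumptions made for this example.

```python
from pathlib import Path


def list_labeled_files(benign_dir, malware_dir):
    """Return (path, label) pairs, benign files first.

    Label convention (an assumption): 0 = benign, 1 = malware.
    """
    samples = []
    for directory, label in ((benign_dir, 0), (malware_dir, 1)):
        # Sort for a deterministic order within each directory
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():  # skip subdirectories
                samples.append((path, label))
    return samples
```

Keeping the two classes in separate directories means the label never has to be inferred from the file name or contents.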
This project was developed in late 2020, and unfortunately I lost access to the server where I collected data and ran experiments. While replicating all training data exactly may be infeasible, here are some resources for data collection.
- Wikidll.com: Online website with downloadable benign .dll files. Scraper.
- Dasmalwerk: Online website with downloadable malware for research. Scraper.
- Malshare.com: Online website with downloadable malware for research. Scraper.
- EMBER: Open dataset for malware detection research.
- Kaggle dataset: PE file dataset available on Kaggle, including both benign and malicious files.
- While Raff et al. used LSTMs for the sequential model, we tested both GRUs and LSTMs and found that GRUs were easier to train.
- We combined the models presented in the two papers to derive a custom model that concatenates the feature vector produced by the entry-point 1D-CNN layer with the output of the RNN units that follow. We denote these custom models with a "Res" prefix in the table below.
- We also extended the attention-based model in Raff et al. with this residual approach.
- Due to computational constraints, we used only the first 4096 bytes of each PE file, yielding a 4096-dimensional sequential feature vector for every file.
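The truncation described in the last point can be sketched as follows. This is a minimal illustration rather than the repository's actual loader: the function name `read_header_bytes`, the `max_len` and `pad_value` parameters, and the zero-padding of short files are all assumptions.

```python
MAX_LEN = 4096  # number of leading bytes kept per file, as stated above


def read_header_bytes(path, max_len=MAX_LEN, pad_value=0):
    """Return the first `max_len` bytes of a file as a list of ints.

    Files shorter than `max_len` are right-padded with `pad_value`
    (an assumption; the padding scheme used in training is not stated).
    """
    with open(path, "rb") as f:
        raw = f.read(max_len)  # reads at most max_len bytes
    seq = list(raw)  # each byte becomes an int in [0, 255]
    seq.extend([pad_value] * (max_len - len(seq)))
    return seq
```

The resulting list can be wrapped with `torch.tensor(seq, dtype=torch.long)` before it is passed to an embedding layer. Note that a pad value of 0 is indistinguishable from a real zero byte; one way around this is to shift all byte values up by one and reserve 0 for padding.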
Presented below is a table detailing the performance of each model.
Architecture | Acc (%) | F1 |
---|---|---|
MalConvBase | 91 | .931 |
MalConv+ | 94 | .951 |
MalConv+ (E16) | 93 | .944 |
MalConv+ (W64) | 94 | .949 |
MC+ (E16,W64) | 94 | .950 |
MC+ (C256) | 91 | .930 |
GRU-CNN | 93 | .946 |
BiGRU-CNN | 91 | .931 |
GRU-CNN (H128) | 93 | .946 |
ResGRU-CNN | 94 | .948 |
AttnGRU-CNN | 94 | .952 |
AttnResGRU-CNN | 94 | .952 |
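
For reference, the two metric columns can be computed from confusion-matrix counts as sketched below. The sketch assumes that malware is the positive class and that accuracy is reported as a percentage; the function name and the counts in the usage lines are made up for illustration and are not results from this project.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Compute accuracy (in percent) and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    accuracy = 100.0 * (tp + tn) / total
    # Guard against empty denominators when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical counts, for illustration only
acc, f1 = accuracy_and_f1(tp=90, fp=10, tn=80, fn=20)
print(f"Acc = {acc:.0f}, F1 = {f1:.3f}")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the benign and malware classes are imbalanced.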
For visualizations of training and model evaluation, refer to the images in the figures directory.
The coding style is dictated by black and isort. You can apply them via
# pip install black isort
make style
Please feel free to submit issues or pull requests.
If you find this repository helpful for your research, please cite as follows.
@misc{dmd,
title = {Deep Malware Detection: A neural approach to malware detection in portable executables},
author = {Tae, Jaesung},
year = 2020,
howpublished = {\url{https://github.com/jaketae/deep-malware-detection}}
}
@misc{raff2017malware,
title = {Malware Detection by Eating a Whole EXE},
author = {Edward Raff and Jon Barker and Jared Sylvester and Robert Brandon and Bryan Catanzaro and Charles Nicholas},
year = 2017,
eprint = {1710.09435},
archiveprefix = {arXiv},
primaryclass = {stat.ML}
}
@article{Raff_2017,
title = {Learning the PE Header, Malware Detection with Minimal Domain Knowledge},
author = {Raff, Edward and Sylvester, Jared and Nicholas, Charles},
year = 2017,
journal = {Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security - AISec ’17},
publisher = {ACM Press},
doi = {10.1145/3128572.3140442},
isbn = 9781450352024,
url = {http://dx.doi.org/10.1145/3128572.3140442}
}
Released under the MIT License.