Optimal Transport for Function-Level and Line-Level Vulnerability Detection

awsm-research/DeepVulMatch


DeepVulMatch: Learning and Matching Latent Vulnerability Representations for Dual-Granularity Vulnerability Detection
(Replication Package)

Table of contents

  1. How to reproduce
  2. Citation

How to reproduce

Environment Setup

First, clone this repository to your local machine and change into its root directory:

git clone https://github.com/awsm-research/DeepVulMatch.git
cd DeepVulMatch

Then install the Python dependencies:

pip install -r requirements.txt
  • We highly recommend following the installation guide for the "torch" library so that you install the version appropriate for your device.

  • GPU use is optional. To use a GPU, you also need to install the CUDA library; see its installation guide.

  • Python 3.9.7 is recommended; it has been fully tested without issues.
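
Before running the experiments, it can help to confirm that the interpreter and "torch" install match the notes above. The following is a minimal sketch, not part of the replication package: the function name `check_environment` and the report keys are hypothetical, and it only assumes that "torch" (if installed) exposes `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_environment(recommended=(3, 9)):
    """Collect basic facts about the local Python/torch setup.

    `recommended` is the (major, minor) version suggested in the README.
    """
    report = {
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "python_matches_recommended": sys.version_info[:2] == tuple(recommended),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check works without torch

            report["torch_version"] = torch.__version__
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception as exc:  # e.g. a broken or partial installation
            report["torch_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False, the scripts can still be run on CPU, since GPU use is optional.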

Reproduction of Experiments

Download and unzip the necessary data via the following commands:

cd data
sh download_data.sh 
cd ..

Reproduce Main Results (Table 1 in the paper)

  • OPTIMATCH (proposed approach)

    • Inference
    cd our_method/optimatch/saved_models/checkpoint-best-f1
    sh download_models.sh
    cd ../..
    sh test_phase_2_150pat.sh
    cd ../..
    
    • Retrain Phase 1 Model
    cd our_method/optimatch
    sh train_phase_1.sh
    cd ../..
    
    • Retrain Phase 2 Model
    cd our_method/optimatch
    sh train_phase_2_150pat.sh
    cd ../..
    
  • Baselines

    To reproduce baseline approaches, please follow the instructions below:

    • Step 1: cd to "./baselines" folder
    • Step 2: cd to the specific baseline folder you wish to reproduce, e.g., "statement_codebert"
    • Step 3: cd to the models folder, e.g., "saved_models/checkpoint-best-f1"
    • Step 4: download the models via "sh download_models.sh" and "cd ../.."
    • Step 5: find the shell script named "train_xyz.sh" (e.g., train_multi_task_baseline_codebert.sh) and run it via "sh train_xyz.sh"

    To run inference, find the shell script named "test_xyz.sh" and run it via "sh test_xyz.sh".
    If "test_xyz.sh" does not exist, remove the "do_train" command from "train_xyz.sh" and run inference via "sh train_xyz.sh".

    A concrete example is provided as follows:

    • Statement-Level CodeBERT
      • Retrain
      cd baselines/statement_codebert/saved_models/checkpoint-best-f1
      sh download_models.sh
      cd ../..
      sh train_multi_task_baseline_codebert.sh
      cd ../..
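
The steps above involve locating the "train_xyz.sh" and "test_xyz.sh" scripts in a folder and running them from the right directory. As a rough sketch (the helper names `find_scripts` and `run_script` are hypothetical and not part of this repository), the same steps could be scripted so that the working directory is passed explicitly instead of being tracked through a chain of cd commands. It assumes only that the scripts follow the "train_*.sh" and "test_*.sh" naming pattern described above.

```python
import subprocess
from pathlib import Path


def find_scripts(folder):
    """Return the train_*.sh and test_*.sh scripts found directly in `folder`."""
    folder = Path(folder)
    return {
        "train": sorted(p.name for p in folder.glob("train_*.sh")),
        "test": sorted(p.name for p in folder.glob("test_*.sh")),
    }


def run_script(folder, script, dry_run=True):
    """Run `sh <script>` with `folder` as the working directory.

    With dry_run=True (the default) nothing is executed; the command
    that would be run is returned so it can be inspected first.
    """
    folder = Path(folder)
    if not (folder / script).is_file():
        raise FileNotFoundError(f"{script} not found in {folder}")
    command = ["sh", script]
    if not dry_run:
        # cwd=folder replaces the manual "cd <folder>" ... "cd ../.." steps.
        subprocess.run(command, cwd=folder, check=True)
    return command


if __name__ == "__main__":
    # Example folder taken from the README; adjust to the baseline you want.
    baseline = Path("baselines/statement_codebert")
    if baseline.is_dir():
        scripts = find_scripts(baseline)
        print(scripts)
        for name in scripts["test"]:
            print(run_script(baseline, name))
```

Calling `run_script(folder, name, dry_run=False)` actually executes the script; the default dry run only reports what would be run, which is useful for checking whether a folder ships a "test_xyz.sh" script at all.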
      

Reproduce Ablation Study (Table 2 in the paper)

  • To reproduce w/o vulnerability codebook & matching, run the following commands:
    • Retrain (skip "sh train_phase_one.sh" if you only want to run inference)
      cd our_method/optimatch/saved_models/checkpoint-best-f1
      sh download_models.sh
      cd ../..
      sh train_phase_one.sh
      sh test_phase_one.sh
      cd ../..
      

Each ablation trial (except w/o vulnerability codebook & matching) consists of phase 1 and phase 2 training, like our OPTIMATCH approach. First, cd to the folder containing the trial you are interested in. To retrain the model for either phase, run "train_xyz.sh". To run inference for either phase, run "test_xyz.sh".

  • To reproduce w/o RNN embedding (mean pooling applied), cd to "./ablation/token_embedding_pooling_mean"
  • To reproduce w/o RNN embedding (max pooling applied), cd to "./ablation/token_embedding_pooling_max"
  • To reproduce OPTIMATCH with N vulnerability centroids, cd to "./ablation/num_patterns"

Citation

Under review at IEEE TDSC.
