Software Vulnerability Classification using Large Language Models

Table of Contents

  • Abstract
  • Research Questions
  • Methodology
  • Conclusion
  • Download Models
  • Repository Structure

ABSTRACT

Software vulnerabilities are critical weaknesses that can compromise the security of a system. While current research primarily focuses on automating their classification and detection using a range of machine learning models, there remains a notable gap in integrating ontologies such as the Vulnerability Description Ontology with Large Language Models (LLMs) to improve classification accuracy. Our study utilizes the National Vulnerability Database (NVD) and the National Institute of Standards and Technology's Vulnerability Description Ontology framework to enhance the classification of these vulnerabilities. The methodology involves an in-depth analysis of NVD data and an investigation of the effectiveness of various LLMs at analyzing vulnerability descriptions across 27 vulnerability categories in 5 noun groups. Our findings reveal that LLMs, particularly BERT and DistilBERT, demonstrate stronger performance than traditional machine learning models and entropy-based methods. Moreover, while expanding the dataset aims to capture a broader range of vulnerabilities, its effectiveness varies, highlighting the crucial role of annotation quality. This research emphasizes the importance of advanced machine learning techniques and quality data annotation in optimizing vulnerability assessment processes in cybersecurity.

RESEARCH QUESTIONS

This research work contributes to the fields of cybersecurity and machine learning by providing empirical evidence on the capabilities of large language models (LLMs) in enhancing software vulnerability classification. It systematically evaluates the capacity of LLMs and compares their effectiveness against traditional models. Additionally, it explores the impact of data extension on model performance and identifies key linguistic features used by these models. The findings offer valuable insights for security professionals, indicating that LLMs can lead to more accurate assessments and reduce response time for mitigation strategies. To support further research and replication of the results, the source code and annotated data used in this study are made available on GitHub [2]. Moreover, the research underscores the importance of quality control in data annotation for improving model performance.

Research Questions:

RQ1: Does the expansion of the dataset lead to enhanced performance in vulnerability classification by large language models?

The first question addresses whether extending the data can lead to stronger models capable of categorizing a wider array of vulnerability types.

RQ2: How do large language models impact the effectiveness of software vulnerability classification?

The second question seeks to evaluate the performance of LLMs in the software vulnerability classification domain. It encompasses two sub-questions:

  • RQ2.1: How do models perform when classifying 27 vulnerability attributes?
    This sub-question aims to compare the different models, transformer-based vs. traditional, in accurately classifying software vulnerability attributes.

  • RQ2.2: Which models have the best average F1 score across the 27 vulnerability attributes?
    This sub-question aims to identify which individual models demonstrate the greatest overall effectiveness; a minimal sketch of how such an average F1 score might be computed appears below.
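As an illustration of the metric behind RQ2.2, the following minimal sketch shows how a per-attribute F1 score and its average across attributes might be computed with scikit-learn. The attribute names, labels, and predictions are invented placeholders, not results from the study.

```python
# Hypothetical sketch: averaging F1 scores across vulnerability attributes.
# Attribute names and predictions below are placeholders, not study data.
from sklearn.metrics import f1_score

# One (y_true, y_pred) pair per VDO attribute.
per_attribute_results = {
    "Remote": ([1, 0, 1, 1], [1, 0, 0, 1]),
    "Read":   ([0, 1, 1, 0], [0, 1, 1, 1]),
    # ... one entry for each of the 27 attributes
}

scores = {
    attr: f1_score(y_true, y_pred)
    for attr, (y_true, y_pred) in per_attribute_results.items()
}

average_f1 = sum(scores.values()) / len(scores)
print(f"Average F1 across attributes: {average_f1:.3f}")
```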

RQ3: What are the prevalent n-grams associated with each vulnerability attribute?

The third question shifts focus to the linguistic aspects, exploring the specific n-grams that are most prevalent and potentially indicative of different types of software vulnerabilities.
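To make RQ3 concrete, here is a small, hypothetical sketch of how prevalent n-grams per vulnerability attribute could be extracted with scikit-learn's CountVectorizer. The example descriptions and attribute names are illustrative and are not taken from the annotated dataset.

```python
# Hypothetical sketch of extracting frequent n-grams per vulnerability attribute.
from sklearn.feature_extraction.text import CountVectorizer

descriptions_by_attribute = {
    "Remote": [
        "remote attacker can execute arbitrary code via crafted request",
        "unauthenticated remote attacker may cause denial of service",
    ],
    "Read": [
        "allows attackers to read sensitive configuration files",
    ],
}

def top_ngrams(texts, ngram_range=(1, 2), k=5):
    """Return the k most frequent n-grams in a list of descriptions."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vectorizer.fit_transform(texts).sum(axis=0).A1
    ranked = sorted(zip(vectorizer.get_feature_names_out(), counts),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

for attribute, texts in descriptions_by_attribute.items():
    print(attribute, top_ngrams(texts))
```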

METHODOLOGY

Our research introduces an approach that leverages the power of transformers to train classifiers capable of analyzing Common Vulnerabilities and Exposures (CVE) reports and inferring their associated Vulnerability Description Ontology (VDO) characteristics. We hypothesize that the lexicon utilized within CVE descriptions contains valuable indicators of these characteristics, and through machine learning, classifiers can be trained to recognize these indicators from historical data. Fig. 4.1 provides a general overview of our proof of concept for examining the applicability of the proposed methodology.


Figure 4.1: An illustration of the proof of concept designed to examine the applicability of CVE description classification.

In summary, we engage in a multi-stage process to validate our approach:

  1. We create a labeled dataset through the manual curation of vulnerability descriptions corresponding to each of the 27 VDO characteristics. In addition, we reuse the dataset from Okutan [29] and merge it with ours.
  2. Transformer-based classifiers are trained using this dataset. The primary goal is to predict one or more classes for text descriptions within each of the five distinct noun groups. These groups feature different classification schemes (a minimal sketch of both setups follows this list):
     • Multi-class classification is used when a text description corresponds to a single label, suitable for the Attack Theater, Context, and Impact Method noun groups. In these groups, the classes are mutually exclusive.
     • Multi-label classification applies to scenarios where a text description fits multiple categories, such as the Logical Impact and Mitigation noun groups. These descriptions may fall into one or more classes. For instance, in the Logical Impact category, an attacker with access to a user's data could both read and write to it.
  3. Finally, we compute a set of evaluation metrics to benchmark the classifiers’ performance.
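The following is a minimal, hypothetical sketch of step 2 using the Hugging Face Transformers library, contrasting a multi-class head (mutually exclusive classes) with a multi-label head (independent classes). The model name, label counts, and decision threshold are assumptions for illustration, not the exact configuration used in the study, and the classification heads shown here would still need to be fine-tuned on the labeled dataset.

```python
# Hypothetical sketch of multi-class vs. multi-label classification heads.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Multi-class head (e.g., Attack Theater): exactly one label per description.
multiclass_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=4)

# Multi-label head (e.g., Logical Impact): one or more labels per description.
multilabel_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=6, problem_type="multi_label_classification")

text = "A remote attacker can read and modify user data via a crafted request."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    # Multi-class: softmax over mutually exclusive classes, keep the argmax.
    mc_probs = torch.softmax(multiclass_model(**inputs).logits, dim=-1)
    # Multi-label: independent sigmoids, keep every class above a threshold.
    ml_probs = torch.sigmoid(multilabel_model(**inputs).logits)

print(mc_probs.argmax(dim=-1), (ml_probs > 0.5).nonzero())
```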

CONCLUSION

Our study provides a comprehensive analysis of the impact of Large Language Models (LLMs) on software vulnerability classification, along with the effectiveness of dataset expansion.

  1. Firstly, our findings suggest that while the expansion of the dataset aimed to capture a broader array of vulnerabilities, the added data did not uniformly enhance model performance, indicating the critical role of annotation quality.
  2. Secondly, LLMs, particularly BERT and DistilBERT, stood out as the top performers in software vulnerability classification. XLNet and DeBERTa V3 also demonstrated strong performances, underscoring the capabilities of these models in this domain. In comparison with Okutan's study, the LLMs outperformed the models reported there. However, we must note that the conventional Decision Tree model and the Ensemble Learning technique also displayed highly competitive performances.
  3. Lastly, n-gram analysis revealed patterns associated with vulnerability attributes, offering insights for targeted mitigation strategies and for staying ahead of emerging threats.

These findings highlight the importance of not only applying advanced machine learning techniques but also maintaining rigorous quality control in data annotation to optimize the performance of models used in software vulnerability classification. This study lays the groundwork for further research into dataset extension and the integration of LLMs to refine vulnerability assessment processes in the field of cybersecurity.

DOWNLOAD MODELS

Google Drive

REPOSITORY STRUCTURE

Folders:

  • Dataset: Contains the combined dataset used in the research - the author's annotated dataset and Okutan et al.'s original dataset.
  • RQ1: The source code for all transformer-based models fine-tuned on the combined dataset above and on Okutan et al.'s original dataset.
  • RQ3: The source code used to extract n-grams.
