
A Natural Language Project that explores the topics of discussion from 1800 Cyber Security articles. Various NLP libraries were used including Gensim, where the Latent Dirichlet Allocation topic modelling approach was used.


87bdharr/Cyber-Topic-Modelling


Cyber-Topic-Modelling

This project explores the topics discussed in 1800 Cyber Security articles published by Krebs on Security (https://krebsonsecurity.com/), using Gensim's implementation of Latent Dirichlet Allocation (LDA).

Useful documents:

  • 'Web Scraping.ipynb' contains the code that scrapes the 1800 Cyber Security articles from the web.
  • 'Topic Modelling.ipynb' contains the code that performs LDA topic modelling on the corpus of text extracted from the 1800 articles.

I have pickled a number of the large files so you do not have to run the web scraping locally:

  • The 'krebs_dataset.pickle' file contains the raw text and metadata extracted from the web as a dataframe
  • The 'df_preprocessed.pickle' file contains the outputs from the text pre-processing I performed on the krebs_dataset dataframe ready for topic modelling.
  • The 'lda_body.pickle' and 'lda_title.pickle' files contain the pickled LDA models created from the bodies and titles of the 1800 articles.
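
The pickled files listed above can be loaded with a few lines of Python. The sketch below is one possible approach, assuming the files sit in the current working directory; the helper names `load_pickle` and `load_available` are hypothetical and do not come from the notebooks. Unpickling the dataframes requires pandas to be installed, and unpickling the LDA models requires Gensim, ideally in versions close to those used to create the files.

```python
import pickle
from pathlib import Path

# File names as listed in this README.
PICKLES = [
    "krebs_dataset.pickle",
    "df_preprocessed.pickle",
    "lda_body.pickle",
    "lda_title.pickle",
]


def load_pickle(path):
    """Load one pickled object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def load_available(names, folder="."):
    """Return {file name: object} for each listed pickle found in folder."""
    loaded = {}
    for name in names:
        path = Path(folder) / name
        if path.exists():
            loaded[name] = load_pickle(path)
    return loaded


if __name__ == "__main__":
    # Print the type of every pickle that is present locally.
    for name, obj in load_available(PICKLES).items():
        print(name, type(obj).__name__)
```

Files that are missing are skipped rather than raising an error, so the same snippet works whether or not every pickle has been downloaded.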

Additional Documents:

  • 'Article Recommender.ipynb' contains code that recommends articles based on similar article titles
  • 'href.list' contains the list of each article's url which is used later in the web scraping code
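
To give a feel for title-based recommendation, here is a deliberately simple sketch that ranks titles by word overlap (Jaccard similarity). This is a stand-in technique chosen for illustration, not the method used in 'Article Recommender.ipynb'; the function names and the example titles are hypothetical.

```python
def tokenize(title):
    """Lower-case a title and split it into a set of words."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(query_title, titles, top_n=3):
    """Return up to top_n titles most similar to query_title by word overlap."""
    query = tokenize(query_title)
    scored = [
        (jaccard(query, tokenize(title)), title)
        for title in titles
        if title != query_title
    ]
    # Highest similarity first; titles sharing no words are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical titles, not taken from the dataset.
example_titles = [
    "Bank phishing scam targets customers",
    "New botnet hits routers",
    "Phishing scam hits bank customers",
]
print(recommend("Phishing scam targets bank", example_titles, top_n=2))
```

Word overlap ignores word order and meaning, so a fuller recommender would more likely compare TF-IDF vectors or LDA topic distributions of the titles.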

This project was created using Azure Notebooks, and the original code can be found at https://notebooks.azure.com/bdharr/projects/cyber-nlp-project
