This project explores the topics in 1800 cyber security articles published by Krebs on Security (https://krebsonsecurity.com/) using Gensim's Latent Dirichlet Allocation (LDA) implementation.
Useful documents:
- 'Web Scraping.ipynb' contains the code created to scrape the 1800 Cyber Security articles from the web
- 'Topic Modelling.ipynb' contains the code created to perform LDA topic modelling on the corpus of text extracted from the 1800 articles.
I have pickled a number of the large files so you do not have to run the web scraping locally:
- The 'krebs_dataset.pickle' file contains the raw text and metadata extracted from the web as a dataframe
- The 'df_preprocessed.pickle' file contains the outputs from the text pre-processing I performed on the krebs_dataset dataframe ready for topic modelling.
- The 'lda_body.pickle' and 'lda_title.pickle' files contain the pickled versions of the LDA models created from the bodies and titles of the 1800 articles.
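
The pickled files listed above can probably be loaded along the following lines. This is a sketch that assumes the files were written with Python's standard `pickle` module and sit in the current working directory. The `load_pickle` helper is a hypothetical name, not something defined in the notebooks. Unpickling the dataframes and models also requires pandas and Gensim to be installed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Unpickle and return the object stored at `path` (hypothetical helper)."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # File names come from the list above; their location is an assumption.
    names = (
        "krebs_dataset.pickle",
        "df_preprocessed.pickle",
        "lda_body.pickle",
        "lda_title.pickle",
    )
    for name in names:
        if Path(name).exists():
            obj = load_pickle(name)
            print(name, "->", type(obj).__name__)
        else:
            print(name, "not found in the current directory")
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.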
Additional Documents:
- 'Article Recommender.ipynb' contains code that recommends articles based on similar article titles
- 'href.list' contains the URL of each article, which is used later in the web scraping code
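
To give a rough idea of title-based recommendation, here is a small sketch. It is not the method used in 'Article Recommender.ipynb'. It swaps in a simple word-overlap (Jaccard) similarity, and the titles and function names are made up for illustration.

```python
def tokenize(title):
    """Lower-case a title and split it into a set of words."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(query, titles, top_n=3):
    """Return up to `top_n` titles that share the most words with `query`."""
    query_tokens = tokenize(query)
    scored = [
        (jaccard(query_tokens, tokenize(title)), title)
        for title in titles
        if title != query
    ]
    # Highest similarity first; titles with no shared words are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical titles, not taken from the dataset.
titles = [
    "ATM skimmer found at bank",
    "New ransomware hits hospitals",
    "Bank ATM skimmer gang arrested",
]
print(recommend("ATM skimmer found at bank", titles, top_n=1))
```

A model-based approach, such as comparing topic distributions from the title LDA model, would capture similarity beyond exact word matches.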
This project was created using Azure Notebooks, and the original code can be found at https://notebooks.azure.com/bdharr/projects/cyber-nlp-project