From 1e309b1e91baf77188384479514efc0cd0dab487 Mon Sep 17 00:00:00 2001 From: asdfghjkxd <73705042+asdfghjkxd@users.noreply.github.com> Date: Tue, 26 Oct 2021 21:37:17 +0800 Subject: [PATCH] Final Release --- pyfiles/README.md | 139 ++-------------------------------------------- 1 file changed, 4 insertions(+), 135 deletions(-) diff --git a/pyfiles/README.md b/pyfiles/README.md index 43d0a77..b7b120d 100644 --- a/pyfiles/README.md +++ b/pyfiles/README.md @@ -1,4 +1,4 @@ -# News Analyser App +# ArticPy ## Introduction This app is used for the analysis of news articles provided in the format of a CSV file. @@ -20,14 +20,14 @@ default to 8501. Click on the link to open up your web browser and to access the ## Use of Streamlit's Web App If you wish to access the functionalities provided in this app but you do not wish to download the files in this repository and run it natively on your own device, you may choose to navigate to the webpage -https://share.streamlit.io/asdfghjkxd/newsanalyserapp/main/app.py to use the app. However, because it is +https://share.streamlit.io/asdfghjkxd/articpy/main/app.py to use the app. However, because it is run on the web, you must manually download the output file and save it on your device with the relevant filename and file extensions. ## Explanation of Modules and the Functions in the App This section will go from the root folder down into the subfolders, and attempt to explain the functions of the files -that are stored in the folders. Not all files will be explained some groups of files have similar purposes as explained -in the parent folder explanation. +that are stored in the folders. Not all files/folders will be explained in the following section. Only files/folders +which are important to you are explained below.
@@ -50,137 +50,6 @@ be free from modifications. Erroneous modification of the files in this folder m the app. If you wish to change the functionality of the app and is knowledgeable on how to do so, feel free to change the code to suit your needs. -#### _pages/document_term_matrix.py_ File -This file is used for the preparation and creation of a Document-Term Matrix. Users must first determine the size and -data format of the raw data they wish to use and choose the appropriate input file method and format. After loading -the data, a CountVectorizer object will be instantiated. A "bag-of-words" string will also be created; this string -contains all the words that are used in the text of interest. This string is then converted to a dictionary and then -fit-transformed using the CountVectorizer object. The resulting fit-transformed data is then converted to an array -and then fed into a pandas DataFrame, creating the Document-Term Matrix. This file can then be saved into a CSV file -and then downloaded into the user's system. - -#### _pages/load_clean_visualise.py_ File -This file is used for the loading, cleaning and visualising of raw data used for the app. Users must first determine -size and data format of the raw data they wish to use and choose the appropriate input file method and format. -Following that, the user can specify flags and parameters to pass into the app. After this, if the user specified to -clean the data, two separate copies of the raw data will be created, one for the purpose of creating a cleaned copy of -the raw data, and the second being a modified copy of the cleaned raw data [this step lemmatizes all the words present -in the text].
- -For the cleaning stage, the steps taken are as such: -1. The raw data is piped into a pandas DataFrame -2. The raw data is duplicated into two identical DataFrames, **DF0** (the DataFrame to be cleaned without modification) - and **DF1** (the DataFrame that will have all the words lemmatized) -3. Using for loops and panda's internal iterative functions, we iterate through each row of **DF0** - 1. For each row, we used Python's Regular Expressions library _re_ to systematically remove any unwanted characters, - leaving only alphanumerics and punctuations - 2. Using pandas' _at_ function, we assign the new expression to the current row of the iteration - 3. This process is repeated until all rows have been iterated through - 4. Note that this dataset should only be used as a primary cleaning step to get rid of illegal characters and not - for any sort of NLP task, since the data does not contain only lemmatized or stemmed words -4. Similarly, we iterate through **DF1** using the same method as above - 1. For each row, we used Python's Regular Expressions library _re_ to systematically remove any unwanted characters, - leaving only alphanumerics - 2. We then instantiate a TextBlob object from the _textblob_ library and pass to it the string stored in the row - 3. The TextBlob object then assign tags to the words and lemmatizes the words in the string based on the tags - 4. Using pandas' _at_ function, we assign the new expression to the current row of the iteration - 5. This process is repeated until all rows have been iterated through - 6. This data is safe for NLP tasks as all the words have already been lemmatized -5. Empty rows in both **DF0** and **DF1** are then removed. -6. If the above steps are successful, users will then be able to download the data into their system as a CSV file. -7. If the user specifies the `verbose` flag, both datasets will be printed out and displayed onto the user's screen, - subjected to the maximum size of DataFrame that can be printed out. - -#### _pages/toolkit_discriminator.py_ File -This file is used for the discrimination between fake news and legitimate news, and is also able to discriminate -between machine-written articles and human-written articles. This module uses BERT from Google. For this module, users -must use a GPU or TPU accelerated system, as the code uses Tensorflow backend to run the machine learning models needed -to execute the discrimination function.
- -For GPUs, users are warned that only NVIDIA GPUs are permitted due to the restrictions of Tensorflow, which is made -specifically for the CUDA cores found on NVIDIA GPUs. If you are using an AMD GPU -https://medium.com/analytics-vidhya/install-tensorflow-2-for-amd-gpus-87e8d7aeb812, you must install specific drivers -to allow Tensorflow to run. Refer to this article for more information on implementing Tensorflow on an AMD GPU. -Alternatively, you may wish to spin up a Virtual Machine instance on your favourite Cloud Service Provider to run -the app.
- -After enabling GPU or TPU acceleration, the user may wish to specify their own parameters and flags during the -machine learning process. Once the parameters are loaded up or redefined, the user may then proceed with the -Inference and Classification task. - -Users may run the discriminator in training mode, and create a model for distinguishing between fake and real news, -though raw, classified data must be provided for the training to be successful. - -#### _pages/toolkit_nlp.py_ File -This file contains all the NLP Functionalities used in the app. - -**Word Cloud** -This functionality allows the user to input a CSV or XLSX file containing a bag-of-words representation of the text -they wish to analyse to create a Word Cloud of the text. -
- -**Summary** -This file is used for creating summaries for text inputted to it. - -How this module works -1. The user needs to upload a CSV/XLSX file with the correct format. A template is provided for the user to view - and download, though the template is limited to just the CSV format. -2. After filling in the details the user wants to process, the user must then decide on the size of the dataset they - wish to upload and the file format to process the data in. -3. The user will then be presented with a set of flags and parameters they can modify to change the behaviour of the - app. -4. Once the user loads up the data, the text will begin the summarising process - 1. Stopwords are removed first - 2. The loaded data is then iterated through - 3. The necessary helper lists are then created for ever pass of the iteration - 4. The words are then converted to all lowercases and then counted - 5. Only words that are used commonly are retained - 6. The words are then appended together into one large string -5. The resulting data is then printed out to the screen, subjected to the overall size of the dataset loaded -
- -**Sentiment Analysis** -This file is used for sentiment analysis of the text users input into the app. Users can specify whether to run VADER -or TextBlob for their sentiment analysis, though TextBlob will be used by default if the user does not modify the -settings of the app.
- -Note the following differences between the modules used: -* TextBlob is more well-suited for formal pieces of text, such as reports and news articles -* VADER is more well-suited for informal pieces of text, such as text messages, and is able to better process text - containing emojis. - -How this module works -1. The user needs to upload a CSV/XLSX file with the correct data format. A template is provided for the user to view - and download, though the template is limited to just the CSV format. -2. After filling in the details the user wants to process, the user must then decide on the size of the dataset they - wish to upload and the file format to process the data in. -3. The user will be presented with a set of flags which the user can modify to change the behaviour of the app, and - then be prompted to load the data. -4. Once loaded, stopwords will be removed from the text (this functionality has not been implemented so far) -5. If VADER is enabled, VADER will be used to conduct sentiment analysis on the text, otherwise TextBlob will be used - by default. Both implementations of VADER and TextBlob are simplistic, i.e. they obtain the word score from a giant - corpus and obtain the average word scores for the entire text or sentences. -6. After the analysis is done, users can then view and save the data into a CSV file. -
- -**Topic Modelling** -This file is used to conduct topic modelling on the document users input into the app. Users are to select the Document -ID (the index of the text in the data) and model the topic of the text; this is a limitation of the app. - -How this module works -1. The user uploads a CSV/XLSX file with the correct data format. A template is provided for the user to view and - download, though the template is provided in the CSV format only. -2. After filling in the details the user wants to process, the user must then decide on the size of the dataset they - wish to upload and the file format to process the data in. -3. The user will be presented with a set of flags which the user can modify to change the behaviour of the app, and - then be prompted to load the data. -4. Once loaded, the data will be processed into a gensim Dictionary object. An LDA model will also be initialised. -5. The LDA model will then be trained using the gensim Dictionary. -6. After the model is trained, pyLDAvis will be used to visualise the results of the training. -7. Users can then view and save the data into a CSV file. - -
- #### _pages/multipage_ File This file is used for the implementation of the multiple page functionality of the app. Do not modify this file.