This is a sentiment analysis project based on a Twitter dataset found on Kaggle (https://www.kaggle.com/kazanova/sentiment140). We use Word2Vec and Logistic Regression, all written in Python, to create our models, and Flask to create the web application.
-
To run the project, we recommend using a virtual environment. We use Pipenv to create the environment and install the requirements; you can see the documentation here: https://pypi.org/project/pipenv/
-
To create a shell inside your project directory use:
pipenv shell
-
To install the requirements from the requirements.txt file use:
pipenv install -r requirements.txt
-
Then you can simply run:
python app.py
In this section we describe the data processing done on the dataset to produce the model used in the application. You can find all the details in our Data_processing.ipynb notebook.
- After importing the data, we cleaned it by removing abbreviations, stop words and punctuation, and we also lemmatized the tweets (using nltk).
- We took all the words present in the dataset and trained our Word2Vec model on them with a window size of 5 and a vector size of 300.
- To use the model in the application, we called the save method on the model, which created three files: one .model file and two .npy files.
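
The processing steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the stop-word list is a tiny hypothetical stand-in for the nltk one, abbreviation handling and lemmatization are left out, and the use of gensim 4.x, the helper names clean_tweet and train_word2vec, and the min_count=1 setting are all assumptions.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the project itself relied on nltk for stop words and lemmatization.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it", "i"}


def clean_tweet(text):
    """Lowercase a tweet, strip punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]


def train_word2vec(tokenized_tweets, path=None):
    """Train Word2Vec with the window and vector size named in the text."""
    # Imported here so clean_tweet stays usable without gensim installed.
    # Assumption: gensim 4.x (the parameter was called `size` in 3.x).
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=tokenized_tweets,
        vector_size=300,
        window=5,
        min_count=1,  # assumption: keep every word present in the dataset
    )
    if path is not None:
        # For large models, gensim stores big arrays in companion .npy files.
        model.save(path)
    return model
```

With a list of raw tweets in a variable named tweets (hypothetical), calling train_word2vec([clean_tweet(t) for t in tweets], path="word2vec.model") would train and save in one step; the file name is only an example.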
After completing our data processing, we moved on to training our Logistic Regression model. These were the steps taken:
- We represented each tweet as the sum of the vectors of its words and stored these sums in a matrix that served as our features.
- We split the dataset into training and testing sets, with the training set containing 30% of the data.
- We trained our Logistic Regression model using the scikit-learn library and saved the model as a pkl file using the pickle library. The results of the model on the testing set were as follows:
- When we wanted to use the model, we simply called pickle.load to restore the model and analyze the user's input.
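
The training steps above might look roughly like the sketch below. It is an illustration under assumptions rather than the project's actual code: the function names, the model.pkl file name, max_iter=1000 and random_state=0 are hypothetical, and word_vectors stands for any mapping from a word to its vector (for example the wv attribute of a gensim Word2Vec model, or a plain dict).

```python
import pickle

import numpy as np


def tweet_vector(tokens, word_vectors, size=300):
    """Sum the vectors of the known words in a tweet (zeros if none are known)."""
    total = np.zeros(size)
    for word in tokens:
        if word in word_vectors:
            total += word_vectors[word]
    return total


def build_features(tokenized_tweets, word_vectors, size=300):
    """Stack one summed vector per tweet into a feature matrix."""
    return np.vstack([tweet_vector(t, word_vectors, size) for t in tokenized_tweets])


def train_and_save(features, labels, path="model.pkl"):
    """Split the data, fit Logistic Regression, pickle it, return test accuracy."""
    # Imported here so the feature helpers above only need NumPy.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # train_size=0.3 follows the split stated in the text.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, train_size=0.3, random_state=0
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    with open(path, "wb") as f:
        pickle.dump(clf, f)
    return clf.score(X_test, y_test)


def load_model(path="model.pkl"):
    """Restore the pickled classifier, as the application does before predicting."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

To score new input, the restored model can be given a feature row built the same way, for instance load_model().predict(build_features([tokens], word_vectors)), where tokens is the cleaned user text. Only unpickle files you created yourself, since pickle.load can execute arbitrary code.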