For this project, I decided to use NLP and RNNs to predict the score a user gives a product from the text of their written review. I used the Amazon review dataset, restricted to the Books product category, to build and test the model. The dataset contains a large amount of user and product metadata, but I only used the Review Body, Review Headline, and Star Rating columns to build the model.
For EDA, please refer to this other project of mine; the key point for this project is that I was dealing with heavily imbalanced data.
- Read Data in Through Spark
- Clean Text Through Spacy
- Build Word2Vec on the training set
- Use Word2Vec embeddings and LSTM to predict user scores
- Develop Flask App
To do this project, I needed to read the data in through Spark from S3, clean the text in Spark with user-defined functions, and save the resulting data frames back to S3 for later access. I first tried using an AWS EMR cluster but ran into problems bootstrapping the instances, so I ended up running a Docker image on an EC2 instance and increasing Spark's executor memory to prevent memory overflow errors.
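As a rough sketch of that setup (the S3 path, executor memory value, and exact column names below are placeholders rather than my exact settings):

from pyspark.sql import SparkSession

# bump executor memory to avoid the overflow errors mentioned above
spark = (SparkSession.builder
         .appName('review-cleaning')
         .config('spark.executor.memory', '12g')
         .getOrCreate())

# the S3 path is a placeholder for wherever the Books review TSV lives
reviews = (spark.read
           .option('sep', '\t')
           .option('header', 'true')
           .csv('s3a://my-review-bucket/amazon_reviews_us_Books.tsv.gz')
           .select('review_headline', 'review_body', 'star_rating'))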
The function I wrote to clean the text uses Beautiful Soup and unidecode to remove formatting, and spaCy with the en_core_web_sm model as the tokenizer. Each document is the concatenation of the review headline and review body, the two text columns of the dataset.
import re
import codecs

import unidecode
import spacy
from bs4 import BeautifulSoup

# contraction_mapping is a dict of contractions to their expansions, defined elsewhere
nlp = spacy.load('en_core_web_sm')

def text_cleaner(Doc):
    if Doc is not None:
        # strip HTML/markup formatting and keep only the visible text
        Doc = ' '.join(BeautifulSoup(Doc, 'html.parser').findAll(text=True))
        # convert unicode escapes and accented characters to plain ASCII
        try:
            Doc = unidecode.unidecode(codecs.decode(Doc, 'unicode_escape'))
        except Exception:
            Doc = unidecode.unidecode(Doc)
        Doc = re.sub("’", "'", Doc)
        # expand contractions using the predefined mapping
        Doc = ' '.join([contraction_mapping[t] if t in contraction_mapping else t
                        for t in Doc.split(" ")])
        Doc = nlp(Doc)
        final_tokens = []
        for t in Doc:
            # drop whitespace, numbers, URLs, and stop words
            if t.is_space or t.like_num or t.like_url or t.is_stop:
                continue
            if t.lemma_ == '-PRON-':
                # spaCy lemmatizes pronouns to '-PRON-', so keep the original token text
                final_tokens.append(t.text)
            else:
                # keep only the alphabetic characters of the lemma
                sc_removed = re.sub("[^a-zA-Z]", '', t.lemma_)
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
        return ' '.join(final_tokens)
    else:
        return Doc
This code takes each document, removes any markup formatting, and converts all Unicode into ASCII, then replaces contractions with their full expansions. The resulting string is read into spaCy, where it is tokenized. I then run through the tokens, removing whitespace, numbers, URLs, and stop words, and return the lemmatized words, keeping pronouns as their original text. Finally, using Spark's built-in split function, I break the cleaned strings into lists of tokens for the Word2Vec and predictive models down the line. This results in the following transformation:
I then applied a random split to the Spark data frame to get train, validation, and test sets to work with for the rest of the project. With this done, I exported these data frames to an S3 bucket for later use.
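To make these Spark steps concrete, here is a minimal sketch of how the cleaning UDF, the split into token lists, the random split, and the export could fit together; the split ratios and bucket name are placeholders, and text_cleaner and reviews are assumed from above.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# wrap the cleaner in a Spark UDF and combine the two text columns into one document
clean_udf = F.udf(text_cleaner, StringType())

cleaned = (reviews
           .withColumn('document', F.concat_ws(' ', 'review_headline', 'review_body'))
           .withColumn('clean_text', clean_udf('document'))
           # the built-in split turns each cleaned string into a list of tokens
           .withColumn('tokens', F.split('clean_text', ' ')))

# random split into train / validation / test sets
train_df, val_df, test_df = cleaned.randomSplit([0.8, 0.1, 0.1], seed=42)

# save each split to S3 for later use
train_df.write.parquet('s3a://my-review-bucket/train/', mode='overwrite')
val_df.write.parquet('s3a://my-review-bucket/val/', mode='overwrite')
test_df.write.parquet('s3a://my-review-bucket/test/', mode='overwrite')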
For data featurization, I decided to use a Word2Vec model because I understand the process and am familiar with Gensim. Word2Vec is an unsupervised technique for embedding a word into a dense float vector that captures the word's context. I trained the model with the skip-gram method, in which a window slides across the text and the model uses the center word to predict the surrounding context words. After training is complete, the learned weights become the Word2Vec vectors.
After some testing and checking, I found that these settings:
min_count: 15  # number of times a word must appear to be included in the vocabulary
window: 4      # the size of the context window used when predicting each word
epoch: 300     # number of training passes over the corpus
size: 300      # the dimensionality of the resulting vectors
worked best, producing the embeddings you can see here:
Original Word | Most Similar | Second Most Similar | Third Most Similar | Fourth Most Similar | Fifth Most Similar |
---|---|---|---|---|---|
fiction | Nationally | Blackoaks | appreciators | Perspectives | TSNOTD |
fantasy | manipulatively | Palacios | Oregan | precisly | Auel |
romance | jb | Absorb | Aishling | deem | Institutionalizing |
religion | frei | Hadoke | Cariou | malapropisms | conceited |
history | Promoters | cockamamie | aspiration | dustbin | Undercurrents |
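For reference, a minimal Gensim sketch of training a model with these settings (parameter names follow Gensim 4, where size and epoch are called vector_size and epochs; train_tokens stands in for the token lists from the training split):

from gensim.models import Word2Vec

w2v = Word2Vec(
    sentences=train_tokens,
    vector_size=300,   # "size": dimensionality of the vectors
    window=4,          # context window
    min_count=15,      # minimum word frequency
    epochs=300,        # "epoch": training passes over the corpus
    sg=1,              # skip-gram training
    workers=4,
)

# inspect the nearest neighbours of a word, as in the table above
w2v.wv.most_similar('fiction', topn=5)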
To build my predictive model, I looked at sentiment-analysis neural networks as a starting point, since a review score can be seen as a type of sentiment. To make use of the Word2Vec model, the first layer of my network is an embedding layer whose weights I initialized with the Word2Vec vectors and then froze (made untrainable). That layer feeds into an LSTM, then a dense layer with a tanh activation, and finally a single dense output with a ReLU activation. I round the predictions and score the model with MSE.
To make the inputs a fixed size, I found the longest document in the training data and padded every sequence to that length so the model could capture the full context of each review. Each word was then replaced by an integer index from a dictionary before the data was fed into the model.
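A minimal sketch of that indexing and padding step, assuming w2v is the trained Gensim model and train_tokens is the list of token lists from the training set (index 0 is reserved for padding):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# map each word to its row in the embedding matrix, shifted by one so index 0 can mean padding
word_index = {word: i + 1 for i, word in enumerate(w2v.wv.index_to_key)}

def encode(token_lists, max_len):
    # replace words with integer indices (unknown words fall back to the padding index)
    seqs = [[word_index.get(t, 0) for t in tokens] for tokens in token_lists]
    return pad_sequences(seqs, maxlen=max_len, padding='post')

# pad everything to the length of the longest document in the training set
max_len = max(len(tokens) for tokens in train_tokens)
X_train = encode(train_tokens, max_len)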
My model ended up looking like this:
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 300) 12360900
_________________________________________________________________
lstm_1 (LSTM) (None, 75) 112800
_________________________________________________________________
dense_1 (Dense) (None, 20) 1520
_________________________________________________________________
dense_2 (Dense) (None, 1) 21
=================================================================
Total params: 12,475,241
Trainable params: 114,341
Non-trainable params: 12,360,900
_________________________________________________________________
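For reference, a minimal Keras sketch that builds a model of this shape; the optimizer, epoch count, and batch size below are placeholders, and X_train/y_train stand for the padded sequences and star ratings from the previous step.

import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

# stack the Word2Vec vectors into an embedding matrix, keeping row 0 for padding
embedding_matrix = np.vstack([np.zeros((1, 300)), w2v.wv.vectors])

model = Sequential([
    # frozen embedding layer initialized from the Word2Vec weights
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=300,
              embeddings_initializer=Constant(embedding_matrix), trainable=False),
    LSTM(75),
    Dense(20, activation='tanh'),
    Dense(1, activation='relu'),
])

model.compile(optimizer='adam', loss='mse')
# y_train holds the star ratings for the padded training sequences
model.fit(X_train, y_train, epochs=10, batch_size=256)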
Credits:
- Galvanize Instructor Team
- The spaCy deep learning tutorial
As part of this project, I developed a Flask app inside a Docker container so anyone can run the model and play around with it. You can pull and run it with the following commands:
docker pull aidanjared/reviewpredictor:latest
docker run -p 80:5000 aidanjared/reviewpredictor:latest
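With the container running, the -p 80:5000 flag maps the app's port 5000 inside the container to port 80 on the host, so it should be reachable at http://localhost/ in a browser.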