Which characteristics of businesses are associated with different operational metrics derived from customer ratings and reviews?
- Do the services a business provides, its hours of operation and the time of year play any role in the number of positive ratings or reviews?
- Does the activity of an individual reviewer demonstrate any patterns that could explain an increased number of positive ratings/reviews?
- Is the text contained within customer reviews associated with any of the provided or engineered features?
The data was retrieved from here. It includes the business, review, user, tip, checkin and photo JSON files; the `photo.json` file was not utilized in the data warehouse.
A data warehouse was constructed by joining the five datasets used in the analysis. The review set served as the backbone of the warehouse in order to retain the maximum number of reviews. The data was joined in the following order:
1. `review`
2. `business`
3. `user`
4. `tip`
5. `checkin`
The `business` data was used to obtain the business names for the `tip` set. This allowed features such as the number of compliments to be engineered. The `checkin` set allowed features to be created based on the time information. Exploratory data analysis (EDA) was then completed to examine the various components of the warehouse, including the reviews, users, business ratings, locations and time. Prior to the NLP processing step, the warehouse was filtered to the food categories with over 30k counts and the seven states with the highest review counts.
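As a rough illustration, the join order above can be reproduced with `pandas`. This is a minimal sketch, not the project's actual code: the file names and join keys are assumptions based on the standard Yelp-style dataset schema, and joining `tip` on both keys can duplicate review rows when a user left multiple tips for the same business.

```python
import pandas as pd

# File names assumed; each file is newline-delimited JSON.
review = pd.read_json('yelp_academic_dataset_review.json', lines=True)
business = pd.read_json('yelp_academic_dataset_business.json', lines=True)
user = pd.read_json('yelp_academic_dataset_user.json', lines=True)
tip = pd.read_json('yelp_academic_dataset_tip.json', lines=True)
checkin = pd.read_json('yelp_academic_dataset_checkin.json', lines=True)

# Reviews form the backbone; left joins keep every review row.
warehouse = (
    review
    .merge(business, on='business_id', how='left', suffixes=('', '_business'))
    .merge(user, on='user_id', how='left', suffixes=('', '_user'))
    .merge(tip, on=['business_id', 'user_id'], how='left', suffixes=('', '_tip'))
    .merge(checkin, on='business_id', how='left', suffixes=('', '_checkin'))
)
```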
For the text processing step, all non-English reviews were removed using `langid` for language detection and `dask` for parallel processing. All non-word characters were then removed. The reviews were then stripped of stopwords using `nltk` and lemmatized with `WordNetLemmatizer`. Further EDA was conducted after processing.
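A minimal sketch of this preprocessing pipeline, assuming the review text lives in a `text` column of the warehouse frame; the partition count and the regex used for non-word removal are illustrative choices:

```python
import re
import langid
import nltk
import dask.dataframe as dd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    """Lowercase, strip non-word characters, remove stopwords, lemmatize."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()
              if tok not in STOPWORDS]
    return ' '.join(tokens)

# Parallelize over the warehouse with dask.
ddf = dd.from_pandas(warehouse, npartitions=16)
ddf['lang'] = ddf['text'].apply(lambda t: langid.classify(t)[0],
                                meta=('lang', 'object'))
ddf = ddf[ddf['lang'] == 'en']  # keep English reviews only
ddf['clean_text'] = ddf['text'].apply(clean_review,
                                      meta=('clean_text', 'object'))
warehouse_nlp = ddf.compute()
```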
The following modeling approaches were then evaluated (minimal sketches of each pipeline follow the list):

- `Word2Vec` from `gensim` to create the vocabulary -> classification using `XGBoost`, `CatBoost` and `LightGBM` utilizing a `GPU`.
- `tokenizer` from `Keras` with `num_words=100000` and `max_length=500` -> `Bidirectional LSTM` using `Tensorflow` with `embedding_size=300` and `batch_size=256`.
- Tokenization and vectorization using `tf.keras.layers.TextVectorization` with `max_features=50000` and `sequence_length=300` -> classifier consisting of an `Embedding` layer and `GlobalAveragePooling1D`, with `Dropout(0.2)` after each layer and a final `Dense` layer. The baseline `Embedding` size used was `embedding_dim=32`. The loss function was `losses.BinaryCrossentropy`, with `Adam` as the optimizer at the default learning rate. The models were trained for 10 `epochs` with `batch_size=1`. Hyperparameters were tuned with grid and random search, tracked in `Weights & Biases`.
- `BoW` and `TF-IDF` models constructed in `PyTorch` by sampling 20,000 reviews from each target group, building a vocabulary with `max_len=300` and `max_vocab=10000`, tokenizing with `wordpunct_tokenize`, removing rare words and constructing a `BoW vector` and a `TF-IDF vector`. `batch_size=1` was used for both models. A `FeedfowardTextClassifier` with two hidden layers performed the classification. The loss function was `CrossEntropyLoss`, with `Adam` as the optimizer, `learning_rate=6e-5` and `scheduler=CosineAnnealingLR(optimizer, 1)`. Early stopping was triggered if the validation loss exceeded the validation loss from the three previous `epochs`.
- A classifier built on `BertModel.from_pretrained('bert-base-uncased')` in `PyTorch`, using `BertTokenizer.from_pretrained('bert-base-uncased')` with `max_length=300`. The architecture employed an initial `Dropout(p=0.4)` followed by a linear layer, another dropout, a `ReLU` and a final linear layer. The loss function was `CrossEntropyLoss` with the `Adam` optimizer and `learning_rate=5e-6`. `batch_size=1` and `batch_size=8` were tested in different models.
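A sketch of the first pipeline. The `Word2Vec` hyperparameters beyond `vector_size`, the averaged-document-vector scheme, and the `target` label column are assumptions; `CatBoost` and `LightGBM` would be swapped in for `XGBClassifier` in the same way:

```python
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier

# Cleaned reviews split into tokens; `target` is the binarized rating label.
tokenized = [t.split() for t in warehouse_nlp['clean_text']]
w2v = Word2Vec(sentences=tokenized, vector_size=300,
               window=5, min_count=5, workers=4)

def doc_vector(tokens):
    """Average the in-vocabulary word vectors to get one vector per review."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in tokenized])
y = warehouse_nlp['target'].to_numpy()

# GPU training; for XGBoost >= 2.0 use tree_method='hist', device='cuda'.
clf = XGBClassifier(tree_method='gpu_hist', n_estimators=500)
clf.fit(X, y)
```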
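A sketch of the `Keras` tokenizer -> `Bidirectional LSTM` pipeline under the stated settings; the number of LSTM units (64 here), the sigmoid output head, and the epoch count are assumptions not given in the writeup:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(train_texts)  # train_texts: raw review strings
padded = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=500)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=100000, output_dim=300),  # embedding_size=300
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),      # units assumed
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, np.array(train_labels), batch_size=256, epochs=10)
```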
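A sketch of the `TextVectorization` classifier following the layer order described above; the `Weights & Biases` sweep is omitted and the train arrays are placeholders:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses

max_features, sequence_length, embedding_dim = 50000, 300, 32

vectorize_layer = layers.TextVectorization(
    max_tokens=max_features, output_mode='int',
    output_sequence_length=sequence_length)
vectorize_layer.adapt(np.array(train_texts))  # build vocabulary on raw strings

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    layers.Embedding(max_features, embedding_dim),  # baseline embedding_dim=32
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1),  # final Dense layer; logits for the binary target
])
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',  # Adam at the default learning rate
              metrics=['accuracy'])
model.fit(np.array(train_texts), np.array(train_labels),
          epochs=10, batch_size=1)
```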
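A sketch of the `PyTorch` `BoW` model. The hidden layer sizes are assumptions, the `TF-IDF` variant differs only in how the vector is built, and the three-epoch early-stopping check is omitted for brevity:

```python
import torch
import torch.nn as nn
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

max_len, max_vocab = 300, 10000

# Vocabulary from the 20,000-per-class sample; words outside the top
# max_vocab (the rare words) are simply dropped.
counter = Counter(tok for text in sampled_texts
                  for tok in wordpunct_tokenize(text.lower()))
vocab = {w: i for i, (w, _) in enumerate(counter.most_common(max_vocab))}

def bow_vector(text):
    """Count-based BoW vector; the TF-IDF variant reweights these counts."""
    vec = torch.zeros(len(vocab))
    for tok in wordpunct_tokenize(text.lower())[:max_len]:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

class FeedfowardTextClassifier(nn.Module):  # class name as used in the project
    def __init__(self, vocab_size, hidden1=64, hidden2=32, num_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.out = nn.Linear(hidden2, num_classes)

    def forward(self, x):
        return self.out(torch.relu(self.fc2(torch.relu(self.fc1(x)))))

model = FeedfowardTextClassifier(len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 1)

for epoch in range(10):  # batch_size=1: one review at a time
    for text, label in zip(sampled_texts, sampled_labels):
        optimizer.zero_grad()
        loss = criterion(model(bow_vector(text).unsqueeze(0)),
                         torch.tensor([label]))
        loss.backward()
        optimizer.step()
    scheduler.step()
```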
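A sketch of the `BERT` classifier with the stated dropout/linear/ReLU ordering; the intermediate linear width (256) is an assumption:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class BertClassifier(nn.Module):
    """Dropout -> Linear -> Dropout -> ReLU -> Linear, as described above."""
    def __init__(self, hidden=256, num_classes=2):  # hidden width assumed
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout1 = nn.Dropout(p=0.4)
        self.linear1 = nn.Linear(self.bert.config.hidden_size, hidden)
        self.dropout2 = nn.Dropout(p=0.4)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        x = self.dropout2(self.linear1(self.dropout1(pooled)))
        return self.linear2(self.relu(x))

enc = tokenizer(list(train_texts), max_length=300, padding='max_length',
                truncation=True, return_tensors='pt')
model = BertClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
logits = model(enc['input_ids'][:8], enc['attention_mask'][:8])  # batch_size=8 run
```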