Movie reviews are classified as positive or negative by training two different classifiers (Naive Bayes and Logistic Regression). The solution is written in Python, and the best classifier achieved 81% test accuracy, which is comparable to the accuracy of other state-of-the-art classifiers trained on the same dataset at the time (links are available at the end of this README).
View the exploratory notebooks here:
The selected text processing pipeline starts by transforming each text line into a corresponding token vector. Stop-words (pronouns, etc.) and other tokens such as symbols are filtered out, and the remaining words are stemmed and lower-cased. Then a certain set of tokens is selected as the features for the model (i.e. the bag-of-words approach) in the form of unigrams, bigrams, or trigrams (ordered sequences of 1, 2, or 3 words, respectively). Certain features are frequent in both positive and negative reviews and are therefore not very useful for telling the two classes apart. This is why we incorporate TF-IDF (Term Frequency-Inverse Document Frequency) into the pipeline. Each input token vector (text line) is then converted into a sparse binary vector that signifies the presence or absence of each feature in the input line. Finally, we pass the input vectors to the training algorithm along with the labels to start the supervised learning. We train the same algorithm multiple times with different text processing and training hyperparameters, picking the best configuration by n-fold cross-validation.
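The feature extraction steps described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual implementation: the stop-word list, the function names, and the three example sentences are all hypothetical, stemming is omitted for brevity, and the IDF formula shown is the plain `log(N / df)` variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; a real pipeline uses a fuller one).
STOP_WORDS = {"the", "a", "an", "it", "is", "this", "was", "and"}


def tokenize(line):
    """Lower-case a text line, keep alphabetic tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, max_n=2):
    """Return all n-grams (as space-joined strings) for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(docs, max_n=2):
    """Map every n-gram seen in the corpus to a column index."""
    vocab = sorted({g for doc in docs for g in ngrams(tokenize(doc), max_n)})
    return {gram: idx for idx, gram in enumerate(vocab)}


def binary_vector(line, vocab, max_n=2):
    """Sparse binary representation: the set of feature indices present in the line."""
    return {vocab[g] for g in ngrams(tokenize(line), max_n) if g in vocab}


def idf_weights(docs, vocab, max_n=2):
    """Inverse document frequency per feature index: log(N / df)."""
    df = Counter()
    for doc in docs:
        df.update(binary_vector(doc, vocab, max_n))
    return {idx: math.log(len(docs) / df[idx]) for idx in vocab.values()}


docs = ["A great movie", "A boring movie", "Great acting, great plot"]
vocab = build_vocabulary(docs)
idf = idf_weights(docs, vocab)

# "movie" occurs in two of the three lines, "boring" in only one,
# so "boring" receives the larger IDF weight.
print(idf[vocab["movie"]], idf[vocab["boring"]])
```

Features that appear in most lines get an IDF close to zero, which is how TF-IDF down-weights tokens that are common to both classes.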
In a standard ML workflow, it is good practice to follow the KISS principle, especially at the start. This means training the simpler models first (e.g. Naive Bayes or a linear model such as Logistic Regression). The reason is to avoid premature optimization, since the knowledge needed to make informed decisions is not available yet. For example, the corpus has only ~10k sentence samples, and introducing a complex algorithm right away is likely to result in a predictive model with high variance. Therefore, Naive Bayes and Logistic Regression are suitable choices here.
Both algorithms achieved very similar cross-validation scores, so we can consider them equivalent in performance, although Naive Bayes was faster to train.
If the label frequencies in the dataset were imbalanced, then I would have to use a performance metric that is capable of handling such a situation. A commonly accepted approach is to use Precision and Recall, computed for each label: Precision is the share of predictions of that label that are correct, TP / (TP + FP), and Recall is the share of true samples of that label that are found, TP / (TP + FN). If it were appropriate to give equal importance to the two, they would be combined into a single score using the harmonic mean (i.e. the F1-score). This would constitute a proper handling of an imbalanced dataset.
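To make the metrics concrete, here is a small sketch that computes Precision, Recall, and F1 for one label. The function name and the toy labels are hypothetical and not taken from this project; the example uses a deliberately imbalanced set to show why accuracy alone can mislead.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)

    # Guard against division by zero when a label is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Imbalanced toy data: 9 negative samples, 1 positive sample.
y_true = ["neg"] * 9 + ["pos"]
# A degenerate classifier that always predicts "neg".
y_pred = ["neg"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("neg:", precision_recall_f1(y_true, y_pred, "neg"))
print("pos:", precision_recall_f1(y_true, y_pred, "pos"))
```

The always-negative classifier scores 90% accuracy here, yet its Recall and F1 for the positive label are zero, which is exactly the failure that per-label metrics expose.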
You can find them in a separate repository.
Run the following commands in a shell:
- `pip install -r requirements.txt` to install the dependencies.
- `python main.py --dry-run` to test the configuration (optional). Note: this overwrites the previously saved trained model in order to test file I/O.
- `python main.py` to train the model and see the evaluation results.
- If constants related to dataset shuffle/split (such as `RANDOMNESS_SEED` and `DATASET_TEST_SPLIT_RATIO`) are changed, then the `data/raw_structured` folder has to be deleted for the new constants to be applied. The folder will be regenerated automatically.
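As an illustration of why a cached split goes stale when these constants change, here is a hypothetical sketch of a seeded shuffle and split. The constant values, the function name, and the sample data are assumptions for the example; the project's actual splitting code may differ.

```python
import random

RANDOMNESS_SEED = 42            # hypothetical value, for illustration only
DATASET_TEST_SPLIT_RATIO = 0.2  # hypothetical value, for illustration only


def shuffle_and_split(samples, seed, test_ratio):
    """Shuffle deterministically with the given seed, then split into (train, test)."""
    rng = random.Random(seed)   # local generator, so global random state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


samples = [f"sentence {i}" for i in range(10)]
train, test = shuffle_and_split(samples, RANDOMNESS_SEED, DATASET_TEST_SPLIT_RATIO)
print(len(train), len(test))
```

Because the split is fully determined by the seed and the ratio, a split stored on disk reflects the values used when it was generated, and it has to be rebuilt before changed values can have any effect.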
sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
- cmasch/cnn-text-classification
- elijas/review_thingie (fork from ashirviskas/reviewthingie, but with bugfix for the dataset test/train splitting)
- err8029/homework