This project demonstrates sentiment analysis on the Large Movie Review Dataset using Natural Language Processing (NLP) techniques. It includes data preprocessing, model training, and evaluation of four different machine learning algorithms: Logistic Regression, Multinomial Naive Bayes, Random Forest, and Support Vector Machine.
The Large Movie Review Dataset contains:
- 50,000 labeled reviews (25,000 positive and 25,000 negative), split evenly into 25,000 for training and 25,000 for testing.
- An additional 50,000 unlabeled reviews for unsupervised learning.

You can download the dataset from the following link: Large Movie Review Dataset
If you use the dataset, please cite the following ACL 2011 paper:
@inproceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
- Extracting and Loading the Dataset: The dataset is extracted and loaded into pandas DataFrames.
- Cleaning the Text: The text is cleaned by removing URLs, punctuation, and other non-alphanumeric characters.
- Removing Stop Words: Common stop words are removed to reduce noise in the data.
- Lemmatization: Reduces words to their base or dictionary form, considering the context (e.g., "better" becomes "good").
- Stemming: Reduces words to their base or root form by removing suffixes (e.g., "running" becomes "run").
- Tokenization: Splits text into individual words or tokens (e.g., "The cat sat on the mat" becomes ["The", "cat", "sat", "on", "the", "mat"]).
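
The cleaning, tokenization, and stop-word steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the lowercasing step, and the tiny `STOP_WORDS` set are assumptions made for the example. Stemming and lemmatization are left out because they normally rely on a library such as NLTK (for example its `PorterStemmer` and `WordNetLemmatizer`).

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a
# fuller list, such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "on", "to",
              "for", "in", "this", "was"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_text(text):
    """Lowercase, then drop URLs, punctuation, and other non-alphanumerics."""
    text = URL_RE.sub(" ", text.lower())   # lowercasing is an assumption
    text = NON_ALNUM_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into whitespace-separated tokens."""
    return text.split()


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Run cleaning, tokenization, and stop-word removal in sequence."""
    return remove_stop_words(tokenize(clean_text(text)))


tokens = preprocess("The cat sat on the mat! See https://example.com for more.")
print(tokens)
```

Tokenizing before stop-word removal keeps the stop-word check a simple set lookup per token.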
Four machine learning algorithms are trained on TF-IDF-vectorized features:
- Logistic Regression
- Multinomial Naive Bayes
- Random Forest
- Support Vector Machine
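
One way to set up the four models is a scikit-learn pipeline per algorithm, each pairing a `TfidfVectorizer` with a classifier. The sketch below is a hedged illustration: the hyperparameters, the choice of `LinearSVC` for the Support Vector Machine, the `build_models` helper, and the toy corpus are all assumptions, since the text does not state the settings actually used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models(random_state=0):
    """Return one TF-IDF + classifier pipeline per algorithm (hypothetical helper)."""
    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Multinomial Naive Bayes": MultinomialNB(),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=random_state),
        # Assumption: a linear SVM; the kernel used by the project is not stated.
        "Support Vector Machine": LinearSVC(),
    }
    return {name: make_pipeline(TfidfVectorizer(), clf)
            for name, clf in classifiers.items()}


# Toy corpus standing in for the preprocessed reviews (1 = positive, 0 = negative).
train_texts = [
    "wonderful film with great acting",
    "loved every minute of this movie",
    "brilliant story and superb cast",
    "an enjoyable and moving experience",
    "terrible plot and awful acting",
    "boring movie, a waste of time",
    "the worst film this year",
    "dull story with a weak cast",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = build_models()
for model in models.values():
    model.fit(train_texts, train_labels)

sample = "great cast and a wonderful story"
predictions = {name: int(model.predict([sample])[0])
               for name, model in models.items()}
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training split only. `LinearSVC` has no `predict_proba`, so its `decision_function` scores would be the input for ROC-AUC.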
The models are evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
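
To make the metrics above concrete, here is a small from-scratch sketch for binary labels. The function names and the toy labels and scores are illustrative assumptions; in practice scikit-learn's `sklearn.metrics` module provides equivalents (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: six reviews with model scores, thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, while ROC-AUC depends only on how the scores rank positives against negatives.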
The results are visualized using bar charts and ROC curves.
Model | Accuracy | Precision | Recall | F1-score | ROC-AUC |
---|---|---|---|---|---|
Logistic Regression | 0.8816 | 0.8760 | 0.8907 | 0.8833 | 0.9509 |
Multinomial Naive Bayes | 0.8472 | 0.8529 | 0.8414 | 0.8471 | 0.9254 |
Random Forest | 0.8364 | 0.8468 | 0.8239 | 0.8351 | 0.9183 |
Support Vector Machine | 0.8854 | 0.8772 | 0.8978 | 0.8874 | 0.9544 |
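
As a quick consistency check on the table, F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below copies the values from the table above; the helper name `f1_from` is hypothetical. Because the inputs are already rounded to four decimals, a difference of about one unit in the last digit is expected.

```python
# (precision, recall, reported F1) copied from the results table.
reported = {
    "Logistic Regression": (0.8760, 0.8907, 0.8833),
    "Multinomial Naive Bayes": (0.8529, 0.8414, 0.8471),
    "Random Forest": (0.8468, 0.8239, 0.8351),
    "Support Vector Machine": (0.8772, 0.8978, 0.8874),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, f1_reported) in reported.items():
    f1_recomputed = f1_from(precision, recall)
    print(f"{name}: reported {f1_reported:.4f}, recomputed {f1_recomputed:.4f}")
```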
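This project demonstrates how to apply and compare different machine learning algorithms for sentiment analysis using the Large Movie Review Dataset. The Support Vector Machine model achieved the highest score on every metric, including accuracy (0.8854) and ROC-AUC (0.9544), narrowly ahead of Logistic Regression (0.8816 and 0.9509). Multinomial Naive Bayes and Random Forest trailed both on all five metrics.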
This project is licensed under the MIT License.