Twitter-data-analysis

Twitter data analysis using the Bag-of-Words methodology

The main objective is to predict the polarity of the tweets provided in the dataset. For simplicity, and to make it easy to test the accuracy of the proposed model, the dataset also contains a column with the polarity of each tweet.

PREPROCESSING

The first step is preprocessing the tweets. Preprocessing is essential because it makes the raw text ready for mining. 1) Twitter handles (usually marked by '@') hardly convey any useful information, so they are removed. 2) Punctuation marks are removed. 3) Most short words do not add much value (e.g. 'in', 'all', 'edx'); they are removed too. A rough sketch of these steps is shown below.
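A minimal sketch of the cleaning steps above, assuming the tweets live in a CSV file (the file name `tweets.csv` and the column name `tweet` are assumptions, not taken from this repository):

```python
import re
import pandas as pd

def preprocess(tweet):
    """Clean one raw tweet: drop handles, punctuation, and very short words."""
    tweet = re.sub(r'@\w+', '', tweet)           # remove Twitter handles like @user
    tweet = re.sub(r'[^a-zA-Z#\s]', '', tweet)   # remove punctuation and digits
    # keep only words longer than 3 characters
    return ' '.join(w for w in tweet.split() if len(w) > 3)

df = pd.read_csv('tweets.csv')                   # assumed dataset file name
df['clean_tweet'] = df['tweet'].apply(preprocess)
```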

TOKENIZATION

Tokens are individual words or terms, and tokenization is the process of splitting a string of text into tokens, which is easily done with the split() function.
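For example, continuing from the cleaned column above (a sketch, reusing the assumed `clean_tweet` column):

```python
# Split each cleaned tweet into a list of word tokens
tokenized = df['clean_tweet'].apply(lambda text: text.split())
```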

STEMMING

Stemming is a rule-based process of stripping suffixes ('ing', 'ly', 'es', 's') from words, reducing several variants to a common base form. This step is done with Python's NLTK library.
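A short sketch using NLTK's PorterStemmer (one common choice of stemmer; the exact stemmer used here is an assumption):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Reduce each token to its stem, e.g. 'playing' and 'played' both become 'play'
stemmed = tokenized.apply(lambda tokens: [stemmer.stem(t) for t in tokens])
# Join the stemmed tokens back into one string per tweet for the later steps
df['clean_tweet'] = stemmed.apply(' '.join)
```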

VISUALIZATION

A word cloud is a visualization in which the most frequent words appear larger and less frequent words appear smaller. It also helps us check whether the polarity assigned to words looks correct.
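A minimal sketch using the wordcloud and matplotlib packages (the figure size and word-cloud dimensions are arbitrary choices):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all cleaned tweets into one string; frequent words will appear larger
all_words = ' '.join(df['clean_tweet'])
wc = WordCloud(width=800, height=500, random_state=42).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```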

FEATURE EXTRACTION

BAG-OF-WORDS
Bag of Words is a method for representing text as numerical features. The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (https://en.wikipedia.org/wiki/Bag-of-words_model).
A bag of words can easily be created with scikit-learn's CountVectorizer (documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#examples-using-sklearn-feature-extraction-text-countvectorizer, usage: https://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/).
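A sketch of building the bag-of-words matrix with CountVectorizer (the max_df, min_df, and max_features values are illustrative, not tuned settings from this repository):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_df=0.9, min_df=2,
                             max_features=1000, stop_words='english')
# Sparse document-term matrix: one row per tweet, one column per vocabulary word
bow = vectorizer.fit_transform(df['clean_tweet'])
print(bow.shape)   # (number_of_tweets, number_of_features)
```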

MODEL BUILDING

I used scikit-learn's logistic regression classifier to build the multi-class classifier, and then used the metrics module to compute the accuracy of the model against the polarity labels already provided (http://dataaspirant.com/2017/05/15/implement-multinomial-logistic-regression-python/).
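A minimal sketch of training and scoring the classifier, assuming the provided labels are in a column named `polarity` (that column name and the train/test split ratio are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hold out part of the data to measure accuracy against the provided polarity
X_train, X_test, y_train, y_test = train_test_split(
    bow, df['polarity'], test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000)   # scikit-learn handles the multi-class case
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, pred))
```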
