Machine Learning SVM Tune

Research has shown that social media strongly influences individual purchasing decisions. This project combines data from two websites, Yelp and Groupon, to study how the demand for a product changes given the deals offered and the ratings it has received. It uses Support Vector Regression and Random Forests to predict the demand for a product or service.

DATASET

The Yelp and Groupon data were collected with a Python-based toolbox. The goal was to gather the Groupon deals offered for products and services in Washington, D.C., and then the corresponding Yelp reviews. The Groupon records contain the deal period, expiry time, prices, discounts, and locations where the coupons can be used. The Yelp records contain general information about the product, such as its location, along with the review and rating attributes given by users (author, rating, comments) and the link between the Groupon and Yelp entries. The result is a cross-social-media dataset covering both websites.

YELP

Alt text

GROUPON

Alt text

LINKING YELP TO GROUPON

Alt text

(Dataset not posted due to confidentiality)

FEATURE EXTRACTION

Chi Square

We calculate the chi-square statistic between each feature variable and the target variable to test whether a relationship exists between them. If the target is independent of a feature, that feature can be discarded; if the two are dependent, the feature is informative and is kept. The result of feature selection with “Sold Quantity” as the target is given below:

Alt text

The top 25 attributes were selected for training the model. The final dataset therefore has 168,184 rows, 25 feature columns, and 1 column for the target, “sold quantity”.
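The scoring-and-ranking idea can be sketched in a few lines. This is only an illustration, not the project's R code: it assumes the features and the target have already been binned into categories, and the function names and toy data are hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    if len(feature) != len(target):
        raise ValueError("feature and target must have the same length")
    n = len(feature)
    observed = Counter(zip(feature, target))  # joint counts per (feature, target) cell
    feature_totals = Counter(feature)         # row marginals
    target_totals = Counter(target)           # column marginals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


def top_k_features(columns, target, k):
    """Rank feature names by chi-square score against the target and keep the top k."""
    scores = {name: chi_square_statistic(values, target)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy, made-up data: one feature tracks the target, the other is independent of it.
target = ["high", "high", "low", "low"]
columns = {
    "has_discount": ["yes", "yes", "no", "no"],
    "weekday": ["mon", "tue", "mon", "tue"],
}
print(top_k_features(columns, target, k=1))
```

A larger statistic means stronger evidence against independence, so keeping the top-scoring columns mirrors the selection of the top 25 attributes described above.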

SUPPORT VECTOR MACHINES

Of the 168,184 rows, two-thirds were used for training and the remaining one-third for testing. The training rows were chosen randomly without replacement. Training was then carried out using SVM regression.
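A minimal sketch of such a split, assuming the rows are held in a Python list; the function name, the seed, and the integer stand-in rows are illustrative and not taken from the project scripts.

```python
import random


def train_test_split_rows(rows, train_fraction=2 / 3, seed=0):
    """Randomly split rows into train and test sets without replacement."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)       # each index appears exactly once after shuffling
    n_train = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(168184))  # stand-in for the real rows
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Shuffling the indices once and cutting the list guarantees that no row lands in both sets.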

The SVM regression used a “radial” kernel with “epsilon” type regression, and the gamma and cost parameters were tuned to minimize error. The trained model was then used to predict the “sold quantity” values for the test set, and the observed and predicted values were compared to measure the accuracy of the model. The predictions were rounded to the nearest 100 to bring them closer to the observed data. The predicted and observed data were then used for further evaluation of accuracy.
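The rounding step might look like the sketch below. It assumes that the rounding means snapping each prediction to the nearest multiple of 100, with halves rounded up; the helper name and the sample values are hypothetical.

```python
import math


def round_to_nearest(value, step=100):
    """Round to the nearest multiple of `step`, rounding halves up (an assumption)."""
    return int(math.floor(value / step + 0.5)) * step


predictions = [149.9, 150.0, 1234.0, 37.0]  # made-up predictions
print([round_to_nearest(p) for p in predictions])  # [100, 200, 1200, 0]
```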

METHOD:

Alt text

SVM TUNING

Before training, SVM tuning was performed to find the best values of cost and gamma. The tuning yielded a cost of 16 and a gamma of 0.5, and these values were used to train the model. We also use ‘epsilon regression’, under which no penalty is imposed when the predicted value lies within a distance of epsilon from the actual value. The default value of 0.1 was used for epsilon.

Alt text
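To make the roles of epsilon and gamma concrete, here is a small sketch of the epsilon-insensitive loss and the radial (RBF) kernel in their standard textbook forms. It is not the library implementation used in svm.R, and the function names are illustrative.

```python
import math


def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# A prediction within epsilon of the actual value incurs no penalty.
print(epsilon_insensitive_loss(10.0, 10.05))
# A prediction outside the tube is penalized by the excess error.
print(epsilon_insensitive_loss(10.0, 10.5))
# Identical points have kernel value 1; it decays with distance at a rate set by gamma.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]), rbf_kernel([0.0], [2.0]))
```

A larger gamma makes the kernel fall off faster with distance, and a larger cost penalizes errors outside the tube more heavily, which is why the two are tuned together.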

The plot of the actual vs predicted values for SVM is shown below:

Alt text

RANDOM FORESTS

We train a Random Forest model with two main parameters: the number of trees grown (ntree) and the number of predictors sampled at each node split (mtry). These two parameters largely define the random forest algorithm. The value of ‘mtry’ should be chosen so that the error is lowest, so we plot the error against mtry to find its optimal value.

The figure below plots the out-of-bag (OOB) error against the Random Forest parameter mtry.

Alt text

The plot shows that the error initially decreases as mtry increases and then levels off at mtry=9. Values of mtry greater than 9 can therefore be used to get the lowest error. We then train the model with mtry equal to 10 and the default number of trees.
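Reading the plateau off an error curve can also be done programmatically. The sketch below assumes the OOB errors have been collected into a dictionary keyed by mtry; the numbers are made up for illustration and are not results from this project, and the function name and tolerance are hypothetical.

```python
def smallest_mtry_at_plateau(oob_error_by_mtry, tolerance=1e-3):
    """Return the smallest mtry whose OOB error is within `tolerance` of the best."""
    best = min(oob_error_by_mtry.values())
    return min(m for m, err in oob_error_by_mtry.items() if err - best <= tolerance)


# Made-up OOB errors for illustration only.
oob_error_by_mtry = {1: 0.90, 3: 0.60, 5: 0.45, 7: 0.41, 9: 0.40, 11: 0.40, 13: 0.40}
print(smallest_mtry_at_plateau(oob_error_by_mtry))  # 9
```

Any mtry at or beyond the returned value lies on the flat part of the curve, so picking a slightly larger value costs extra computation per split without changing the error much.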

The plot of the actual vs predicted values for Random Forest is shown below:

Alt text

The Random Forest predictions are more precise than the support vector machine predictions for this dataset, because the points in the plot lie closer to the y=x line.

EVALUATIONS

We use the Coefficient of Determination (R2) and the Root Mean Square Error (RMSE) to evaluate the effectiveness of the trained models and the predictions they make. The results are shown in the table below:

Alt text
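For reference, both metrics can be computed directly from the observed and predicted values. This is a minimal sketch using the standard definitions; the function names and the toy numbers are illustrative and are not the values reported in the table.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy, made-up values.
actual = [100, 200, 300, 400]
predicted = [100, 200, 300, 500]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

A lower RMSE and an R2 closer to 1 both indicate predictions that sit nearer the y=x line in the actual vs predicted plots.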

FILES

preprocess.R - preprocessing the data
svm.R - running svm model
randomf.R - running Random Forest model