This repository contains my attempts at implementing some statistical learning algorithms, written purely for learning and understanding, and mainly inspired by the book "An Introduction to Statistical Learning". It surely contains some mistakes, so feel free to report them!

Decision Tree (Regression)

Decision trees can be used for both regression and classification. This implementation handles regression and could be extended to classification. Below is an example of a decision tree built from a single predictor and a quantitative response:

Decision tree:

Decision Tree

Training data and model fit chart: Training data and predictions
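To make the idea concrete, here is a minimal sketch of a regression tree on a single predictor. It is not the code from this repository: the names `best_split`, `build_tree`, `predict` and the `min_leaf` parameter are hypothetical, and the split rule (minimising the residual sum of squares over midpoints between distinct predictor values) is an assumption about how such a tree is usually grown.

```python
def mean(values):
    return sum(values) / len(values)


def rss(values):
    """Residual sum of squares around the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total RSS) of the best split, or None if x is constant."""
    best = None
    xs = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best


def build_tree(x, y, min_leaf=2):
    """Grow the tree recursively; each leaf stores the mean response."""
    split = best_split(x, y)
    if split is None or len(y) < 2 * min_leaf:
        return {"value": mean(y)}
    t = split[0]
    left = [(xi, yi) for xi, yi in zip(x, y) if xi < t]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi >= t]
    if len(left) < min_leaf or len(right) < min_leaf:
        return {"value": mean(y)}
    return {
        "threshold": t,
        "left": build_tree([p[0] for p in left], [p[1] for p in left], min_leaf),
        "right": build_tree([p[0] for p in right], [p[1] for p in right], min_leaf),
    }


def predict(tree, xi):
    """Walk down the tree until a leaf is reached."""
    while "value" not in tree:
        tree = tree["left"] if xi < tree["threshold"] else tree["right"]
    return tree["value"]


# Toy data (made up for illustration): two plateaus separated by a gap.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
tree = build_tree(x, y, min_leaf=4)
```

On this toy data the tree makes a single split in the gap between the two plateaus, and each leaf predicts the mean of its plateau.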

Random Forest (Regression)

The Random Forest algorithm is built on top of the Decision Tree algorithm above. In short, it grows n trees, each on a bootstrap sample of the original training dataset. At each split, only m randomly chosen candidate predictors are considered (instead of all of them). The prediction of a Random Forest for a given observation is the mean of the predictions of the individual trees.

Random Forest example
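The three ingredients described above (bootstrap samples, m random candidate predictors per split, averaging) can be sketched as follows. This is only an illustration under my own assumptions, not the repository's code: `grow_tree`, `random_forest`, `predict_forest`, `n_trees`, `min_leaf` and `seed` are hypothetical names, and the toy dataset is made up.

```python
import random


def mean(values):
    return sum(values) / len(values)


def rss(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, y, m, rng, min_leaf=2):
    """Regression tree that considers only m random predictors at each split."""
    if len(y) < 2 * min_leaf:
        return {"value": mean(y)}
    best = None
    for j in rng.sample(range(len(rows[0])), m):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i, r in enumerate(rows) if r[j] < t]
            right = [i for i, r in enumerate(rows) if r[j] >= t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = rss([y[i] for i in left]) + rss([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # no admissible split among the sampled predictors
        return {"value": mean(y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([rows[i] for i in left], [y[i] for i in left], m, rng, min_leaf),
        "right": grow_tree([rows[i] for i in right], [y[i] for i in right], m, rng, min_leaf),
    }


def predict_tree(tree, row):
    while "value" not in tree:
        go_left = row[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["value"]


def random_forest(rows, y, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) observations with replacement.
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(grow_tree([rows[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """The forest prediction is the mean of the individual tree predictions."""
    return mean([predict_tree(tree, row) for tree in forest])


# Toy data: the first predictor drives the response, the second one is noise.
rows = [[i, (i * 7) % 5] for i in range(20)]
y = [0.0 if r[0] < 10 else 10.0 for r in rows]
forest = random_forest(rows, y, n_trees=25, m=1, seed=0)
prediction = predict_forest(forest, [2, 0])
```

A common choice for regression is to take m around a third of the number of predictors, but any value between 1 and the number of predictors works with this sketch.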

K-means clustering

K-means is a clustering algorithm: it partitions the observations of a dataset into K homogeneous clusters. Here is an example of clustering performed on a dataset that contains two variables, with K = 3.

Dataset before K-means clustering: Sample dataset

Dataset after K-means clustering: Sample dataset clustered
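For reference, the usual iterative procedure (often called Lloyd's algorithm) can be sketched in a few lines. I am assuming here that the initial centroids are K observations picked at random and that squared Euclidean distance is used; the function name `kmeans` and its parameters are hypothetical, and the points are made up.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random observations as start
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data with two variables, clustered with k = 3.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (0.0, 10.0), (1.0, 10.0), (0.0, 11.0),
]
labels, centroids = kmeans(points, 3)
```

The result depends on the random start, so K-means is usually run several times with different seeds and the run with the smallest within-cluster variation is kept.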

Multiple linear regression

Multiple linear regression is perhaps the most basic statistical learning algorithm for predicting quantitative responses. I use the ordinary least squares (OLS) method to estimate the parameters. Here is an example performed on a dataset that contains one variable with an exponential shape:

Linear regression plot
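As a reminder of what ordinary least squares computes, here is a small sketch that solves the normal equations (X'X) beta = X'y by Gauss-Jordan elimination. The function name `ols` and the toy data are hypothetical, and the repository may well estimate the parameters differently.

```python
def ols(X, y):
    """Estimate [intercept, beta_1, ..., beta_p] by ordinary least squares."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y] of the normal equations.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes X'X is invertible (no perfectly collinear predictors).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data generated without noise from y = 1 + 2 * x1 - 3 * x2.
X = [[0, 1], [1, 0], [2, 3], [3, 1], [4, 4]]
y = [1 + 2 * a - 3 * b for a, b in X]
beta = ols(X, y)  # should recover roughly [1, 2, -3]
```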

The linearRegression.indicators() function returns some statistics about the fitted model. The standard errors of the coefficients are computed by generating several bootstrap sample datasets.

Linear regression indicators
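The bootstrap idea can be sketched as follows for a simple regression with one predictor: refit the model on many resampled datasets and take the standard deviation of each coefficient across the refits. The names `fit_line`, `bootstrap_se`, `n_boot` and `seed` are hypothetical, and this is not the code behind linearRegression.indicators().

```python
import random
import statistics


def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def bootstrap_se(x, y, n_boot=200, seed=0):
    """Bootstrap standard errors of the intercept and the slope."""
    rng = random.Random(seed)
    n = len(x)
    intercepts, slopes = [], []
    while len(slopes) < n_boot:
        # Resample the observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # degenerate resample: the slope is undefined
        b0, b1 = fit_line(xs, [y[i] for i in idx])
        intercepts.append(b0)
        slopes.append(b1)
    return statistics.stdev(intercepts), statistics.stdev(slopes)


# Toy data: a line with a small deterministic perturbation.
x = [float(i) for i in range(20)]
y = [1 + 2 * xi + ((i * 37) % 11 - 5) / 10 for i, xi in enumerate(x)]
se_intercept, se_slope = bootstrap_se(x, y)
```

More bootstrap samples give a more stable estimate of the standard errors, at the cost of more refits.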

k-nearest neighbors

k-nearest neighbors can be used to predict both quantitative and qualitative responses. It picks the K nearest neighbors of a given observation, then estimates the response by averaging the neighbors' responses for a quantitative variable, or by taking the most frequent response for a categorical one. The knn.cv() function chooses the best value of K by performing cross-validation over a range of values (from 1 to 10 by default), using 10-fold CV by default. Here is an example performed on a dataset that contains 2 predictors and a qualitative variable with two levels, with K = 3.

Training dataset:

knn training dataset cross validation error rate

Test dataset with predicted responses for K = 3:

knn training predictions
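Here is a rough sketch of k-nearest neighbors classification with cross-validation over K. It mirrors the defaults described above (K from 1 to 10, 10 folds) but is not the actual knn.cv() code: `knn_predict`, `knn_cv`, `ks`, `folds` and `seed` are hypothetical names, Euclidean distance is an assumption, and the dataset is made up.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, row, k):
    """Majority vote among the k training points closest to `row`."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )[:k]
    # On a tied vote, the class encountered first (closest neighbor) wins.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def knn_cv(X, y, ks=range(1, 11), folds=10, seed=0):
    """Return (best k, {k: cross-validated error rate})."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[f::folds] for f in range(folds)]  # disjoint folds
    errors = {}
    for k in ks:
        wrong = 0
        for part in parts:
            held_out = set(part)
            train = [i for i in idx if i not in held_out]
            train_X = [X[i] for i in train]
            train_y = [y[i] for i in train]
            wrong += sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in part)
        errors[k] = wrong / len(X)
    return min(errors, key=errors.get), errors


# Toy data: 2 predictors, two well separated classes of 20 points each.
X = [(i % 5, i // 5) for i in range(20)] + [(10 + i % 5, 10 + i // 5) for i in range(20)]
y = ["a"] * 20 + ["b"] * 20
best_k, errors = knn_cv(X, y)
label = knn_predict(X, y, (1, 1), k=3)
```

For a quantitative response, the majority vote in `knn_predict` would be replaced by the mean of the neighbors' responses, and the error rate by a mean squared error.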