Predictive Analytics with Python

These are my notes from working through the book Learning Predictive Analytics with Python by Ashish Kumar, published in February 2016.

General

###Chapter 1: Getting Started with Predictive Modelling

  • Installed the Anaconda package.
  • Python 3.5 has been installed.
  • The book follows Python 2, so some of the code is modified along the way for Python 3.

###Chapter 2: Data Cleaning

  • Reading the data: variations and examples
  • Data frames and delimiters.

####Case 1: Reading a dataset using the read_csv method

  • File: titanicReadCSV.py
  • File: titanicReadCSV1.py
  • File: readCustomerChurn.py
  • File: readCustomerChurn2.py
  • File: changeDelimiter.py
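
A minimal sketch of reading delimited data and handling a non-comma delimiter. It uses the standard-library `csv` module rather than pandas, and it is not the code from the files listed above. The sample text and the `read_rows` helper are made up for illustration.

```python
import csv
import io

def read_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Hypothetical sample data: the same records with two different delimiters.
comma_text = "name,age\nAlice,29\nBob,41\n"
slash_text = "name/age\nAlice/29\nBob/41\n"

rows_comma = read_rows(comma_text)
rows_slash = read_rows(slash_text, delimiter="/")
```

With pandas, the delimiter is passed in a similar way through the `sep` argument of `read_csv`.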

####Case 2: Reading a dataset using the open method of Python

  • File: readDatasetByOpenMethod.py

####Case 3: Reading data from a URL

  • Modified the code so that it works and prints the dataset line by line as dictionaries.
  • File: readURLLib2Iris.py
  • File: readURLMedals.py

####Case 4: Miscellaneous cases

  • File: readXLS.py
  • Created the file above to read from both .xls and .xlsx

####Basics: Summary, dimensions, and structure

  • File: basicDataCheck.py
  • Created the file above to read from both .xls and .xlsx

####Handling missing values

  • File: basicDataCheck.py
  • RE: Treating missing data like NaN or None
  • Deletion or imputation
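
The two strategies above (deletion and imputation) can be sketched in plain Python. This is an illustration only, not the contents of basicDataCheck.py; the helper names and the `ages` sample are assumptions, and mean imputation is just one of several possible imputation rules.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing(values):
    """Deletion: keep only the observed values."""
    return [v for v in values if not is_missing(v)]

def impute_mean(values):
    """Imputation: replace each missing value with the mean of the observed ones."""
    observed = drop_missing(values)
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]

ages = [22.0, None, 38.0, float("nan"), 30.0]  # made-up sample
```

Deletion shrinks the dataset, while imputation keeps its length but introduces estimated values.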

####Creating dummy variables

  • File: basicDataCheck.py
  • Split into new variables 'sex_female' and 'sex_male'
  • Remove column 'sex'
  • Add both dummy columns created above.
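
The three steps above can be sketched without pandas as follows. The `add_sex_dummies` helper and the `passengers` records are hypothetical; the actual script presumably works on a pandas data frame (for example with `pandas.get_dummies`).

```python
def add_sex_dummies(records):
    """Return copies of the records with 'sex' replaced by two 0/1 columns."""
    encoded = []
    for record in records:
        # Remove the original 'sex' column.
        row = {k: v for k, v in record.items() if k != "sex"}
        # Add one indicator column per category.
        row["sex_female"] = 1 if record["sex"] == "female" else 0
        row["sex_male"] = 1 if record["sex"] == "male" else 0
        encoded.append(row)
    return encoded

passengers = [{"name": "A", "sex": "female"}, {"name": "B", "sex": "male"}]
encoded = add_sex_dummies(passengers)
```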

####Visualizing a dataset by basic plotting

  • File: plotData.py
  • Figure file: ScatterPlots.jpeg
  • Plot types: scatter plots, histograms, and boxplots

###Chapter 3: Data Wrangling

####Subsetting a dataset

  • Selecting Columns
  • File: subsetDataset.py
  • Selecting Rows
  • File: subsetDatasetRows.py
  • Selecting a combination of rows and columns
  • File: subsetColRows.py
  • Creating new columns
  • File: subsetNewCol.py

####Generating random numbers and their usage

  • Various methods for generating random numbers
  • File: generateRandomNumbers.py
  • Seeding a random number
  • File: generateRandomNumbers.py
  • Generating random numbers following probability distributions
  • File: generateRandomProbDistr.py
  • Probability density function: PDF(x) gives the relative likelihood of X taking values near x (for a discrete variable this is Prob(X=x))
  • Cumulative distribution function: CDF(x) = Prob(X<=x)
  • Uniform distribution: random variables occur with the same (uniform) frequency/probability
  • Normal distribution: the bell curve; the most ubiquitous and versatile probability distribution
  • Using the Monte-Carlo simulation to find the value of pi
  • File: calcPi.py
  • Geometry and mathematics behind the calculation of pi
  • Generating a dummy data frame
  • File: generateDummyDataFrame.py
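
The Monte-Carlo estimate of pi listed above rests on simple geometry: a quarter circle of radius 1 has area pi/4 and sits inside a unit square of area 1, so the fraction of uniformly random points that land inside the quarter circle approximates pi/4. A minimal sketch, assuming the standard-library `random` module (the function name and the seeding choice are mine, not taken from calcPi.py):

```python
import random

def estimate_pi(n_points, seed=None):
    """Estimate pi from the share of random points inside the unit quarter circle."""
    rng = random.Random(seed)  # seeding makes the run reproducible
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()  # uniform point in the unit square
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points
```

The estimate gets closer to pi as `n_points` grows, and the same seed always gives the same estimate.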

####Grouping the data – aggregation, filtering, and transformation

  • File: groupData.py
  • Grouping
  • Aggregation
  • Filtering
  • Transformation
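
The difference between aggregation, filtering, and transformation can be shown on a tiny made-up dataset. This sketch uses plain Python dictionaries rather than a pandas `groupby`, and the `rows` data and `group_by` helper are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def group_by(rows, key):
    """Grouping: collect rows into lists that share the same key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

rows = [
    {"gender": "F", "height": 160.0},
    {"gender": "F", "height": 170.0},
    {"gender": "M", "height": 180.0},
]
groups = group_by(rows, "gender")

# Aggregation: one summary value per group.
mean_height = {g: mean(r["height"] for r in members) for g, members in groups.items()}

# Filtering: keep only the groups that satisfy a group-level condition.
big_groups = {g: members for g, members in groups.items() if len(members) >= 2}

# Transformation: one value per row, computed relative to the row's own group.
centered = [r["height"] - mean_height[r["gender"]] for r in rows]
```

Aggregation returns one value per group, filtering returns a subset of groups, and transformation returns a result with the same length as the input.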
  • Miscellaneous operations

####Random sampling – splitting a dataset in training and testing datasets

  • File: splitDataTrainTest.py
  • Method 1: using the Customer Churn Model
  • Method 2: using sklearn
  • Method 3: using the shuffle function
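
A minimal sketch of the shuffle-based split (Method 3), using only the standard library. The function name, the 80/20 default, and the seed are illustrative assumptions and not the code in splitDataTrainTest.py.

```python
import random

def train_test_split_shuffle(rows, test_fraction=0.2, seed=None):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split_shuffle(range(100), test_fraction=0.2, seed=42)
```

scikit-learn offers the same idea ready-made as `train_test_split` in `sklearn.model_selection`.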

####Concatenating and appending data

  • File: concatenateAndAppend.py
  • File: appendManyFiles.py

####Merging/joining datasets

  • File: mergeJoin.py
  • Inner Join
  • Left Join
  • Right Join
  • An example of the Inner Join
  • An example of the Left Join
  • An example of the Right Join
  • Summary of Joins in terms of their length
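
The three join types can be compared on two small made-up tables. This is a plain-Python sketch under the assumption that key values are unique within each table; the `join` helper and the sample records are hypothetical and stand in for a pandas merge.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key` (key values assumed unique per side)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    else:
        raise ValueError(f"unsupported join type: {how}")
    # A side with no matching row contributes no columns to the result row.
    return [{**left_by_key.get(k, {}), **right_by_key.get(k, {})} for k in keys]

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ben"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 2, "total": 50}, {"id": 3, "total": 75}, {"id": 4, "total": 20}]

inner = join(customers, orders, "id", how="inner")
left = join(customers, orders, "id", how="left")
right = join(customers, orders, "id", how="right")
```

With unique keys, the left join has as many rows as the left table, the right join as many as the right table, and the inner join is never longer than either.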

###Chapter 4: Statistical Concepts for Predictive Modelling

####Random sampling and central limit theorem

####Hypothesis testing

  • Null versus alternate hypothesis
  • Z-statistic and t-statistic
  • Confidence intervals, significance levels, and p-values
  • Different kinds of hypothesis test
  • A step-by-step guide to do a hypothesis test
  • An example of a hypothesis test
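
As a worked illustration of the steps above, here is a sketch of a two-sided z-test for a sample mean when the population standard deviation is known. The numbers are made up, the 0.05 significance level is an assumption, and the book's own example may differ.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, pop_mean, pop_sd, n):
    """Return the z-statistic and two-sided p-value for H0: mean == pop_mean."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 100 observations with mean 52, tested against a mean of 50.
z, p = z_test_two_sided(sample_mean=52.0, pop_mean=50.0, pop_sd=10.0, n=100)
reject_null = p < 0.05  # assumed significance level
```

For small samples with an unknown population standard deviation, the t-statistic would take the place of z.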

####Chi-square testing

####Correlation

  • File: linearRegression.py
  • File: linearRegressionFunction.py
  • Picture: TVSalesCorrelationPlot.png
  • Picture: RadioSalesCorrelationPlot.png
  • Picture: NewspaperSalesCorrelationPlot.png
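
A minimal sketch of the Pearson correlation coefficient, which is the usual measure behind correlation plots like the ones listed above. The function is written from the textbook formula and is not taken from the listed files.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

The value ranges from -1 (perfect negative linear relation) to +1 (perfect positive linear relation).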

###Chapter 5: Linear Regression with Python

####Understanding the maths behind linear regression

  • Linear regression using simulated data
  • File: linearRegression.py
  • Picture: CurrentVsPredicted1.png
  • Picture: CurrentVsPredictedVsMean1.png
  • Picture: CurrentVsPredictedVsModel1.png
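
A sketch of simple linear regression on simulated data, fitted with the closed-form least-squares formulas. The true coefficients (2.0 and 0.3), the noise level, and the seed are assumptions chosen for illustration and are not necessarily the values used in linearRegression.py.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Simulated data: a known linear relation plus Gaussian noise.
rng = random.Random(0)
xs = [rng.gauss(1.5, 2.5) for _ in range(500)]
ys = [2.0 + 0.3 * x + rng.gauss(0.0, 0.5) for x in xs]
intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
```

Because the data are simulated, the estimates can be checked against the coefficients that generated them.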

####Making sense of result parameters

  • File: linearRegression.py
  • p-values
  • F-statistics
  • Residual Standard Error (RSE)
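
Two of the quantities above can be computed directly from the residuals. The sketch below uses the standard formulas RSE = sqrt(RSS / (n - p - 1)) and F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), where p is the number of predictors. The helper name and the tiny dataset are made up, and the p-value of the F-statistic is left out because it needs the F distribution.

```python
from math import sqrt

def rse_and_f(ys, predictions, n_predictors):
    """Residual standard error and F-statistic of a fitted least-squares model."""
    n = len(ys)
    mean_y = sum(ys) / n
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # unexplained variation
    tss = sum((y - mean_y) ** 2 for y in ys)                  # total variation
    dof = n - n_predictors - 1
    rse = sqrt(rss / dof)
    f_stat = ((tss - rss) / n_predictors) / (rss / dof)
    return rse, f_stat

# Hypothetical example: y observed at x = 1, 2, 3, 4 and the least-squares
# fitted values from the line y = 0.5 + 0.8 * x.
ys = [1.0, 3.0, 2.0, 4.0]
fitted = [1.3, 2.1, 2.9, 3.7]
rse, f_stat = rse_and_f(ys, fitted, n_predictors=1)
```

A smaller RSE means the fitted values sit closer to the observations, and a larger F-statistic is stronger evidence that at least one predictor matters.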

####Implementing linear regression with Python

  • File: linearRegressionSMF.py
  • Linear regression using the statsmodels library
  • Multiple linear regression
  • Multi-collinearity: highly correlated predictors lead to sub-optimal performance of the model
  • Variance Inflation Factor
  • It quantifies how much the variance of a variable's coefficient estimate is inflated by high correlation among two or more predictor variables.
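
The Variance Inflation Factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. A minimal sketch for the special case of exactly two predictors, where that R² equals the squared Pearson correlation between them (the function name and this restriction are my assumptions):

```python
def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r_squared = cov * cov / (var1 * var2)
    return 1.0 / (1.0 - r_squared)

vif_uncorrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
vif_correlated = vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5])
```

A VIF of 1 means no inflation; values well above 5 are commonly read as a sign of problematic multi-collinearity. For more than two predictors, statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.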

####Model validation

  • Training and testing data split
  • File: linearRegressionSMF.py
  • Linear regression with scikit-learn
  • File: linearRegressionSKL.py
  • Feature selection with scikit-learn
  • Recursive Feature Elimination (RFE)
  • File: linearRegressionRFE.py

####Handling other issues in linear regression

  • Handling categorical variables
  • File: linearRegressionECom.py
  • Transforming a variable to fit non-linear relations
  • File: nonlinearRegression.py
  • Picture: MPGVSHorsepower.png
  • Picture: MPGVSHorsepowerVsLine.png
  • Picture: MPGVSHorsepowerModels.png
  • Handling outliers
  • Other considerations and assumptions for linear regression

###Chapter 6: Logistic Regression with Python

####Linear regression versus logistic regression

####Understanding the math behind logistic regression

  • File: logisticRegression.py
  • Contingency tables
  • Conditional probability
  • Odds ratio
  • Moving on to logistic regression from linear regression
  • Estimation using the Maximum Likelihood Method
  • Building the logistic regression model from scratch
  • File: logisticRegressionScratch.py
  • Read above again.
  • Making sense of logistic regression parameters
  • Wald test
  • Likelihood Ratio Test statistic
  • Chi-square test
  • [x]
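
The link between odds, the odds ratio, and the logistic function can be sketched in a few lines. The contingency-table counts are made up, and the function names are my own rather than those in logisticRegression.py.

```python
from math import exp, log

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 contingency table.

    a, b: counts of outcome yes / no in group 1
    c, d: counts of outcome yes / no in group 2
    """
    return (a / b) / (c / d)

def sigmoid(z):
    """Logistic function: maps log-odds z back to a probability."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical table: 30 yes / 10 no in group 1, 20 yes / 40 no in group 2.
ratio = odds_ratio(30, 10, 20, 40)
```

Logistic regression models the log-odds as a linear function of the predictors, so applying `sigmoid` to `log(odds(p))` recovers `p`.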

####Implementing logistic regression with Python

  • File: logisticRegressionImplementation.py
  • Processing the data
  • Data exploration
  • Data visualization
  • Creating dummy variables for categorical variables
  • Feature selection
  • Implementing the model

####Model validation and evaluation

  • File: logisticRegressionImplementation.py
  • Cross validation

####Model validation

  • File: logisticRegressionImplementation.py
  • The ROC curve {see terms}
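
A sketch of how a ROC curve is built: sweep the classification threshold over the predicted scores and record the false positive rate and true positive rate at each step. The labels and scores are made up, and the helpers are illustrative rather than the code in logisticRegressionImplementation.py.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical predicted probabilities
curve = roc_points(labels, scores)
area = auc(curve)
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.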

###Chapter 7: Clustering with Python

####Introduction to clustering – what, why, and how?

  • What is clustering?
  • How is clustering used?
  • Why do we do clustering?

####Mathematics behind clustering

  • Distances between two observations
  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • The distance matrix
  • Normalizing the distances
  • Linkage methods
  • Single linkage
  • Complete linkage
  • Average linkage
  • Centroid linkage
  • Ward's method: uses an ANOVA-style criterion (merges the clusters that give the smallest increase in within-cluster variance)
  • Hierarchical clustering
  • K-means clustering
  • File: kMeanClustering.py
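
The three distance measures listed above are closely related: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal standard-library sketch (the function names are mine):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)

def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)

def distance_matrix(points, distance=euclidean):
    """Pairwise distances between all observations."""
    return [[distance(p, q) for q in points] for p in points]
```

The distance matrix is symmetric with zeros on the diagonal, and it is the input that linkage methods work from.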

####Implementing clustering using Python

  • File: clusterWine.py
  • Importing and exploring the dataset
  • Normalizing the values in the dataset
  • Hierarchical clustering using scikit-learn
  • K-Means clustering using scikit-learn
  • Interpreting the cluster

####Fine-tuning the clustering

  • The elbow method
  • Silhouette Coefficient
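
The silhouette coefficient of a point compares a, its mean distance to the other points in its own cluster, with b, its mean distance to the points of the nearest other cluster: s = (b - a) / max(a, b). The sketch below handles one-dimensional points only and assumes at least two clusters with at least two points each; the data and labels are made up.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points (every cluster needs >= 2 points)."""
    def mean_dist(i, members):
        return sum(abs(points[i] - points[j]) for j in members) / len(members)

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        same = [j for j in clusters[label] if j != i]
        a = mean_dist(i, same)  # cohesion within the point's own cluster
        b = min(mean_dist(i, members)  # separation from the nearest other cluster
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [1.0, 2.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately poor grouping
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest points assigned to the wrong cluster. scikit-learn provides a general version as `silhouette_score` in `sklearn.metrics`.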

###Chapter 8: Trees and Random Forests with Python

####Introducing decision trees

  • A decision tree

####Understanding the mathematics behind decision trees

  • Homogeneity
  • Entropy
  • Information gain
  • ID3 algorithm to create a decision tree
  • Gini index
  • Reduction in Variance
  • Pruning a tree
  • Handling a continuous numerical variable
  • Handling a missing value of an attribute
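
Entropy, information gain, and the Gini index from the list above can be sketched with the standard library. Here "Gini index" is taken to mean Gini impurity, which is an assumption about the book's usage, and the helper names are my own.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the class labels, in bits (0 for a pure node)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance that two random draws from the node differ in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["y", "y", "n", "n"]
perfect_split = [["y", "y"], ["n", "n"]]
useless_split = [["y", "n"], ["y", "n"]]
```

ID3 picks, at each node, the attribute whose split gives the highest information gain; a perfectly homogeneous split removes all entropy, while a useless one removes none.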

####Implementing a decision tree with scikit-learn

  • File: decisionTreeIris.py
  • Visualizing the tree
  • Picture: dtree2.png
  • File: dtree2.dot
  • Cross-validating and pruning the decision tree

####Understanding and implementing regression trees

  • File: regressionTree.py
  • Regression tree algorithm
  • Implementing a regression tree using Python

####Understanding and implementing random forests

  • File: randomForest.py
  • The random forest algorithm
  • Implementing a random forest using Python
  • Why do random forests work?
  • Important parameters for random forests

###Chapter 9: Best Practices for Predictive Modelling

####Best practices for coding

  • Commenting the code
  • Defining functions for substantial individual tasks
  • Example 1
  • Example 2
  • Example 3
  • Avoid hard-coding of variables as much as possible
  • Version control
  • Using standard libraries, methods, and formulas

####Best practices for data handling

####Best practices for algorithms

####Best practices for statistics

####Best practices for business contexts