Credit Risk Analysis with Machine Learning
The purpose of this analysis is to build several machine learning models and algorithms to predict credit risk for loan applications. Once complete, the analysis should make approving or denying loan applications more efficient and accurate, and help lower default rates. I will use Python and the Scikit-learn library with several machine learning models, comparing them to determine how well each one classifies and predicts the data.
In this project, I am utilizing the following models and algorithms to find the best prediction model for credit risk analysis:
- Oversampling models: the RandomOverSampler and SMOTE algorithms.
- Undersampling model: the ClusterCentroids algorithm.
- Combination model: the SMOTEENN algorithm, which combines the SMOTE and Edited Nearest Neighbors (ENN) algorithms.
- Ensemble models for comparison: BalancedRandomForestClassifier and EasyEnsembleClassifier.
After performing exploratory data analysis on the dataset with Pandas and NumPy, I use the imbalanced-learn and scikit-learn libraries to evaluate three machine learning models that rely on resampling, to determine which is better at predicting credit risk. I start the analysis with the RandomOverSampler and SMOTE oversampling algorithms, and then use the undersampling ClusterCentroids algorithm. With each algorithm, I resample the dataset, view the count of the target classes, train a logistic regression classifier, and finally compare the models to determine the best fit for this analysis.
Note: A random state of 1 is used for each sampling algorithm to ensure consistency between tests.
In this section, the following metrics will be provided in order to discover which algorithm results in the better performance: the naive random oversampling algorithm or the SMOTE algorithm.
- Calculate the balanced accuracy score from `sklearn.metrics`.
- Calculate the confusion matrix from `sklearn.metrics`.
- Generate a classification report using `classification_report_imbalanced` from imbalanced-learn.
In this section, the following metrics will be provided in order to discover which algorithm results in the better performance: ClusterCentroids undersampling or SMOTEENN.
- Calculate the balanced accuracy score from `sklearn.metrics`.
- Calculate the confusion matrix from `sklearn.metrics`.
- Generate a classification report using `classification_report_imbalanced` from imbalanced-learn.
In this section, the following metrics will be provided in order to discover which algorithm results in the better performance: the Balanced Random Forest Classifier or the Easy Ensemble AdaBoost Classifier.
- Calculate the balanced accuracy score from `sklearn.metrics`.
- Calculate the confusion matrix from `sklearn.metrics`.
- Generate a classification report using `classification_report_imbalanced` from imbalanced-learn.
Before moving forward with a summary report, I would like to point out a few reminders regarding the following metrics:
- Classifying a single point can result in a true positive (truth = 1, guess = 1), a true negative (truth = 0, guess = 0), a false positive (truth = 0, guess = 1), or a false negative (truth = 1, guess = 0).
- Accuracy measures how many classifications your algorithm got correct out of every classification it made.
- Recall measures the percentage of the relevant items your classifier was able to successfully find.
- Precision measures the percentage of items your classifier found that were relevant.
- Precision and recall typically trade off against each other: as one goes up, the other tends to go down.
- F1 score is a combination of precision and recall.
- F1 score will be low if either precision or recall is low.
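To make these definitions concrete, here is a small worked example (the label vectors are made up for illustration) that computes each metric directly from the four confusion-matrix counts:

```python
# Worked example: derive accuracy, precision, recall, and F1 from the
# confusion-matrix counts, using small hand-made label vectors.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)                   # correct / total
precision = tp / (tp + fp)                           # found items that were relevant
recall = tp / (tp + fn)                              # relevant items that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy} precision={precision} recall={recall} f1={f1}")
```

Because F1 is a harmonic mean, it stays close to the smaller of precision and recall, which is why a single weak metric drags the F1 score down.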
I have created six different models and algorithms to discover the optimum way to predict credit risk. Comparing the balanced accuracy scores for each model is a quick and efficient way to decide which model performs best on this data set. We can observe that the Easy Ensemble AdaBoost Classifier performed significantly better than the other models, with a balanced accuracy score of around 92%. However, despite the high accuracy score, the F1 score is relatively low at 0.16. Also, the low precision would cause low-risk loans to be falsely labeled as high risk. Making wrong decisions on loan applications would decrease the revenue and trustworthiness of the bank.
Therefore, I would reject using these algorithms as the decision mechanism for predicting credit risk on loan applications. My recommendation is to build a larger-scale dataset with more features, and to select those features more carefully, to improve the precision and F1 scores and thereby the confidence in our machine learning models.
- Data Source: LoanStats_2019Q1.csv
- Software/Languages: Jupyter Notebook- Google Colab, Python
- Libraries: Scikit-learn, imbalanced-learn, Pandas, NumPy