
Credit Risk Analysis

Credit Risk Analysis with Machine Learning

Overview of the Analysis

The purpose of this analysis is to build several machine learning models and algorithms to predict credit risk for loan applications. With a reliable model in place, approving or denying loan applications becomes more efficient and accurate, and default rates drop. I will utilize Python with the Scikit-learn and imbalanced-learn libraries to train several machine learning models and compare how well each one classifies and predicts the data.

Results

In this project, I am utilizing the following models and algorithms to find the best prediction model for credit risk analysis:

  • Oversampling models: the RandomOverSampler and SMOTE algorithms.
  • Undersampling model: the ClusterCentroids algorithm.
  • Combination model: the SMOTEENN algorithm, which combines the SMOTE and Edited Nearest Neighbors (ENN) algorithms.
  • Ensemble models: the BalancedRandomForestClassifier and EasyEnsembleClassifier algorithms.

After exploratory data analysis of the dataset with Pandas and NumPy, I use the imbalanced-learn and scikit-learn libraries to evaluate machine learning models that rely on resampling and determine which is better at predicting credit risk. I will start the analysis with the RandomOverSampler and SMOTE oversampling algorithms, then move to the undersampling ClusterCentroids algorithm. With each algorithm, I resample the dataset, view the count of the target classes, train a logistic regression classifier, and finally compare the models to determine which one best fits this analysis.

Note: A random state of 1 is used for each sampling algorithm to ensure consistency between tests.
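The resample-then-train workflow looks roughly like the sketch below. It is a minimal sketch, assuming the features X and the binary high-risk/low-risk target y have already been extracted from the dataset; the variable names are illustrative, not taken from the original notebook.

```python
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# X and y are assumed to hold the cleaned features and the binary
# high-risk/low-risk target extracted from the LoanStats dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Oversample the minority (high-risk) class until the classes are balanced.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # both classes now have equal counts

# Train a logistic regression classifier on the resampled data.
model = LogisticRegression(solver='lbfgs', random_state=1)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
```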

Naive Random Oversampling VS SMOTE Oversampling:

In this section, the following metrics are provided to determine which algorithm performs better: the naive random oversampling algorithm or the SMOTE algorithm. A sketch of the evaluation steps appears after the list.

  1. Calculate the balanced accuracy score from sklearn.metrics.
  2. Calculate the confusion matrix from sklearn.metrics.
  3. Generate a classification report using classification_report_imbalanced from imbalanced-learn.
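A sketch of those three steps, assuming y_test and y_pred come from the fitted model in the previous snippet:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from imblearn.metrics import classification_report_imbalanced

# y_test and y_pred are assumed to come from the sketch above.
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```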
Naive Random Oversampling:

  • Balanced accuracy score: 0.64
  • Precision
    • High risk: 0.01
    • Low risk: 1.00
  • Recall
    • High risk: 0.66
    • Low risk: 0.62

SMOTE Oversampling:

  • Balanced accuracy score: 0.65
  • Precision
    • High risk: 0.01
    • Low risk: 1.00
  • Recall
    • High risk: 0.61
    • Low risk: 0.69
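Only the resampler changes between the two models; a sketch of the SMOTE variant, reusing the training split from above:

```python
from imblearn.over_sampling import SMOTE

# Swap RandomOverSampler for SMOTE; the rest of the workflow is unchanged.
X_resampled, y_resampled = SMOTE(random_state=1).fit_resample(X_train, y_train)
```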

Cluster Centroids Undersampling VS SMOTEENN:

In this section, the following metrics are provided to determine which algorithm performs better: Cluster Centroids undersampling or SMOTEENN.

  1. Calculate the balanced accuracy score from sklearn.metrics.
  2. Calculate the confusion matrix from sklearn.metrics.
  3. Generate a classification report using classification_report_imbalanced from imbalanced-learn.
Cluster Centroids Undersampling:

  • Balanced accuracy score: 0.54
  • Precision
    • High risk: 0.01
    • Low risk: 1.00
  • Recall
    • High risk: 0.69
    • Low risk: 0.40

SMOTEENN Combination (Over and Under) Sampling:

  • Balanced accuracy score: 0.64
  • Precision
    • High risk: 0.01
    • Low risk: 1.00
  • Recall
    • High risk: 0.71
    • Low risk: 0.57
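Both resamplers slot into the same workflow as the oversampling models; a sketch, again assuming the X_train and y_train split from above:

```python
from imblearn.under_sampling import ClusterCentroids
from imblearn.combine import SMOTEENN

# Undersample the majority (low-risk) class down to synthetic cluster centroids.
cc = ClusterCentroids(random_state=1)
X_cc, y_cc = cc.fit_resample(X_train, y_train)

# Combine SMOTE oversampling with Edited Nearest Neighbors cleaning.
smote_enn = SMOTEENN(random_state=1)
X_se, y_se = smote_enn.fit_resample(X_train, y_train)
```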

Balanced Random Forest Classifier VS Easy Ensemble AdaBoost Classifier:

In this section, the following metrics are provided to determine which algorithm performs better: the Balanced Random Forest Classifier or the Easy Ensemble AdaBoost Classifier.

  1. Calculate the balanced accuracy score from sklearn.metrics.
  2. Calculate the confusion matrix from sklearn.metrics.
  3. Generate a classification report using classification_report_imbalanced from imbalanced-learn.
Balanced Random Forest Classifier:

  • Balanced accuracy score: 0.79
  • Precision
    • High risk: 0.03
    • Low risk: 1.00
  • Recall
    • High risk: 0.70
    • Low risk: 0.87

Easy Ensemble AdaBoost Classifier:

  • Balanced accuracy score: 0.93
  • Precision
    • High risk: 0.09
    • Low risk: 1.00
  • Recall
    • High risk: 0.92
    • Low risk: 0.94
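Unlike the resampling models, these two ensembles handle the class imbalance internally, so they train on the original split. A sketch; n_estimators=100 is an assumption, not a value from the original notebook:

```python
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Each tree in the balanced random forest sees a balanced bootstrap sample.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)

# Easy Ensemble trains AdaBoost learners on balanced random subsets.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
```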

Summary

Before moving forward with a summary report, I would like to point out a few reminders regarding the following metrics:

  1. Classifying a single point can result in a true positive (truth = 1, guess = 1), a true negative (truth = 0, guess = 0), a false positive (truth = 0, guess = 1), or a false negative (truth = 1, guess = 0).
  2. Accuracy measures how many classifications your algorithm got correct out of every classification it made.
  3. Recall measures the percentage of the relevant items your classifier was able to successfully find.
  4. Precision measures the percentage of items your classifier found that were relevant.
  5. Precision and recall typically trade off against each other: as one goes up, the other tends to go down.
  6. F1 score is a combination of precision and recall.
  7. F1 score will be low if either precision or recall is low.
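As a worked check of points 6 and 7, the high-risk F1 score for the Easy Ensemble model follows directly from the precision (0.09) and recall (0.92) reported above:

```python
# F1 = 2 * (precision * recall) / (precision + recall)
precision, recall = 0.09, 0.92
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.16 -- low, because high-risk precision is so low
```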

I have created six different models and algorithms to discover the optimum way to predict credit risk. Comparing accuracy scores is a quick and efficient way to decide which model performs better on this dataset. The Easy Ensemble AdaBoost Classifier performed significantly better than the other models, with a balanced accuracy score of about 93%. However, despite the high accuracy score, its high-risk F1 score is relatively low at 0.16, and its low precision means many low-risk loans would be falsely labeled as high risk. Making wrong decisions on loan applications would decrease the revenue and trustworthiness of the bank. Therefore, I would not use these algorithms as the decision mechanism for predicting credit risk on loan applications. My recommendation is to build a larger dataset with more features and better feature selection to improve the precision and F1 scores, which would increase confidence in our machine learning models.

Resources

  • Data Source: LoanStats_2019Q1.csv
  • Software/Languages: Jupyter Notebook (Google Colab), Python
  • Libraries: Scikit-learn, imbalanced-learn, Pandas, NumPy
