💰 FinTech Project

As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to deliver a data science solution for the Saudi market. The project had to solve a real-world problem, have an impact, and use Saudi datasets.

logo_

Table of Contents
  1. Project Overview
  2. Business Objective
  3. Dataset Overview
  4. Preprocessing Overview
  5. Visualization
  6. Modeling Results
  7. Contributing Members Contact
  8. Acknowledgments

Project Overview

This section outlines the project's structure and files for easier navigation. Some notebooks and datasets could not be uploaded, either to protect the company's confidentiality or because of size limits:

├── README.md
├── CapestoneProject_Dashboard_Desert_Ninjas.pdf
├── CapstoneProject_Presentation_Desert_Ninjas.pdf
├── Notebooks
│   ├── CapstoneProject_Pre_Preprocessing_Notebook_ComanyNameEncryption.ipynb 
│   ├── CapstoneProject_Preprocessing_Notebook_Desert_Ninjas.ipynb
│   ├── CapstoneProject_EDA_Notebook_Desert_Ninjas.ipynb
│   └── CapstoneProject_ML_Notebook_Desert_Ninjas.ipynb
└── Datasets
    ├── Encrypted_full_dataset.csv (output of the pre-preprocessing notebook) 
    ├── Encrypted_exported_raw_data.csv (output of the pre-preprocessing notebook) 
    ├── Preprocessed_full_dataset.csv (output of the preprocessing notebook) 
    └── Final_extracted_dataset.csv (used for the EDA, Dashboard, and Machine Learning models)

Note: We started with two datasets that have different schemas (Encrypted_full_dataset and Encrypted_exported_raw_data).

(back to top)

Business Objective

The purpose of this project is to predict potential customers for a FinTech startup using its visitors' activity logs. These potential investors can then be targeted with marketing strategies.

Methods Used

  • Preprocessing raw data
  • Feature Engineering
  • Feature Selection
  • Labeling and classifying the data
  • Exploratory Data Analysis
  • Data Visualization
  • Machine Learning
  • Oversampling
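As a rough illustration of the oversampling step, here is a minimal sketch of random oversampling in plain Python. It is not taken from the project notebooks: the README does not say which oversampling technique was used, and the toy rows, the labels, and the `random_oversample` helper are all hypothetical. The imbalanced-learn library listed under Technologies provides ready-made samplers such as `RandomOverSampler` and `SMOTE`.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the original data and append sampled duplicates.
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows belonging to this class, used as the pool to sample from.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: 5 visitors and 2 investors, one feature each.
rows = [[12.0], [8.5], [3.1], [20.4], [7.7], [95.0], [110.2]]
labels = ["visitor"] * 5 + ["investor"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not end up in the data used for evaluation.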

Technologies

  • Python, Jupyter
  • Pandas
  • Plotly
  • Sklearn
  • Imbalanced-learn
  • Power BI

(back to top)

Dataset Overview

A FinTech startup, referred to here as X, wants to understand its customers' behavior and whether they are going to invest, based on their activity logs. However, the problem is challenging because we lack the following to support our analysis:

  • The number of visitors to the website
  • The demographics of these visitors

The analysis will help the company create a new marketing strategy for attracting more customers, increasing its revenues, and learning the patterns of customers who reach the investment pages but do not commit to the full transaction. Lucky for the FinTech company, we say, challenge accepted!

At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:

  1. What kind of data does their website collect from users?
  2. Which path do users usually visit, and how much time do they spend on it?
  3. Does the average time spent on a page differ based on the user type?
  4. Which path has the maximum time? Is this the path that leads to a successful transaction (investment)?

We hope to answer all of these questions in our analysis.
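The questions about path popularity and time on page come down to simple aggregations over the activity logs. The following is a minimal sketch in plain Python, assuming each log record carries a user type, a path, and the seconds spent on the page. The record layout, the sample values, and the `summarize` helper are hypothetical and do not reflect the company's actual schema.

```python
from collections import defaultdict

# Hypothetical activity-log records: (user_type, path, seconds_on_page).
logs = [
    ("visitor", "/home", 10.0),
    ("visitor", "/home", 20.0),
    ("visitor", "/home", 12.0),
    ("visitor", "/invest", 30.0),
    ("visitor", "/about", 15.0),
    ("investor", "/home", 5.0),
    ("investor", "/invest", 90.0),
    ("investor", "/invest", 70.0),
]


def summarize(logs):
    """Return visit counts per path and mean seconds per (user_type, path)."""
    visits = defaultdict(int)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_type, path, seconds in logs:
        visits[path] += 1
        totals[(user_type, path)] += seconds
        counts[(user_type, path)] += 1
    means = {key: totals[key] / counts[key] for key in totals}
    return dict(visits), means


visits, mean_seconds = summarize(logs)

# Question 2: the most frequently visited path.
most_visited = max(visits, key=visits.get)
# Questions 3 and 4: the (user type, path) pair with the highest mean time.
longest = max(mean_seconds, key=mean_seconds.get)

print(most_visited, longest, mean_seconds[longest])
```

With pandas, the same summary would typically be a `groupby` over the user type and path columns followed by a mean.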

(back to top)

Preprocessing Overview

Preprocessing is the core of this project. This README gives an overview of each step; for a more detailed description, visit our Medium Blog Post.

The dataset before and after the preprocessing:

image

Preprocessing steps:

prep1

Feature engineering steps:

feateng

Features before removing data leakage:

fs1

Selecting the features after removing the data leakage:

fs2

(back to top)

Visualization

Based on our EDA, we found that 80% of our users are regular visitors, while only 17% are investors. We therefore created two dashboards, one for each of these user types.

Visitors dashboard:

dash1

Investors dashboard:

dash2

As mentioned above, you can visit our Medium blog post for a detailed analysis of the project.

(back to top)

Modeling Results

All of these models were evaluated in order to choose the best one.

image

Because our dataset is imbalanced, we use recall as our evaluation metric. We also want to focus on identifying the potential customers class, so we chose XGBoost, the model that identified this class best relative to our baseline.
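To see why recall is a more informative metric than accuracy on imbalanced data, consider the minimal sketch below. The labels and both sets of predictions are made up for illustration and are not the project's results; the `recall` and `accuracy` helpers are hypothetical stand-ins for library functions.

```python
def recall(y_true, y_pred, positive):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 1 = potential customer (minority), 0 = everyone else.
y_true = [0] * 8 + [1] * 2

# A model that always predicts the majority class.
always_majority = [0] * 10
# A model that finds both potential customers at the cost of two false alarms.
finds_customers = [0] * 6 + [1] * 4

for name, y_pred in [("always_majority", always_majority),
                     ("finds_customers", finds_customers)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred, positive=1))
```

Both sets of predictions reach the same accuracy, but only the second one recovers any potential customers, which is what recall on that class captures. In scikit-learn, the equivalent computation is available as `sklearn.metrics.recall_score`.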

XGBoost results:

image

Baseline Distribution:

image

(back to top)

Contributing Members Contact

Team Leader: Reema Alaswad (Reema's LinkedIn)

Other Members:

| Name | LinkedIn |
| --- | --- |
| Raghad Aleisa | Raghad's LinkedIn |
| AlJohara Alkanhal | AlJohara's LinkedIn |
| Maha AlHazzani | Maha's LinkedIn |
| Eman Aldosari | Eman's LinkedIn |

(back to top)

Acknowledgments

(back to top)