Data Mining Projects

Table of Contents

Introduction

This repository contains the projects for a Data Mining course. There are four projects:

  1. The first project covers preprocessing with pandas and scikit-learn, using the iris dataset.

  2. The second project builds neural networks and trains them on two datasets: make_circles and fashion_mnist.

  3. The third project covers association rules and clustering, as two separate sub-projects.

  4. The final project trains a model on a dataset with more than 70k records to predict whether a person has diabetes. It uses XGBoost, a gradient-boosted decision tree library.

💻 First Project

This project covers preprocessing with pandas and scikit-learn. The dataset is the iris dataset, which can be downloaded from this link. I apply the following preprocessing steps:

  1. Handle missing values: find NaN values and either fill them with suitable values or remove them.
  2. Convert categorical features to numerical features with label encoding and one-hot encoding.
  3. Normalize the data frame with StandardScaler.
  4. Reduce dimensionality with PCA.
  5. Visualize the result.
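The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: it assumes the copy of iris bundled with scikit-learn instead of the linked file, injects a few NaN values artificially so there is something to impute, and picks mean imputation as one possible fill strategy. Plotting is left out; `components` holds the two columns you would scatter-plot.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Assumption: use the iris copy shipped with scikit-learn.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df["species"] = iris.target_names[iris.target.to_numpy()]
feature_cols = list(iris.feature_names)

# Illustration only: blank out a few values so the imputation step has work to do.
rng = np.random.default_rng(0)
rows = rng.choice(len(df), size=10, replace=False)
df.loc[rows, "sepal length (cm)"] = np.nan

# 1. Missing values: count NaNs per column, then fill with the column mean.
print(df.isna().sum())
df[feature_cols] = SimpleImputer(strategy="mean").fit_transform(df[feature_cols])

# 2. Categorical to numerical: label encoding and one-hot encoding.
df["species_label"] = LabelEncoder().fit_transform(df["species"])
one_hot = pd.get_dummies(df["species"], prefix="species")

# 3. Standardize the numeric features (zero mean, unit variance).
scaled = StandardScaler().fit_transform(df[feature_cols])

# 4. Reduce the four features to two principal components.
components = PCA(n_components=2).fit_transform(scaled)
print(components.shape)
```

Label encoding gives one integer column, while one-hot encoding gives one column per class; which one fits depends on whether the downstream model would misread the integers as ordered.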

The visualization of the final result is:


You can access the project and its code through this link.

💻 Second Project

This project builds neural networks and trains them on two datasets: make_circles and fashion_mnist.

For the first dataset (make_circles), I follow these steps:

  1. Generate 1000 samples with make_circles.
  2. Split the data into train and test sets.
  3. Create a neural network with two hidden layers.
  4. Train the model.
  5. Plot loss and accuracy.

For activation functions, I used relu for the hidden layers and sigmoid for the output layer, with binary_crossentropy as the loss function.
You can access the code for this section through this link.
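A network of this shape might look like the sketch below, written with Keras. The layer widths (16 units), the noise level, the 80/20 split, the epoch count, and the adam optimizer are assumptions made for illustration; only the two hidden relu layers, the sigmoid output, and the binary_crossentropy loss come from the description above.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from tensorflow import keras

# 1000 two-dimensional points arranged in two concentric circles.
X, y = make_circles(n_samples=1000, noise=0.03, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation="relu"),    # hidden layer 1 (width assumed)
    keras.layers.Dense(16, activation="relu"),    # hidden layer 2 (width assumed)
    keras.layers.Dense(1, activation="sigmoid"),  # probability of the inner circle
])
# Assumption: adam as optimizer; the text does not name one for this dataset.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=20,
    verbose=0,
)

# history.history holds per-epoch "loss", "accuracy", "val_loss", "val_accuracy",
# which are the curves to plot.
print(sorted(history.history))
```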

For the second dataset (fashion_mnist), I follow these steps:

  1. Load the dataset.
  2. Split the data into train and test sets.
  3. Create a convolutional neural network with two hidden layers.
  4. Train the model.
  5. Plot loss and accuracy.
  6. Print the confusion_matrix and classification_report.

For activation functions, I used relu for the hidden layers and softmax for the output layer, with categorical_crossentropy as the loss function and adam as the optimizer.
You can access the code for this section through this link.
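One way to put these pieces together in Keras is sketched below. The helper names `build_cnn`, `prepare`, and `run` are hypothetical, and the filter counts, kernel size, and pooling layers are assumptions; the two relu convolutional layers, the softmax output, categorical_crossentropy, and adam follow the description above. Because categorical_crossentropy expects one-hot targets, the integer labels are converted with `to_categorical`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow import keras


def build_cnn(num_classes=10):
    """Small CNN with two convolutional hidden layers (sizes are assumptions)."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def prepare(images, labels, num_classes=10):
    """Scale pixels to [0, 1], add a channel axis, and one-hot encode the labels."""
    x = images.astype("float32")[..., np.newaxis] / 255.0
    y = keras.utils.to_categorical(labels, num_classes)
    return x, y


def run(epochs=5):
    """Download fashion_mnist, train, and print the evaluation reports."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
    x_train, y_train_cat = prepare(x_train, y_train)
    x_test, y_test_cat = prepare(x_test, y_test)

    model = build_cnn()
    history = model.fit(
        x_train, y_train_cat,
        validation_data=(x_test, y_test_cat),
        epochs=epochs,
    )

    predicted = model.predict(x_test).argmax(axis=1)
    print(confusion_matrix(y_test, predicted))
    print(classification_report(y_test, predicted))
    return history
```

Calling `run()` downloads the dataset on first use; the returned history object carries the loss and accuracy curves for plotting.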

💻 Third Project

This project covers association rules and clustering, as two separate sub-projects.

For the clustering project, I did these tasks:

  1. Working with KMeans from sklearn.cluster and plotting the results.
  2. Determining a suitable number of clusters with two methods: elbow and PCA.
  3. Working with complex datasets and clustering them.
  4. Clustering the load_digits dataset.
  5. Dimension reduction of an image.
  6. Running the DBSCAN algorithm on two datasets.
  7. Determining suitable values for MinPts and epsilon.
  8. Plotting and comparing the results.

You can access the code for this section through this link.
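Two of the tasks above, the elbow method for KMeans and choosing MinPts and epsilon for DBSCAN, can be illustrated with the sketch below. The synthetic datasets (`make_blobs`, `make_moons`), the range of k, and the values `min_pts = 5` and `eps = 0.2` are assumptions chosen for the example, not the values used in the project. The k-distance curve shown here is a common heuristic for picking epsilon.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.neighbors import NearestNeighbors

# --- Elbow method: record KMeans inertia for a range of k ---
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances
# Plot inertias against k and look for the bend where the curve flattens.

# --- DBSCAN: pick epsilon from the sorted k-distance curve ---
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = 5  # assumed value for MinPts

# Distance from each point to its min_pts-th nearest neighbour
# (the point itself counts as the first neighbour here).
neighbors = NearestNeighbors(n_neighbors=min_pts).fit(X_moons)
distances, _ = neighbors.kneighbors(X_moons)
k_distances = np.sort(distances[:, -1])
# Plot k_distances; the knee of the curve suggests a value for epsilon.

eps = 0.2  # assumed value, to be read off the k-distance plot in practice
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X_moons)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```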

For the association rules project, I did these tasks:

  1. Working with the Apriori algorithm.
  2. Load this dataset, preprocess it, and create a DataFrame.
  3. Find the frequent_itemsets and print them.
  4. Extract the association_rules.

You can access the code for this section through this link.
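To show what Apriori does under the hood, here is a small plain-Python sketch. It is not the library code used in the project: the functions `apriori` and `association_rules` below are hypothetical helpers written for this example, and the transactions are made-up toy data.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


# Toy transactions, invented for the example.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

frequent_itemsets = apriori(transactions, min_support=0.6)
rules = association_rules(frequent_itemsets, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules:
    print(set(antecedent), "->", set(consequent), f"support={sup:.2f} confidence={conf:.2f}")
```

The prune step is what makes Apriori cheaper than brute force: an itemset can only be frequent if all of its subsets are, so most large candidates are discarded before their support is ever counted.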

💻 Fourth Project

Finally, in this project I train a model on a dataset with more than 70k records, which you can download from this link, to predict whether a person has diabetes. The project uses XGBoost, a gradient-boosted decision tree library.
Each record has 21 features, and the model uses these 21 features to make the prediction.

To do this, I carried out these tasks in order:

  1. Preprocess the data (load the dataset, rename columns, fill null values with the mode, normalize with Min-Max scaling, convert categorical features to numerical features with OneHotEncoding, separate the label column).
  2. Build the model (split train and test data, create an XGBClassifier, train the model, print accuracy, plot the confusion_matrix, plot a tree, print precision and recall).
  3. Tune parameters with GridSearchCV and determine the best parameters.
  4. Plot metric changes.

You can access the code for this section through this link.

Technologies

The projects are created with:

  • Python version: 3.8