Stroke Prediction Model

This repository contains code for a stroke prediction model based on the "Stroke Prediction" dataset. The dataset includes features describing individuals' health and lifestyle, along with whether they have experienced a stroke. The aim is to develop a model that predicts the likelihood of a stroke from these features.

Dataset

https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

According to the World Health Organization, stroke is the second leading cause of death worldwide, responsible for approximately 11% of deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

Dependencies

The following dependencies are required to run the code in this repository:

Python
pandas
seaborn
SciPy
NumPy
Matplotlib
scikit-learn
imbalanced-learn
XGBoost
LightGBM
CatBoost

Code Structure

The code is organized into the following sections:

Data Preprocessing: In this step, the code reads and combines the training and testing data, handles missing values, encodes categorical variables, and adds additional features.
Feature Selection: The importance of features is determined using a Random Forest Classifier. The code selects the top-k features for further analysis.
Model Training: Two models, Logistic Regression and XGBoost, are trained on the selected features.
Model Evaluation: The trained models are evaluated on the validation set using accuracy, precision, recall, F1-score, ROC AUC score, and PR AUC score, and the code displays the results.
Model Testing: The final trained models, Logistic Regression and XGBoost, are used to make predictions on the test data. The code stores the predictions in a CSV file named "results_ronaldinho.csv".
Ensemble Model: An ensemble model is implemented using a voting classifier. It combines the trained Logistic Regression, XGBoost, and LightGBM models. The code evaluates and displays the performance of the ensemble model.
CatBoost Classifier: The code trains and evaluates a CatBoost Classifier on the validation set.
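
The feature selection and ensemble steps above can be sketched roughly as follows. This is a minimal illustration, not the code in stroke_prediction.py: the data is synthetic, the value of TOP_K is hypothetical, class_weight="balanced" is an assumed way of handling the class imbalance, and scikit-learn's GradientBoostingClassifier stands in for XGBoost and LightGBM so the example needs only scikit-learn and NumPy. Soft voting is also an assumption, since the description does not say which voting mode is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the stroke data (roughly 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Rank features by Random Forest importance and keep the top k.
TOP_K = 6  # hypothetical value, chosen only for this example
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
top_idx = np.argsort(forest.feature_importances_)[::-1][:TOP_K]
X_train_sel, X_val_sel = X_train[:, top_idx], X_val[:, top_idx]

# Soft-voting ensemble over a linear model and a boosted-tree model.
# GradientBoostingClassifier is a stand-in for XGBoost / LightGBM.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train_sel, y_train)

# Score the validation set with the positive-class probability.
val_scores = ensemble.predict_proba(X_val_sel)[:, 1]
roc_auc = roc_auc_score(y_val, val_scores)
pr_auc = average_precision_score(y_val, val_scores)
print(f"ROC AUC: {roc_auc:.3f}  PR AUC: {pr_auc:.3f}")
```

Selecting features on the training split only, as done here, keeps the validation scores from being influenced by the selection step.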

Results

Metric	Score
Accuracy	0.8222
Precision	0.9562
Recall	0.7628
F1-score	0.8745
ROC AUC score	0.8769
PR AUC score	0.1875
Kaggle Submission	0.814
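
For reference, the threshold-based metrics in the table can be computed from predicted labels as in the sketch below. It uses the standard binary definitions with the positive class labeled 1. The table does not state which model or which averaging scheme produced the reported scores, so this is an illustration of the definitions rather than a reproduction of those numbers. The function name binary_metrics and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

ROC AUC and PR AUC are computed from predicted probabilities rather than hard labels, so they are not covered by this function.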

Authors

Fatih Emir Güler
Mert Erbak
Mohammed Ammar Salahie
