This project uses real-world and synthetic datasets to predict stroke events from clinical features. The aim is to identify the most important risk factors for stroke by investigating parameters such as gender, age, hypertension, heart disease, and lifestyle choices.
To install requirements:
```
pip install -r requirements.txt
```
Dataset:

Stroke Prediction Dataset from Kaggle:

- Kaggle Dataset 1
- Kaggle Dataset 2
- id: Patient ID
- gender: "Male", "Female" or "Other"
- age: patient age
- hypertension: 0 if the patient does not have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient does not have heart disease, 1 if the patient has heart disease
- ever_married: "No" or "Yes"
- work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
- stroke: 1 if the patient had a stroke, 0 if not (the prediction target)
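As a starting point, here is a minimal sketch of loading and inspecting the data with pandas. The file name below is an assumption based on the usual Kaggle download; adjust it to your local copy:

```python
import pandas as pd

# Assumed file name for the Kaggle stroke dataset; adjust to your download.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

print(df.shape)                      # rows x columns
print(df.isna().sum())               # missing values per column
print(df["stroke"].value_counts())   # class balance of the target
```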
In this project, we perform data cleaning to ensure the dataset is ready for analysis.
Missing values in the `bmi` column were filled with mean values calculated separately for patients with and without strokes.
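A minimal sketch of this group-wise imputation, assuming the dataframe `df` loaded above:

```python
# Fill missing BMI values with the mean BMI computed separately for
# stroke (stroke == 1) and non-stroke (stroke == 0) patients.
df["bmi"] = df.groupby("stroke")["bmi"].transform(lambda s: s.fillna(s.mean()))
```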
To make categorical variables usable in further analysis, they were encoded into numerical form with pandas' `factorize` function, which prepares the dataset for modelling.
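For example (a sketch; the column list mirrors the attributes above, and `df` is the dataframe from the loading step):

```python
import pandas as pd

# Encode each categorical column as integer codes with pandas.factorize.
categorical_cols = ["gender", "ever_married", "work_type",
                    "Residence_type", "smoking_status"]
for col in categorical_cols:
    df[col], _ = pd.factorize(df[col])
```

Note that `factorize` assigns arbitrary integer codes; this is harmless for tree-based models, but it imposes an artificial ordering that linear models such as LogisticRegression or SVC may misinterpret.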
We used six different machine learning models: five classifiers from the scikit-learn library, plus XGBoost from the `xgboost` package (which offers a scikit-learn-compatible interface). A minimal training sketch follows the list.

- GradientBoostingClassifier
- SVC
- LogisticRegression
- DecisionTreeClassifier
- XGBClassifier
- RandomForestClassifier
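The sketch below trains and evaluates all six models; the 80/20 split and the `random_state` value are illustration choices, not documented settings of the project:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# df is the cleaned, encoded dataframe from the previous steps.
X = df.drop(columns=["id", "stroke"])   # features: drop the ID and the target
y = df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "GradientBoostingClassifier": GradientBoostingClassifier(),
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "XGBClassifier": XGBClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")
```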
We also tried regressor models just to see how they behaved. However, regressors are not a good choice for binary classification problems, because they are designed to predict continuous target variables rather than discrete classes.
Our models achieve the following performance:
| Classification algorithm | Accuracy | Accuracy with hyperparameter tuning |
|---|---|---|
| GradientBoostingClassifier | 83.73% | 83.73% |
| LogisticRegression | 79.20% | 79.47% |
| RandomForestClassifier | 99.32% | 99.32% |
| SVC | 79.58% | |
| DecisionTreeClassifier | 98.00% | 98.16% |
| XGBClassifier | 95.16% | |
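The tuned column comes from hyperparameter tuning; one common way to reproduce such results is scikit-learn's `GridSearchCV`. A minimal sketch for LogisticRegression (the parameter grid is illustrative, not the project's actual search space):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid only; the project's actual search space is not documented.
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Tuned test accuracy: {search.score(X_test, y_test):.2%}")
```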