Course: ENCS5341 - Machine Learning and Data Science
Institution: Electrical and Computer Engineering Department, Birzeit University
- Project Overview
- Dataset Description
- Models and Methods
- Results
- Conclusion
- Installation and Usage
- Repository Structure
- Contributors
- License
- Acknowledgments
Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming approximately 17.9 million lives each year. Early detection and management are crucial to reduce mortality rates. This project explores the application of machine learning techniques to predict the likelihood of heart failure in patients using clinical and demographic data.
The primary objectives of this project are:
- To implement and evaluate different machine learning models for predicting heart failure.
- To perform exploratory data analysis (EDA) to understand the dataset.
- To tune hyperparameters for optimal model performance.
- To analyze the performance of the best model and interpret the results.
The dataset used in this project is the Heart Failure Prediction Dataset obtained from Kaggle. It contains 918 patient records with 12 attributes (11 input features plus the HeartDisease target), including both numerical and categorical features.
- Age: Age of the patient (years)
- Sex: Sex of the patient (M, F)
- ChestPainType: Type of chest pain experienced
- RestingBP: Resting blood pressure (mm Hg)
- Cholesterol: Serum cholesterol (mg/dl)
- FastingBS: Fasting blood sugar (1 if > 120 mg/dl, else 0)
- RestingECG: Resting electrocardiogram results
- MaxHR: Maximum heart rate achieved
- ExerciseAngina: Exercise-induced angina (Y, N)
- Oldpeak: ST depression induced by exercise relative to rest
- ST_Slope: The slope of the peak exercise ST segment
- HeartDisease: Output class (1 for presence, 0 for absence)
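As a quick check of the description above, the following sketch summarizes a loaded copy of the dataset. The helper name `summarize_dataset` is hypothetical (it is not taken from the notebook), and the CSV path in the comment is assumed from the repository layout listed in this README.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target: str = "HeartDisease") -> dict:
    """Return row/column counts, feature types, missing values, and class counts."""
    features = [c for c in df.columns if c != target]
    # Treat every non-numeric feature as categorical.
    numerical = [c for c in features if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in features if c not in numerical]
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "numerical": numerical,
        "categorical": categorical,
        "missing_values": int(df.isna().sum().sum()),
        "class_counts": df[target].value_counts().to_dict(),
    }


# Example usage (path assumed from the repository structure):
# data = pd.read_csv("data/heart_failure_data.csv")
# print(summarize_dataset(data))
```

On the real file, `n_rows` and `n_columns` should match the 918 records and 12 attributes stated above.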
- Handling Missing Values: The dataset had no missing values.
- Outlier Detection: Identified and treated outliers using the Interquartile Range (IQR) method.
- Encoding Categorical Variables: Converted categorical variables to numerical using label encoding.
- Feature Selection: Used Sequential Feature Selection with a Gradient Boosting Classifier to select the most relevant features.
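The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the README says outliers were "treated" without saying how, so clipping to the 1.5 × IQR fences is an assumption, and the helper names, the number of features to keep, and the cross-validation settings are all hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import LabelEncoder


def clip_outliers_iqr(df, columns, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier treatment)."""
    out = df.copy()
    for col in columns:
        values = out[col].astype(float)
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        out[col] = values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out


def label_encode(df, columns):
    """Replace each categorical column with integer codes; also return the fitted encoders."""
    out = df.copy()
    encoders = {}
    for col in columns:
        encoder = LabelEncoder()
        out[col] = encoder.fit_transform(out[col])
        encoders[col] = encoder
    return out, encoders


def select_features(X, y, n_features, cv=5, estimator=None):
    """Forward sequential feature selection scored with a Gradient Boosting classifier."""
    if estimator is None:
        estimator = GradientBoostingClassifier(random_state=0)
    selector = SequentialFeatureSelector(
        estimator,
        n_features_to_select=n_features,  # assumed; the README does not state how many were kept
        direction="forward",
        cv=cv,
    )
    selector.fit(X, y)
    return list(X.columns[selector.get_support()])
```

Label encoding assigns codes in sorted order of the class labels, so for `Sex` the value `F` maps to 0 and `M` to 1.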
- K-Nearest Neighbors (KNN)
  - K Values Tested: 1 to 10
  - Best K Value: 7
  - Performance:
    - Accuracy: 81.16%
    - Precision: 84.83%
    - Recall: 80.39%
    - F1 Score: 82.55%
- Random Forest (Selected Model)
  - Hyperparameter Tuning:
    - Number of Estimators: 50
    - Criterion: 'gini'
    - Max Depth: None
    - Min Samples Split: 4
    - Min Samples Leaf: 1
  - Performance:
    - Training Accuracy: 98.13%
    - Testing Accuracy: 84.06%
    - Cross-Validation Accuracy: 88.17%
    - Precision: 86.58%
    - Recall: 84.31%
    - F1 Score: 85.43%
- Support Vector Machine (SVM)
  - Performance:
    - Training Accuracy: 91.90%
    - Testing Accuracy: 82.97%
    - Cross-Validation Accuracy: 87.24%
    - Precision: 82.75%
    - Recall: 82.89%
    - F1 Score: 82.81%
- Multilayer Perceptron (MLP)
  - Performance:
    - Training Accuracy: 90.34%
    - Testing Accuracy: 83.70%
    - Cross-Validation Accuracy: 86.93%
    - Precision: 85.06%
    - Recall: 85.62%
    - F1 Score: 85.34%
- Logistic Regression
  - Performance:
    - Training Accuracy: 85.67%
    - Testing Accuracy: 81.16%
    - Cross-Validation Accuracy: 84.43%
    - Precision: 84.83%
    - Recall: 80.39%
    - F1 Score: 82.55%
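The README does not say how the K value was chosen or which hyperparameter grid was searched, so the sketch below shows one plausible way to do both with scikit-learn. The helper names `best_k_for_knn` and `tune_random_forest` are hypothetical, scaling the features before KNN is an added assumption, and the default grid is illustrative only (it simply contains the best values reported above).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def best_k_for_knn(X, y, k_values=range(1, 11), cv=5):
    """Return the k with the highest mean cross-validated accuracy, plus all scores."""
    scores = {}
    for k in k_values:
        # Scaling is an assumption; KNN is distance-based, so unscaled features can dominate.
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores[k] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return max(scores, key=scores.get), scores


def tune_random_forest(X, y, param_grid=None, cv=5, random_state=42):
    """Grid-search a random forest and return the fitted GridSearchCV object."""
    if param_grid is None:
        # Illustrative grid, not the one used in the project.
        param_grid = {
            "n_estimators": [50, 100],
            "criterion": ["gini", "entropy"],
            "max_depth": [None, 10],
            "min_samples_split": [2, 4],
            "min_samples_leaf": [1, 2],
        }
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected combination and `search.best_score_` the corresponding mean cross-validation accuracy.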
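The Random Forest model outperformed the other models, achieving the highest testing accuracy (84.06%) and cross-validation accuracy (88.17%), with balanced precision, recall, and F1-score. Its classification report on the 276 test samples is shown below.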
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
Negative (0) | 81% | 84% | 82% | 123 |
Positive (1) | 87% | 84% | 85% | 153 |
Accuracy | | | 84% | 276 |
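To make the link between the table and the headline metrics explicit, the sketch below recomputes them from confusion-matrix counts. The counts themselves are not stated in the report; they are back-computed here from the reported percentages and class supports (123 negative, 153 positive), so treat them as an inference. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, and F1 for the 'positive' role, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts inferred from the reported metrics (assumption, not taken from the report).
positive = binary_metrics(tp=129, fp=20, fn=24, tn=103)
# Swapping the roles of the two classes gives the metrics for the negative class.
negative = binary_metrics(tp=103, fp=24, fn=20, tn=129)

for name, metrics in (("Positive (1)", positive), ("Negative (0)", negative)):
    print(name, {key: f"{value:.2%}" for key, value in metrics.items()})
```

With these counts the positive-class values come out to 86.58% precision, 84.31% recall, and 85.43% F1 at 84.06% accuracy, matching the Random Forest figures reported above.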
- The Random Forest model effectively captured the complex relationships in the data.
- Hyperparameter tuning significantly improved model performance.
- The selected features contributed positively to the model's predictive capability.
This project demonstrates the potential of machine learning models in predicting heart failure risk using clinical data. The Random Forest model, with optimized hyperparameters, provided the best performance. The findings underscore the importance of feature selection and hyperparameter tuning in developing effective predictive models in healthcare.
- Overfitting: The Random Forest's training accuracy (98.13%) is well above its testing accuracy (84.06%), which indicates overfitting.
- Data Imbalance: Slight imbalance in the target classes may affect model performance.
- Generalizability: The model's applicability to other datasets or populations requires further validation.
- Python 3.x
- Jupyter Notebook or JupyterLab
- Required Python libraries: numpy, pandas, matplotlib, seaborn, scikit-learn
- Clone the Repository
git clone https://github.com/yourusername/heart-failure-prediction.git
- Navigate to the Project Directory
cd heart-failure-prediction
- Install Required Libraries
pip install -r requirements.txt
Open the Jupyter Notebook in your preferred environment:
jupyter notebook "Heart_Failure_Prediction - ML Code.ipynb"
Execute the cells sequentially to reproduce the analysis and results.
heart-failure-prediction/
├── data/
│   └── heart_failure_data.csv
├── images/
│   └── eda_plots/
│       ├── distribution_age.png
│       ├── correlation_heatmap.png
│       └── ...
├── Project_Description.pdf
├── Heart_Failure_Prediction - ML Report.pdf
├── Heart_Failure_Prediction - ML Code.ipynb
├── README.md
├── requirements.txt
└── LICENSE
- data/: Contains the dataset used in the project.
- images/: Visualizations and plots generated during EDA (not yet available).
- Project_Description.pdf: Detailed project description document.
- Heart_Failure_Prediction - ML Report.pdf: Final report summarizing the machine learning analysis and results.
- Heart_Failure_Prediction - ML Code.ipynb: The main Jupyter Notebook with all code and analysis.
- requirements.txt: List of Python libraries required.
- LICENSE: License information.
- Eyab Ghifari
- Hamza Awashra
Instructor: Dr. Yazan Abu Farha
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset Source: Heart Failure Prediction Dataset by Fedesoriano on Kaggle
- Institution: Birzeit University
- Course: ENCS5341 - Machine Learning and Data Science
This project was completed as part of the coursework for ENCS5341 at Birzeit University, aiming to apply machine learning techniques to a real-world healthcare problem.