From 3b4b11a4aea4b5b8424805641771c20cfb785d71 Mon Sep 17 00:00:00 2001
From: Sarah Eshafi
Date: Sun, 8 Dec 2024 10:36:30 -0800
Subject: [PATCH] editing quarto file

---
 README.md                              |  20 ++
 reports/heart_diagnostic_analysis.html | 224 +++++++++++++++++-
 reports/heart_diagnostic_analysis.qmd  |  49 +++-
 .../libs/bootstrap/bootstrap.min.css   |   4 +-
 scripts/3_eda.py                       |   2 -
 5 files changed, 276 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 53ff740..383cdb4 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,26 @@ Copy the link from the output (the link would look like below) and paste it to
 your browser and change the port number from `8888` to `9999` to launch jupyter notebook.
 ![Jupyter-lab](img/9999.png)
 
+#### 2\. Running the Analysis
+To run the analysis, open a terminal and run the following commands:
+```
+python scripts/1_download_decode_data.py --id=45 --write-to=data/raw
+
+python scripts/2_data_split_validate.py --split=0.1 --raw-data=data/raw/pretransformed_heart_disease.csv --write-to=data/processed
+
+python scripts/3_eda.py --train data/processed/train_df.csv --test data/processed/test_df.csv --write-to results
+
+python scripts/4_training_models.py --train data/processed/train_df.csv --write-to results
+
+python scripts/5_evaluate.py --train data/processed/train_df.csv --test data/processed/test_df.csv --write-to results
+
+quarto render reports/heart_diagnostic_analysis.qmd --to html
+quarto render reports/heart_diagnostic_analysis.qmd --to pdf
+```
+
+#### 3\. Clean Up
+To shut down the container and clean up the resources, type Ctrl + C in the terminal where you launched the container, and then type `docker compose rm`.
+
 ## Dependencies
 - conda (version 24.7.1 or higher)
 - conda-lock (version 2.5.7 or higher)

diff --git a/reports/heart_diagnostic_analysis.html b/reports/heart_diagnostic_analysis.html
index ffb5584..ea0df55 100644
--- a/reports/heart_diagnostic_analysis.html
+++ b/reports/heart_diagnostic_analysis.html

2. INTRODUCTION

  • Thalassemia: categorical feature indicating whether the patient suffered from Thalassemia
  • The following sections discuss the decisions made and the results of our exploratory data analysis, machine learning model training, and final model performance


This report also drew information from the studies reported in (O’Flaherty et al. 2008) and (Athanasios Aessopos 2007).

    3. DATA VALIDATION & CLEANING


    4. METHOD

    4.1 EDA

In this section, a preliminary analysis is conducted to get an idea of possible correlations between features to watch out for. The results are presented below:
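A minimal sketch of how figures like the ones below could be produced is shown here (hypothetical: the file path, the pandas/matplotlib approach, and the column handling are assumptions; the project's actual `scripts/3_eda.py` may differ):

```python
# EDA sketch (assumptions: pandas/matplotlib, processed training split at this path).
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_csv("data/processed/train_df.csv")
categorical = train_df.select_dtypes(exclude="number")
numeric = train_df.select_dtypes(include="number")

# Distributions of categorical features (cf. Figure 1).
for col in categorical.columns:
    categorical[col].value_counts().plot(kind="bar", title=col)
    plt.show()

# Distributions of numeric features (cf. Figure 2).
numeric.hist(figsize=(10, 8))
plt.show()

# Correlation matrix of the numeric features (cf. Figure 3).
corr = numeric.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.show()
```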

Figure 1: Distributions of Categorical Features

Figure 2: Distributions of Numeric Features

Figure 3: Correlation Matrix

    4.2.5. Model Evaluation

    Model evaluation and selection

Comparing the metrics across models in Table 1, balanced logistic regression yields the highest recall and the second-highest f1 score. For this report, we choose to proceed with LogisticRegression(class_weight="balanced") and optimize the f1 score along with recall, since we want to minimize false negatives, which are more damaging in medical diagnosis than false positives.
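A minimal sketch of how such a cross-validation comparison could be assembled is shown below (hypothetical: scikit-learn, a `diagnosis` target column, and a generic preprocessing pipeline are all assumptions; the project's actual training script may differ). Table 1 then summarizes the scores.

```python
# Sketch: cross-validate the candidate models compared in Table 1.
# Assumptions: scikit-learn, a binary "diagnosis" target column, generic preprocessing.
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

train_df = pd.read_csv("data/processed/train_df.csv")
X_train = train_df.drop(columns=["diagnosis"])  # assumed target column name
y_train = train_df["diagnosis"]

# Scale numeric features, one-hot encode the categorical ones.
preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude="number")),
)

models = {
    "dummy": DummyClassifier(),
    "logreg": make_pipeline(preprocessor, LogisticRegression(max_iter=1000)),
    "svc": make_pipeline(preprocessor, SVC()),
    "logreg_bal": make_pipeline(preprocessor, LogisticRegression(max_iter=1000, class_weight="balanced")),
    "svc_bal": make_pipeline(preprocessor, SVC(class_weight="balanced")),
}

scoring = ["accuracy", "precision", "recall", "f1"]
results = {
    name: pd.DataFrame(
        cross_validate(model, X_train, y_train, cv=5, scoring=scoring, return_train_score=True)
    ).mean()
    for name, model in models.items()
}
print(pd.DataFrame(results).round(3))  # one column per model, as in Table 1
```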

Table 1: Comparison of cross-validation scores across model options.

| dummy | logreg | svc   | logreg_bal | svc_bal |
|------:|-------:|------:|-----------:|--------:|
| 0.002 | 0.01   | 0.009 | 0.01       | 0.01    |
| 0.011 | 0.007  | 0.006 | 0.006      | 0.008   |
| 0.746 | 0.797  | 0.818 | 0.731      | 0.747   |
| 0.746 | 0.85   | 0.896 | 0.822      | 0.916   |
| 0     | 0.638  | 0.824 | 0.477      | 0.524   |
| 0     | 0.789  | 0.969 | 0.609      | 0.779   |
| 0     | 0.44   | 0.38  | 0.68       | 0.62    |
| 0     | 0.56   | 0.61  | 0.84       | 0.94    |
| 0     | 0.517  | 0.512 | 0.555      | 0.563   |
| 0     | 0.655  | 0.747 | 0.706      | 0.851   |

4.2.6. Hyperparameter optimization

Hyperparameter optimization was conducted on the regularization strength C of the balanced logistic regression, with 50 candidate values tested across a wide range.
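A sketch of what this search could look like (hypothetical: it reuses the `preprocessor`, `X_train`, and `y_train` names from the sketch in 4.2.5, and the C range and the choice of F1 as the tuning metric are assumptions based on the discussion above):

```python
# Sketch: grid search over 50 candidate C values for balanced logistic regression.
# Assumptions: scikit-learn; preprocessor/X_train/y_train come from the 4.2.5 sketch;
# the 1e-3 to 1e3 range is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000, class_weight="balanced"))
param_grid = {"logisticregression__C": np.logspace(-3, 3, 50)}  # 50 values over a wide range

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```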

    4.2.7 Final model Scoring and Evaluation

With hyperparameters selected, the best model is fitted on the training set and then scored on both data sets. F1 score, recall, and accuracy are all computed across both data sets.
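A minimal sketch of this scoring step follows (hypothetical: it continues the earlier sketches, and the test split path and target column name are assumptions); the confusion matrix and metrics themselves are reported in Figure 4 and Table 2 below.

```python
# Sketch: score the tuned model on both splits.
# Assumptions: `search`, X_train, y_train from the earlier sketches; a binary
# "diagnosis" target; the processed test split path used by the project scripts.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score

test_df = pd.read_csv("data/processed/test_df.csv")
X_test = test_df.drop(columns=["diagnosis"])
y_test = test_df["diagnosis"]

best_model = search.best_estimator_  # refit on the full training split by GridSearchCV
for name, X, y in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = best_model.predict(X)
    print(
        f"{name}: F1={f1_score(y, pred):.3f} "
        f"Recall={recall_score(y, pred):.3f} "
        f"Accuracy={accuracy_score(y, pred):.3f}"
    )
```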

Figure 4: Confusion Matrix on Test Data

Table 2: Best model metrics.

| Metric   | Train | Test  |
|----------|------:|------:|
| F1 Score | 0.627 | 0.333 |
| Recall   | 0.74  | 0.4   |
| Accuracy | 0.777 | 0.636 |

    5. Results/Discussion:


The model created is promising. Applying it to our data set gave approximately 63.6 percent accuracy. This value is close to our baseline dummy accuracy; however, with hyperparameter tuning the model achieved a higher F1 score than the original model (a train F1 score of 0.627 compared to a 0.609 F1 score for the base model's cross-validation). The final F1 score on the test data is 0.333. Most importantly for our application, the final model performed moderately well at minimizing false negatives.

On our testing data set, the model performed moderately well, returning a recall of 0.4. The discrepancy between training and testing scores may be due to the fact that the test data set was quite small (22 examples). To get more rigorous performance testing and more confidence in our results, it would be recommended to seek further data.

    6. Conclusion


The model created showed some promise, correctly classifying the presence of angiographic coronary disease with a decent level of accuracy (63.6 percent). It also performed moderately well on F1 score and limited the number of false negatives classified (recall of 0.4).

There are some limitations to this report that should be noted, at both the analysis level and the application level.

    On the analysis side, only 2 models were tested. While their performance was encouraging, a more rigorous approach would test a variety of classifiers before proceeding with logistic regression.

As well, further hyperparameter optimization could be conducted. While a wide range of C-values was covered, only 50 candidate values were tested from this range. An improvement would be to randomly sample from a log-uniform distribution to obtain our best C value, as sketched below.
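A sketch of that suggested improvement (hypothetical: it assumes scikit-learn and scipy and reuses the pipeline pieces from the earlier sketches; the range and number of iterations are illustrative):

```python
# Sketch of the suggested improvement: sample C from a log-uniform distribution
# rather than a fixed grid. Assumptions: scikit-learn, scipy, and the
# preprocessor/X_train/y_train defined in the earlier sketches.
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000, class_weight="balanced"))
param_dist = {"logisticregression__C": loguniform(1e-3, 1e3)}

random_search = RandomizedSearchCV(
    pipe, param_dist, n_iter=100, scoring="f1", cv=5, random_state=123, n_jobs=-1
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
```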


References

    Master, Arthur M. 1969. “The Extent of Completely Asymptomatic Coronary Artery Disease.” https://doi.org/10.1016/0002-9149(69)90064-2.
O’Flaherty, M., E. Ford, Steven Allender, P. Scarborough, and S. Capewell. 2008. “Coronary heart disease trends in England and Wales from 1984 to 2004: concealed levelling of mortality rates among young adults,” January. https://dro.deakin.edu.au/articles/journal_contribution/Coronary_heart_disease_trends_in_England_and_Wales_from_1984_to_2004_concealed_levelling_of_mortality_rates_among_young_adults/21047827.