This repository contains code for our final project, completed as part of the University of Texas at Austin - Data Analysis and Visualization bootcamp. The project will analyze data related to the Covid-19 pandemic, with the aim of producing a machine learning model that predicts the number of cases/fatalities that a given country can expect in the future.
- Mike Hankinson
- Tan Tran
- Luke Newell
- Keith Rabb
The Covid-19 pandemic has impacted the lives of billions of people all over the globe. Every member of the team has been affected in some way and we would like to use the skills we have learned to contribute to a better understanding of the spread of the virus.
The dataset we are using is owid-covid-data.csv. Our World in Data have collated the data from a number of different sources, which can be found here. They source their data from the following organizations:
- COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)
- European Centre for Disease Prevention and Control (ECDC)
- Official government reports
- United Nations
- World Bank
- Global Burden of Disease
The live dataset can be found at OurWorldinData.org and all data produced by Our World in Data are open access under the Creative Commons BY license.
- Based on available data, how many cases can a country expect to see over the next two weeks?
- Based on available data, how many fatalities can a country expect to see over the next two weeks?
We performed an introductory analysis of COVID-19 cases and deaths in the USA, after an initial visualization of the worldwide data showed that the USA has been badly affected compared to other countries. The team visualized the data using Matplotlib to compare the daily numbers with a weekly average. The weekly averaging removed the day-to-day fluctuations in the reported numbers, producing a smoother curve.
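The weekly averaging described above can be sketched with a pandas rolling mean. This is a minimal illustration rather than the project's actual notebook code: the daily counts are made up, and the 7-day window is an assumption based on the "weekly average" wording.

```python
import pandas as pd

# Hypothetical daily counts with a weekend reporting dip (illustrative values only).
dates = pd.date_range("2021-01-01", periods=14, freq="D")
daily = pd.Series(
    [100, 120, 130, 125, 140, 60, 50, 110, 135, 150, 145, 160, 70, 55],
    index=dates,
    name="new_cases",
)

# 7-day rolling mean; the first six days lack a full window and stay NaN.
weekly_avg = daily.rolling(window=7).mean()

# Side-by-side frame of the raw and smoothed series.
comparison = pd.DataFrame({"daily": daily, "weekly_avg": weekly_avg})
print(comparison.tail())
```

Calling `comparison.plot()` on this frame (with Matplotlib installed) draws both curves on one axis, which makes the smoothing effect easy to see.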
The analysis phase of the project consisted of a comparison of different machine learning models, testing how each model was able to accurately predict COVID-19 cases and deaths in the USA. The models we tested were: ARIMA, DNN, RNN and FBProphet. Each model had slightly different requirements for analyzing the data and provided the team with different insights into the optimal forecasting method for COVID-19. Further information can be found in the 'Overview of the Machine Learning Analysis' section below.
The following lists the technologies used for this project:
- Python
- Pandas
- NumPy
- TensorFlow
- Jupyter Notebook
- PostgreSQL
- pgAdmin
- Scikit-learn
- TensorFlow
- Keras
- FBProphet
- Matplotlib, Plotly
- ReadMe
- Google Slides
- Tableau
This model will attempt to answer the following questions regarding the COVID-19 pandemic:
- Based on available data, how many cases can a country expect to see over the next two weeks?
- Based on available data, how many fatalities can a country expect to see over the next two weeks?
In order to accomplish this task, we must employ a novel model type that was not presented within the Data Analytics Bootcamp. Rather than employing classification or clustering models, this analysis must incorporate a supervised regression ML model.
Machine learning models are not typically applied directly to time series data. Rather, they are usually trained using supervised learning, which expects data in the form of samples with inputs and outputs. However, it is possible to perform time series forecasting using ML. In order to do so, the time series data must first be transformed into a supervised learning problem.
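One common way to make this transformation is to pair each observation with the observations that came just before it. The sketch below assumes that approach; the function name `series_to_supervised` and the lag count of 3 are hypothetical and chosen only for illustration.

```python
import pandas as pd

def series_to_supervised(values, n_lags=3):
    """Turn a 1-D sequence into rows of lagged inputs plus a target column.

    Hypothetical helper for illustration; not taken from the project code.
    """
    s = pd.Series(values, dtype="float64")
    # Column "t-k" holds the value observed k steps before the target "t".
    columns = {f"t-{k}": s.shift(k) for k in range(n_lags, 0, -1)}
    columns["t"] = s
    frame = pd.DataFrame(columns)
    # The first n_lags rows have no full history, so drop them.
    return frame.dropna().reset_index(drop=True)

supervised = series_to_supervised([10, 20, 30, 40, 50, 60], n_lags=3)
X = supervised.drop(columns="t")  # inputs: the previous three observations
y = supervised["t"]               # output: the observation to predict
print(supervised)
```

Each row is now an ordinary (inputs, output) sample, so any supervised regression model can be fitted to `X` and `y`.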
- Applying Standard ML algorithms to Time-Series forecasting
- Convert a Time Series to a Supervised Learning Problem in Python
- Time series forecasting
- A Quick Deep Learning Recipe: Time Series Forecasting with Keras in Python
- How to Make Out-of-Sample Forecasts with ARIMA in Python
- Quick start Prophet
- Import dependencies
- Import data set from provisional database
- Preprocess Data
- Split the dataset into training and testing.
- Apply a standard scaler
- Difference the data to make it stationary
- Define / Develop Neural Network
- Fit model
- Build features for forecasting
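The scaling and differencing steps in the list above can be sketched with NumPy alone. This is an assumption-laden illustration: the values are made up, and the manual standardization stands in for a standard scaler (it uses the population standard deviation, as scikit-learn's `StandardScaler` does). In practice the mean and standard deviation would be computed on the training split only.

```python
import numpy as np

# Hypothetical series with an upward trend (illustrative values only).
series = np.array([100.0, 104.0, 110.0, 118.0, 128.0])

# Standard scaling: zero mean, unit variance.
mean, std = series.mean(), series.std()
scaled = (series - mean) / std

# First-order differencing: each value becomes the change from the previous step,
# which removes a linear trend and helps make the series stationary.
diffed = np.diff(scaled)

# Forecasts made on the differenced scale must be mapped back:
# cumulatively sum the differences from the first value, then undo the scaling.
restored_scaled = np.concatenate(([scaled[0]], scaled[0] + np.cumsum(diffed)))
restored = restored_scaled * std + mean
```

Keeping the inverse steps next to the forward steps makes it harder to forget that model outputs are in differenced, scaled units rather than case counts.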
To optimize our ML forecast, we decided to compare various models to understand which would provide the most accurate prediction of future COVID-19 cases and deaths in the USA. After researching models for time series forecasting, we settled on these models to test:
You can review the code for each model via the links above!
To preprocess the data for use in the machine learning models, we completed the following steps:
- Selected desired columns from the database
- Used the fillna function to replace any NaN values with 0
- Converted the date column to the datetime datatype
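The three preprocessing steps above can be sketched in pandas as follows. The small in-memory frame is a stand-in for the database table, and the exact set of selected columns is an assumption (`new_cases` and `new_deaths` come from the text; `location`, `date`, and `extra_column` are illustrative).

```python
import numpy as np
import pandas as pd

# Stand-in for the data loaded from the database (illustrative values only).
raw = pd.DataFrame({
    "location": ["United States"] * 3,
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
    "new_cases": [150.0, np.nan, 180.0],
    "new_deaths": [3.0, 4.0, np.nan],
    "extra_column": ["a", "b", "c"],  # hypothetical column that is not needed
})

# 1. Select the desired columns (assumed set).
df = raw[["location", "date", "new_cases", "new_deaths"]].copy()

# 2. Replace any NaN values with 0.
df = df.fillna(0)

# 3. Convert the date column from strings to the datetime datatype.
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
```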
For all three models, we decided to complete a time series forecast plotting datetime data against two features: new_cases and new_deaths in the United States. We chose these metrics because they provide the best insight into the spread of the virus and the effect it has on the country. Time series forecast models rely heavily on historical data to predict future values. As such, we are currently using just one feature for each model to predict either new cases or new deaths.
- DNN: Incorporated a helper function, convert2matrix, to reshape the dataset into the correct 2-D DNN input shape.
- RNN: Incorporated a helper function, convert2matrix, to reshape the dataset into the correct 3-D RNN input shape: (batch_size, window size, input_features).
- FBProphet: Used a dataframe containing the historical data with two columns: ds and y (date and feature).
- DNN: For training, we used 75% of the available data minus a 'look back' window of 15 days. The testing data used the remainder of the dataset. The model was built with one hidden layer using the rectified linear unit (ReLU) activation function. The model tracked both the training and testing loss over a maximum of 100 epochs.
- RNN: For training, we used 75% of the available data minus a 'look back' window of 15 days. The testing data used the remainder of the dataset. Unlike the DNN, the RNN model was built with two hidden layers, again using the rectified linear unit (ReLU) activation function. The model tracked both the training and testing loss over a maximum of 100 epochs.
- FBProphet: The training data used a dataframe with all the historical data. For testing, a new dataframe (future) was created to store the future dates and the predicted values were then populated by the model.
- DNN: The Deep Neural Network model was chosen because it works well with a complete dataset and can be used with both univariate and multivariate data. A limitation of the model is that it is affected by lagged correlation, but it still captures the overall trends with a high degree of accuracy.
- RNN: The Recurrent Neural Network is a type of artificial neural network specifically designed to work with time series or sequential data. These models are somewhat unique in that they maintain a 'memory': they use information from prior inputs to influence the current input and output, so the inputs and outputs of this model type are not independent of one another.
- FBProphet: This model was chosen because it is suited to time series forecasting and uses past trends to predict future values. One limitation is that the model relies heavily on seasonality, which reduces its benefit here: we only have ~420 days of data, so yearly trends cannot be determined. However, the model does allow for weekly and monthly variations, which is particularly useful when analyzing COVID-19 metrics.
- DNN: This model was trained over 100 epochs using the Adam optimizer.
- RNN: This model was trained over 100 epochs using the Adam optimizer.
- FBProphet: This model was trained using the FBProphet fit function with over 420 days of historical data.
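The input shapes and the 75% split described in the bullets above can be sketched as follows. The body of `convert2matrix` is a hypothetical re-implementation (only its name and purpose appear in this README), the series is a numeric stand-in for about 420 days of data, and the start date is arbitrary. Splitting first and windowing second is one reading of "75% of the available data minus a look back window"; the project code may order these steps differently.

```python
import numpy as np
import pandas as pd

LOOK_BACK = 15  # look-back window in days, as stated above

def convert2matrix(values, look_back):
    """Hypothetical re-implementation: slide a window of length look_back over the series."""
    X, y = [], []
    for i in range(len(values) - look_back):
        X.append(values[i:i + look_back])  # the previous look_back observations
        y.append(values[i + look_back])    # the observation that follows them
    return np.array(X), np.array(y)

# Stand-in for ~420 days of daily counts (illustrative values only).
values = np.arange(420, dtype="float64")

# Chronological 75/25 split; time series data is not shuffled.
split = int(len(values) * 0.75)
train, test = values[:split], values[split:]

# 2-D DNN input: (samples, window size). Each split loses LOOK_BACK samples.
X_train, y_train = convert2matrix(train, LOOK_BACK)
X_test, y_test = convert2matrix(test, LOOK_BACK)

# 3-D RNN input: (samples, window size, input_features), with one feature here.
X_train_rnn = X_train.reshape(X_train.shape[0], LOOK_BACK, 1)

# Prophet-style frame with the two required columns, ds (date) and y (feature).
dates = pd.date_range("2020-01-22", periods=len(values), freq="D")  # arbitrary start date
prophet_df = pd.DataFrame({"ds": dates, "y": values})

print(X_train.shape, X_test.shape, X_train_rnn.shape)
```

With 420 observations, the split yields 315 training days and 105 testing days, which become 300 and 90 windowed samples respectively once the 15-day look-back is taken out.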
We plan to train the FBProphet models further, as the accuracy score for the deaths model is currently much lower than desired.
The final accuracy scores for the machine learning models used in the project can be found above. This chart provides a direct comparison between the models, using a 7-day rolling average of COVID-19 cases and deaths. As we looked two weeks into the future, the models performed reasonably well in the short term, but may require additional training to reach the same accuracy over a longer period of time. The DNN and RNN models performed best, but may be overfit and reliant on recent data to develop their predictions. The FBProphet cases model performed well, but did not provide an accurate result when used with the deaths data. If we were to apply the same univariate approach to other countries, we would use the simplest of the three models (DNN), as it requires less time and computing power to train while still providing a consistently accurate result.
- Added additional features to the machine learning model to include vaccination rates
- Predicted the total number of cases instead of the daily cases, as the cumulative number is more consistent, which may be easier for the ML models to interpret than the daily number
- Compared the FBProphet and the DNN/RNN models graphically
- Applied a multivariate model with numerous features using the RNN model because of the ability to maintain "memory" through a feedback loop.
To visualize the data, we are using Tableau with the COVID data imported from our database in pgAdmin. We have created a number of plots that show the spread of the virus around the world, including interactive elements that are described below. From these visualizations, the user can understand which countries have been most affected by the coronavirus, where the most new cases are arising, and which countries have vaccinated the largest number of their residents.
You can find the Interactive Dashboard here and here where the user can click on the world map and select their country of choice as well as interact with model outputs. You can also check out our Vaccination Visualization which shows the current progress of vaccinations in the USA.
- Website Group9webpage.com
- Presentation in Google Slides Group 9 Presentation
- Team Practice Session