---
title: "Bayesian Statistics"
subtitle: "Final Project"
author: "Leonardo Stincone"
date: "17th June 2019"
output:
  html_document:
    toc: true
editor_options: 
  chunk_output_type: console
---

\newcommand*\diff{\mathop{}\mathrm{d}}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align='center')
```

```{r, libraries, message = F}
# Libraries

library(SemiPar)   # For the dataset
library(rstan)
library(rstanarm)
library(GGally)
library(arm)
library(gridExtra)
library(plotly)
library(mgcv)       # For GAM model
# library(gamm4)      # For GAM model
library(loo)        # For bayesian model comparison
library(lme4)
library(bayesplot)
library(knitr)
library(kableExtra) # For kable style table
library(plyr)       # For the mapvalues function; loaded before tidyverse so dplyr verbs are not masked
library(tidyverse)

# Set plot themes
theme_set(theme_bw())
rstan_options(auto_write = TRUE)
```


# Problem description

In this project, I analysed a dataset of daily deaths in Milan. For each day between 01/01/1980 and 31/12/1989, the dataset contains:

* the total number of deaths in Milan (`tot.mort`);
* the number of deaths from respiratory causes in Milan (`resp.mort`).

The following explanatory variables are also available for each day:

* the mean temperature (`mean.temp`);
* the relative humidity (`rel.humid`);
* the sulphur dioxide level in ambient air (`SO2`);
* the total suspended particles in ambient air (`TSP`).

The dataset is described in [Vigotti, M.A., Rossi, G., Bisanti, L., Zanobetti, A. and Schwartz, J. (1996). Short term effect of urban air pollution on respiratory health in Milan, Italy, 1980-1989. *Journal of Epidemiology and Community Health*, 50, 71-75.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1060893/) and analysed in [Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). *Semiparametric Regression*. Cambridge University Press.](http://stat.tamu.edu/~carroll/semiregbook/).

The aim of the project is to build predictive models for `tot.mort` and `resp.mort` and to understand which variables most affect the probability of death.

# Exploratory Data Analysis

```{r, readData, cache = T}
theme_set(theme_bw())

data(milan.mort)

mort <- as_tibble(milan.mort)
rm(milan.mort)
```

```{r dataWrangling1, cache = T}
# Day of week
mort <- mort %>% 
  mutate(day.of.week = factor(day.of.week))

mort <- mort %>% 
  mutate(SO2_log = log(SO2 + 30))  # shifted log transform to reduce the skewness of SO2

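# Approximate calendar variables: every year is treated as 365 days long
# (leap days are ignored), so day.of.year runs from 0 to 364 and the last
# few days of the series fall in year 10
mort <- mort %>% 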
  mutate(day.of.year = day.num %% 365,
         year = day.num %/% 365,
         day.date = as.Date("1979-12-31") + day.num) %>% 
  select(day.num, day.date, day.of.year, year, day.of.week, holiday,
         mean.temp, rel.humid, SO2, SO2_log, TSP,
         everything())

```
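
The chunk above derives `day.of.year` and `year` from `day.num` with modulo arithmetic, which treats every year as 365 days long. As a rough sketch of how this compares with calendar-based values, the following snippet (not part of the analysis; `day_num` and the derived names are illustrative) computes both versions for a few days around the end of 1980, which was a leap year.

```r
# Assumption: day.num = 1 corresponds to 1980-01-01, as in the chunk above
day_num  <- c(1, 60, 365, 366, 367)
day_date <- as.Date("1979-12-31") + day_num

# Modulo-based values, as computed in the chunk above
doy_mod  <- day_num %% 365
year_mod <- day_num %/% 365

# Calendar-based values (POSIXlt counts yday from 0 and year from 1900)
lt       <- as.POSIXlt(day_date)
doy_cal  <- lt$yday + 1
year_cal <- lt$year + 1900

data.frame(day_num, day_date, doy_mod, year_mod, doy_cal, year_cal)
```

The two versions drift apart by one day after each leap day, which is small compared with the seasonal scale the models work on.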


## Univariate descriptive analysis

### Response variables

The following plots show the distributions of `tot.mort` and `resp.mort`. As we can see, the values of `tot.mort` are large enough to treat it as a continuous variable, while `resp.mort` takes much smaller values and is better treated as a count.

```{r, cache = T}
p1 <- ggplot(mort,
       aes(x = tot.mort)) +
  geom_histogram(binwidth = 2)

p2 <- ggplot(mort,
       aes(x = resp.mort)) +
  geom_bar()

grid.arrange(p1, p2,
             ncol = 2)
```


#### Milan population

The following plot shows the evolution of the Milan population according to the ISTAT censuses. The data can be found [here](https://www.tuttitalia.it/lombardia/18-milano/statistiche/censimenti-popolazione/).
As we can see, the population decreased between 1980 and 1990.
To account for the shrinking population at risk, I computed the daily population by linear interpolation between the census dates and used it as an exposure in the models.
With this information I computed the mortality rate as:
$$
\text{Mortality rate} = \frac{\text{tot.mort}}{\text{population}}
$$
where the population is expressed in millions, so the rate is the number of deaths per million inhabitants.
Note that this is a crude mortality rate, not an age-specific one, so it does not account for the age distribution of the population. Since the Milan population aged during the period 1980-1990, the crude mortality rate would have increased even if the mortality at each age had not changed.
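
To make the last remark concrete, here is a minimal sketch with two age groups. The rates and population shares are hypothetical numbers chosen for illustration, not Milan data.

```r
# Hypothetical daily age-specific death rates (per person); not real data
rate_young <- 1e-5
rate_old   <- 1e-4

# The crude rate is the population-weighted average of the age-specific rates
crude_rate <- function(share_old) {
  (1 - share_old) * rate_young + share_old * rate_old
}

crude_rate(0.15)  # 15% of the population in the older group
crude_rate(0.20)  # 20% in the older group, same age-specific rates
```

The crude rate rises from the first call to the second although neither age-specific rate has changed.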

```{r, census_readData, message = F, cache = T}
census <- read_csv("data/milanoCensus.csv",
         col_names = c("year", "date", "pop"))

census <- census %>% 
  mutate(pop_m = pop/1e6,
         pop_k = pop/1e3,
         interest = (year %in% c(1971, 1981, 1991)))
```


```{r, census_plot, cache = T}
ggplot(data = census,
       aes(x = year, y = pop_m)) +
  geom_point(size = 2) +
  geom_area(alpha = .2, color = "black", fill = "blue") +
  scale_y_continuous(limits = c(0, max(census$pop_m) + .1)) +
  scale_x_continuous(breaks = census$year,
                     minor_breaks = census$year) +
  labs(x = "year", y = "population (millions)",
       title = "Milan population", subtitle = "Source: ISTAT census")
```

```{r, popInterpolation, cache = T}
census <- census %>% 
  separate(date, into = c("day", "mese")) %>% 
  mutate(month = as.integer(factor(mese, levels = c("gennaio",
                                         "febbraio",
                                         "marzo",
                                         "aprile",
                                         "maggio",
                                         "giugno",
                                         "luglio",
                                         "agosto",
                                         "settembre",
                                         "ottobre",
                                         "novembre",
                                         "dicembre")))) %>% 
  mutate(date = ISOdate(year, month, day))
  

day_num <- (365*10+2)

pop_cens1 <- census$pop_m[census$year == 1981]
pop_cens2 <- census$pop_m[census$year == 1991]

delta_pop1 <- (census$pop_m[census$year == 1981] - census$pop_m[census$year == 1971]) /
  as.integer(difftime(census$date[census$year == 1981], census$date[census$year == 1971], units = "days"))

delta_pop2 <- (census$pop_m[census$year == 1991] - census$pop_m[census$year == 1981]) /
  as.integer(difftime(census$date[census$year == 1991], census$date[census$year == 1981], units = "days"))


# mort

day_cens1 <- as.integer(difftime(census$date[census$year == 1981], ISOdate(1979, 12, 31), units = "days"))
day_cens2 <- as.integer(difftime(census$date[census$year == 1991], ISOdate(1979, 12, 31), units = "days"))


mort <- mort %>% 
  mutate(interp = day.num < day_cens1) %>% 
  mutate(pop_m = ifelse(interp,
                        pop_cens1 - delta_pop1 * (day_cens1 - day.num),
                        pop_cens1 + delta_pop2 * (day.num - day_cens1)
                        ))

mort <- mort %>% 
  mutate(tot_mort_prob = tot.mort / pop_m,
         resp_mort_prob = resp.mort / pop_m)
```
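
The piecewise-linear interpolation in the chunk above can also be written with `stats::approx`. The following is only a sketch: the census days and population values are made-up placeholders, not the ISTAT figures read from `data/milanoCensus.csv`.

```r
# Hypothetical census days (days since 1979-12-31) and populations (millions)
cens_day <- c(-3000, 660, 4310)
cens_pop <- c(1.70, 1.60, 1.40)

# Linear interpolation on a daily grid covering the study period
day_num <- 1:3652
pop_m   <- approx(x = cens_day, y = cens_pop, xout = day_num)$y

range(pop_m)
```

`approx` switches slope at the middle census date automatically, which replaces the explicit `ifelse` on `day_cens1`.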


### Explanatory variables

The following plots show the distributions of the explanatory variables. As we can see, `SO2` is strongly right-skewed. To deal with this, I applied a shifted logarithmic transformation: `SO2_log = log(SO2 + 30)`.

```{r, explanatory_plot, fig.width=10, fig.height=8, cache = T, message = F}
mort %>% 
  dplyr::select(mean.temp, rel.humid, SO2, SO2_log, TSP) %>% 
  gather(key = "variable", value = "value") %>% 
  ggplot(aes(x = value, y = ..density..)) +
  geom_histogram() +
  facet_wrap(~variable, ncol = 2, scales = "free") +
  labs(x = "")

# p1 <- ggplot(mort,
#        aes(x = mean.temp, y = ..density..)) +
#   geom_histogram(binwidth = 1)
# 
# p2 <- ggplot(mort,
#        aes(x = rel.humid, y = ..density..)) +
#   geom_histogram(binwidth = 3)
# 
# p3 <- ggplot(mort,
#        aes(x = SO2, y = ..density..)) +
#   geom_histogram(binwidth = 15)
# 
# p4 <- ggplot(mort,
#        aes(x = SO2_log, y = ..density..)) +
#   geom_histogram(binwidth = .1)
# 
# p5 <- ggplot(mort,
#        aes(x = TSP, y = ..density..)) +
#   geom_histogram(binwidth = 10)
# 
# grid.arrange(p1, p2, p3, p4, p5,
#              ncol = 2)
```


## Multivariate descriptive analysis


### Seasonal effect

#### Mortality

<!-- In the following plots, we can see the evolution of both total and respiratory mortality rates through time. -->
The following plots show both the total and the respiratory mortality rates over time. As we can see, there is a strong seasonal effect, with higher mortality in winter and lower mortality in summer. We can also spot some outliers in summer 1983 and in the first quarter of 1986, when the mortality was much higher than in the same period of other years.

```{r, seasonal_totMort, fig.width=10, fig.height=6, cache = T}
mort %>% 
  dplyr::select(day.date, year, tot_mort_prob, resp_mort_prob) %>% 
  gather(key = "variable", value = "value", tot_mort_prob, resp_mort_prob) %>% 
  # mutate(variable = plyr::mapvalues(factor(variable), from = c("tot_mort_prob", "resp_mort_prob"),
  #                             to = c("Mort. Rate", "Resp. Mort. Rate")),
  #        variable = ordered(variable, levels = c("Mort. Rate", "Resp. Mort. Rate"))) %>% 
  mutate(variable = plyr::mapvalues(factor(variable, levels = c("tot_mort_prob", "resp_mort_prob")),
                                    from = c("tot_mort_prob", "resp_mort_prob"),
                              to = c("Mort. Rate", "Resp. Mort. Rate"))) %>% 
  ggplot(aes(x = day.date, y = value)) +
  geom_point(aes(col = factor(year)),
             size = 1, alpha = 1) +
  geom_smooth(method = "loess", span = .1) +
    scale_x_date(breaks = as.Date(str_c(1980:1990, "-01-01"))) +
  scale_color_brewer(palette = "RdBu") +
  theme_dark() +
  theme(legend.position = "none") +
  # facet_wrap(~variable, ncol = 1, scales = "free") +
  facet_grid(variable~., scales = "free") +
  labs(x = "date", y = "",
       title = "Mortality rate in Milan through time")

# ggplot(data = mort,
  #      aes(x = day.date, y = tot_mort_prob)) +
  # geom_point(aes(col = factor(year)),
  #            size = 1, alpha = 1) +
  # geom_smooth(method = "loess", span = .1) +
  # # scale_x_continuous(breaks = seq(0, 3650, by = 365)) +
  # scale_x_date(breaks = as.Date(str_c(1980:1990, "-01-01"))) +
  # scale_color_brewer(palette = "RdBu") +
  # labs(x = "date", y = "Mortality rate (per million people)",
  #      title = "Mortality rate in Milan through time") +
  # theme_dark() +
  # theme(legend.position = "none")
```

The following plot compares the seasonal mortality pattern across years. As we can see, the patterns are quite similar.

```{r, seasonal_mortTot_plotly, fig.width=10, fig.height=5, cache = T}
p1 <- mort %>% 
  filter(year<10) %>% 
  ggplot(aes(x = day.of.year, y = tot_mort_prob,
             col = factor(year))) +
  # geom_point(size = 1, alpha = .1) +
  geom_smooth(method = "loess", se = F) +
  scale_x_continuous(breaks = seq(0, 365, by = 90)) +
  scale_color_brewer(palette = "RdBu") +
  labs(x = "Day of year", y = "Mortality rate (per million people)",
       title = "Mortality rate in Milan through the year",
       color = "year") +
  theme_dark()

ggplotly(p1)
```


```{r, seasonal_respTot_plotly, fig.width=10, fig.height=5, cache = T}
p2 <- mort %>% 
  filter(year<10) %>% 
  ggplot(aes(x = day.of.year, y = resp_mort_prob,
             col = factor(year))) +
  # geom_point(size = 1, alpha = .1,) +
  geom_smooth(method = "loess", se = F) +
  scale_x_continuous(breaks = seq(0, 365, by = 90)) +
  scale_color_brewer(palette = "RdBu") +
  labs(x = "Day of year", y = "Respiratory mortality rate (per million people)",
       title = "Mortality rate for respiratory issues in Milan through the year",
       color = "year") +
  theme_dark()

ggplotly(p2)
```

<!-- Correct the tooltip -->


#### Explanatory variables

<!-- In the following plots, we can see the trend of the explanatory variables through time. -->
The following plots show the explanatory variables over time. As we can see, all of them depend strongly on the time of the year. This means that any strong correlation between these variables and the mortality could be a [spurious correlation](https://www.tylervigen.com/spurious-correlations) induced by other seasonal factors.
For example, during winter:

* people work more, so they are more stressed and accidents on the job are more frequent;
* daylight hours are shorter, and that has an impact on health;
* there is more traffic, so car accidents are more frequent.

To deal with this problem, I included the period of the year in the model through a spline on the day of the year.
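
As an illustration of why adjusting for the period of the year matters, here is a small simulation (not based on the Milan data). Two variables follow the same seasonal cycle without influencing each other. For simplicity the adjustment uses a harmonic basis on the day of the year rather than the spline used in the models, and all names are illustrative.

```r
set.seed(1)
doy    <- rep(1:365, times = 5)
season <- cos(2 * pi * doy / 365)   # common seasonal driver

# x and y both follow the season but do not influence each other
x <- season + rnorm(length(doy), sd = 0.3)
y <- season + rnorm(length(doy), sd = 0.3)

cor(x, y)   # large, but purely seasonal

# Regression of y on x, without and with a seasonal adjustment
fit_naive    <- lm(y ~ x)
fit_adjusted <- lm(y ~ x + sin(2 * pi * doy / 365) + cos(2 * pi * doy / 365))

coef(fit_naive)["x"]
coef(fit_adjusted)["x"]
```

Once the seasonal terms are in the model, the coefficient of `x` shrinks towards zero, because the association was entirely due to the shared seasonal pattern.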

Looking at the `mean.temp` trend, we can see that summer 1983 was hotter than the other summers. This could partially explain the higher mortality in that period.

```{r, fig.width=10, fig.height=10, cache = T}
mort %>% 
  dplyr::select(day.date, year, mean.temp, rel.humid, SO2_log, TSP) %>% 
  gather(key = "variable", value = "value", mean.temp, rel.humid, SO2_log, TSP) %>% 
  ggplot(aes(x = day.date, y = value)) +
  geom_point(aes(col = factor(year)),
             size = 1, alpha = 1) +
  geom_smooth(method = "loess", span = .1) +
    scale_x_date(breaks = as.Date(str_c(1980:1990, "-01-01"))) +
  scale_color_brewer(palette = "RdBu") +
  theme_dark() +
  theme(legend.position = "none") +
  # facet_wrap(~variable, ncol = 1, scales = "free") +
  facet_grid(variable~., scales = "free") +
  labs(x = "date", y = "")


# ggplot(data = mort,
  #      aes(x = day.date, y = mean.temp)) +
  # geom_point(aes(col = factor(year)),
  #            size = 1, alpha = 1) +
  # geom_smooth(method = "loess", span = .1) +
  # # scale_x_continuous(breaks = seq(0, 3650, by = 365)) +
  # scale_x_date(breaks = as.Date(str_c(1980:1990, "-01-01"))) +
  # scale_color_brewer(palette = "RdBu") +
  # # labs(x = "date", y = "Mortality rate (per million people)",
  # #      title = "Mortality rate in Milan through time") +
  # theme_dark() +
  # theme(legend.position = "none")
```


### Relationships between variables

The following plot shows the pairwise relationships between the variables. As we can see, many of them are not linear, so a GLM could perform poorly and a GAM could be more suitable. In particular, the panel for the mean temperature and the mortality rate (`mean.temp`, `tot_mort_prob`) shows that the mortality rate is higher when it is cold and decreases as the temperature rises, but it increases rapidly again at very high temperatures.

```{r, variables_scatterplot, fig.width=10, fig.height=10, cache = T}
gg <- mort %>% 
  dplyr::select(mean.temp, rel.humid, SO2_log, TSP, tot_mort_prob, resp_mort_prob) %>% 
  ggpairs(lower = list(continuous = wrap("points", alpha = 0.1, size=1)))

gg[6,6] <- ggplot(data = mort,
                  aes(x = resp.mort)) +
  geom_bar(color = "black", fill = "white")

for(i in 2:6){
  for(j in 1:(i-1)){
    gg[i,j] <- gg[i,j] +
      geom_smooth(method = "loess")
  }
}

gg
```


# Statistical Models

## Without considering time

<!-- The best predictive model for mortality rate without taking into account time I found is the following: -->
The best model among those tested for predicting mortality rate without taking time into account is the following:

$$
\begin{align}
\frac{Y_i}{pop_i} &= \beta_0 + s(temp_i, humid_i) + \beta_1 SO2^{log}_i + s(TSP_i) + \epsilon_i \\
\epsilon_i &\sim \mathcal{N}(0,\sigma^2)
\end{align}
$$
where:

* $i$ represents a specific day;
* $Y_i$ is the number of deaths in day $i$;
* $pop_i$ is the population in day $i$;
* $temp_i$ is the average temperature in day $i$;
* $humid_i$ is the relative humidity in day $i$;
* $SO2^{log}_i$ is the transformed sulphur dioxide level in ambient air in day $i$;
* $TSP_i$ is the total suspended particles in ambient air in day $i$.

I selected it by AIC. To speed up the process, I used classical (frequentist) GAMs rather than Bayesian models.
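
As a rough illustration of AIC-based comparison, the following sketch compares a linear fit and a spline fit on simulated data. It is not the selection procedure actually run for this project: it uses `lm` with a natural cubic spline from the `splines` package as a stand-in for `mgcv::gam`, and the data and names are made up.

```r
set.seed(2)
temp <- runif(500, min = -5, max = 30)
# Simulated U-shaped relation: higher response at cold and at hot temperatures
y <- 20 + 0.03 * (temp - 18)^2 + rnorm(500, sd = 2)

fit_linear <- lm(y ~ temp)
fit_spline <- lm(y ~ splines::ns(temp, df = 4))

# The model with the lower AIC is preferred
AIC(fit_linear, fit_spline)
```

With a clearly non-linear relation, the spline fit obtains the lower AIC despite using more parameters.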

The estimates are shown in the following chunk:

```{r, fit_noday, echo = T, fig.width=10, fig.height=5, cache = T}
fit_gam_noday <- gam(tot_mort_prob ~ s(mean.temp, rel.humid) + SO2_log + s(TSP),
                     family = gaussian,
                     data = mort)

summary(fit_gam_noday)

par(mar = c(4,4,4,3))
layout(matrix(c(1,1,2,3), ncol = 2))
plot(fit_gam_noday,
     shade = T, scheme = 2,
     residuals = T, all.terms = T)
```

As previously observed, we cannot tell whether the estimated effect of the explanatory variables on mortality reflects a causal relationship or a spurious correlation.


## Considering time

### Model for mortality rate

The best predictive model for mortality rate I found is the following:

$$
\begin{align}
\frac{Y_i}{pop_i} &= \beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} weekend_i + \epsilon_i \\
\epsilon_i &\sim \mathcal{N}(0,\sigma^2)
\end{align}
$$
where:

* $i$ represents a specific day;
* $Y_i$ is the number of deaths in day $i$;
* $pop_i$ is the population in day $i$;
* $temp_i$ is the average temperature in day $i$;
* $doy_i$ is the day of the year of the day $i$ (it approximately spans the calendar year from 1st January to 31st December; in the code it is computed as `day.num %% 365`, so it ranges from 0 to 364 and leap days are ignored);
* $year_{1i}, \dots, year_{Ji}$ are the dummy variables that represent the year (in our data the years are $1980, 1981, \dots, 1989$);
* $weekend_i$ is a variable that indicates whether $i$ is a weekend day (Saturday or Sunday).

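To estimate the parameters I used a Bayesian approach. The intended prior is non-informative, $\pi(\beta)\propto k$ with $k\in]0,+\infty[$. Note, however, that the `stan_gamm4` call below leaves the prior arguments at their defaults, which in rstanarm are weakly informative rather than flat (a flat prior on the coefficients is requested with `prior = NULL`).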

To perform the variable selection, I fitted several GAMs with a classical approach, because Bayesian estimation would have taken too much time and because, with non-informative priors, the estimates are quite similar.
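
The reason why the two approaches give similar estimates under a flat prior can be sketched as follows, writing $L(\beta; y)$ for the likelihood (a symbol introduced here only for illustration):

$$
\begin{align}
\pi(\beta \mid y) &\propto L(\beta; y)\,\pi(\beta) \\
&\propto L(\beta; y) \qquad \text{when } \pi(\beta) \propto k
\end{align}
$$

Under a flat prior the posterior is therefore proportional to the likelihood, so the posterior mode coincides with the maximum likelihood estimate. Posterior means agree with it only approximately, and the smoothing penalties of the spline terms are handled differently in the two frameworks, so small differences are to be expected.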

The estimates are shown in the following chunk:

```{r, weekend_definition, cache = T}
mort <- mort %>% 
  mutate(weekend = day.of.week %in% c(6,7)) %>% 
  select(day.num, day.of.year, year, day.of.week, weekend, holiday, everything())
```

```{r, bayes_best_fit, echo = T, message = F, cache = T}
SEED = 1234
CHAINS = 4
CORES = 4
ITER = 500

fit_gam_bayes_best <- stan_gamm4(tot_mort_prob ~ s(mean.temp) + s(day.of.year) + factor(year) + factor(weekend),
                                 family = gaussian, data = mort,
                                 chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)
```

```{r, bayes_best_coefPlot, fig.width=10, fig.height=6, message = F, cache = T}
p1 <- plot(fit_gam_bayes_best, pars = c(str_c("factor(year)", 1:9), "factor(weekend)TRUE"),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-2.2, 2.2)) #+
  # theme(plot.margin = margin(.5, .5, .5, 2, "cm"))

p2 <- plot(fit_gam_bayes_best, pars = str_c("s(mean.temp).", 1:9),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-100, 100))

p3 <- plot(fit_gam_bayes_best, pars = str_c("s(day.of.year).", 1:9),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-100, 100))

grid.arrange(p1, p2, p3,
             ncol = 1)
```

```{r, bayes_best_coefTab, message = F, cache = T}
# Table with estimated coefficients
coef_est <- as.matrix(fit_gam_bayes_best) %>% 
  apply(FUN = mean, MARGIN = 2)

coeff_int1 <- posterior_interval(fit_gam_bayes_best, prob = .90)
coeff_int2 <- posterior_interval(fit_gam_bayes_best, prob = .95)
coeff_int3 <- posterior_interval(fit_gam_bayes_best, prob = .99)
coeff_int4 <- posterior_interval(fit_gam_bayes_best, prob = .999)


coeff_int_tbb <- as_tibble(coeff_int1) %>%
  cbind(as_tibble(coeff_int2)) %>% 
  cbind(as_tibble(coeff_int3)) %>%
  cbind(as_tibble(coeff_int4)) %>%
  mutate(coeff = rownames(coeff_int1)) %>% 
  select(coeff,
         `0.05%`, `0.5%`, `2.5%`, `5%`,
         everything()) %>% 
  # `95%`, `97.5%`, `99.5%`) %>% 
  # select(coeff, everything()) %>% 
  mutate(signif_90 = as.logical((sign(`5%`)*sign(`95%`) + 1) / 2 ),
         signif_95 = as.logical((sign(`2.5%`)*sign(`97.5%`) + 1) / 2 ),
         signif_99 = as.logical((sign(`0.5%`)*sign(`99.5%`) + 1) / 2 ),
         signif_999 = as.logical((sign(`0.05%`)*sign(`99.95%`) + 1) / 2 )) %>% 
  mutate(mean = coef_est)


coeff_int_tbb %>% 
  mutate(signif = ifelse(signif_999, ".***",
                         ifelse(signif_99, ".**",
                                ifelse(signif_95, ".*",
                                       ifelse(signif_90, ".",
                                              ""))))) %>% 
  # select(-signif_90, -signif_95, -signif_99) %>% 
  select(coeff, `2.5%`, mean, `97.5%`, signif) %>% 
  knitr::kable(digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)
```
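
The logic behind the significance flags in the table above can be sketched on simulated draws. The snippet below does not use the fitted model: `draws` stands in for the matrix of posterior draws, and the two coefficient names are hypothetical.

```r
set.seed(3)
# Simulated posterior draws for two hypothetical coefficients
draws <- cbind(beta_far  = rnorm(1000, mean = 1, sd = 0.2),
               beta_near = rnorm(1000, mean = 0, sd = 0.2))

# Central 95% posterior interval for each coefficient
int_95 <- t(apply(draws, 2, quantile, probs = c(0.025, 0.975)))

# TRUE when both bounds share the same sign, i.e. the interval excludes zero
excludes_zero <- int_95[, 1] * int_95[, 2] > 0

data.frame(int_95, excludes_zero, check.names = FALSE)
```

A coefficient is flagged at a given level when the corresponding central interval lies entirely on one side of zero, which is what the `sign()` products in the chunk above test.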

The following plot shows the marginal effect of `day.of.year` and `mean.temp`.

```{r, bayes_best_nonLinear, fig.width=10, fig.height=3, cache = T}
plot_nonlinear(fit_gam_bayes_best) +
  theme_bw()
```

#### Considering interaction

I also investigated whether the model improves when the interaction between `day.of.year` and `mean.temp` is included in the spline component (i.e. using `s(mean.temp, day.of.year)` instead of `s(mean.temp) + s(day.of.year)`).

```{r, bayes_int_fit, echo = T, message = F, cache = T}
fit_gam_bayes_int <- stan_gamm4(tot_mort_prob ~ s(day.of.year, mean.temp) + factor(year) + factor(weekend),
                                family = gaussian, data = mort,
                                chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)
```

As we can see from the plots of the joint effect of `mean.temp` and `day.of.year`, the two models give quite similar fits.

```{r, bayes_comparisonPlot, fig.width=10, fig.height=6, warning = F, cache = T}
grid_doy <- 1:365
grid_temp <- seq(min(mort$mean.temp), max(mort$mean.temp), length.out = 100)

grid <- expand.grid(grid_doy, grid_temp)
names(grid) <- c("day.of.year", "mean.temp")

dat_grid <- as_tibble(grid)

dat_grid <- dat_grid %>%
  mutate(year = 0,
         weekend = FALSE)

dat_grid_fit_best <- predict(fit_gam_bayes_best, newdata = dat_grid)
dat_grid_fit_int <- predict(fit_gam_bayes_int, newdata = dat_grid)

gam_temp_doy <- gam(formula = mean.temp ~ s(day.of.year),
                    data = mort)

alpha = 4

dat_grid <- dat_grid %>%
  mutate(cloud_lw = predict(gam_temp_doy, data.frame(day.of.year)) - alpha*sqrt(gam_temp_doy$sig2),
         cloud_up = predict(gam_temp_doy, data.frame(day.of.year)) + alpha*sqrt(gam_temp_doy$sig2),
         cloud = (cloud_lw < mean.temp & mean.temp < cloud_up))

dat_grid_1 <- dat_grid %>%
  mutate(fit = ifelse(cloud,
                      dat_grid_fit_best,
                      NA))
dat_grid_2 <- dat_grid %>%
  mutate(fit = ifelse(cloud,
                      dat_grid_fit_int,
                      NA))

lim <- c(min(dat_grid_1$fit,
             dat_grid_2$fit,
             na.rm = T),
         max(dat_grid_1$fit,
             dat_grid_2$fit,
             na.rm = T))

p1 <- dat_grid_1 %>%
  ggplot(aes(x = day.of.year, y = mean.temp)) +
  geom_tile(aes(fill = fit)) +
  geom_contour(aes(z = fit),
               color = "white", alpha = 0.5) +
  scale_fill_distiller(palette="Spectral", na.value="white",
                       limits = lim) +
  geom_point(data = mort,
             aes(x = day.of.year, y = mean.temp),
             size = .5) +
  labs(title = "Model without interaction")

p2 <- dat_grid_2 %>%
  ggplot(aes(x = day.of.year, y = mean.temp)) +
  geom_tile(aes(fill = fit)) +
  geom_contour(aes(z = fit),
               color = "white", alpha = 0.5) +
  scale_fill_distiller(palette="Spectral", na.value="white",
                       limits = lim) +
  geom_point(data = mort,
             aes(x = day.of.year, y = mean.temp),
             size = .5) +
  labs(title = "Model with interaction")

grid.arrange(p1,p2,
             ncol = 1)
```

To choose between the two models, I used the Leave-One-Out Information Criterion (LOO-IC). According to it, the model without interaction performs better.
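
For reference, the quantities reported by `loo` can be summarised as follows, where $y_{-i}$ denotes the data without observation $i$ (notation introduced here for illustration):

$$
\begin{align}
\widehat{\mathrm{elpd}}_{\mathrm{loo}} &= \sum_{i=1}^{n} \log p(y_i \mid y_{-i}) \\
\mathrm{LOOIC} &= -2\,\widehat{\mathrm{elpd}}_{\mathrm{loo}}
\end{align}
$$

The leave-one-out predictive densities $p(y_i \mid y_{-i})$ are approximated from the posterior draws by Pareto smoothed importance sampling, without refitting the model $n$ times. A higher $\widehat{\mathrm{elpd}}_{\mathrm{loo}}$, or equivalently a lower LOOIC, indicates better expected predictive performance.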

```{r, bayes_comparisonLoo, echo = T, fig.width=10, fig.height=5, cache = T}
loo_best <- loo(fit_gam_bayes_best)
loo_int <- loo(fit_gam_bayes_int)

loo_best
loo_int

# loo_compare() supersedes the deprecated compare()
loo_compare(loo_best,
            loo_int)
```


#### Model checking

The following plots check whether the model assumptions are consistent with the data.

From the overlay plot and the statistics computed on the whole dataset, the model looks fine.

```{r, bayes_best_overlay, fig.width=10, fig.height=4, cache = T}
y_rep <- posterior_predict(fit_gam_bayes_best, draws = ITER/2)

ppc_dens_overlay(mort$tot_mort_prob, y_rep)
```

```{r, bayes_best_stats, fig.width=10, fig.height=4, message = F, cache = T}
# Mean
p1 <- ppc_stat(
  y = mort$tot_mort_prob,
  yrep = y_rep,
  stat = 'mean'
  )

# Standard deviation
p2 <- ppc_stat(
  y = mort$tot_mort_prob,
  yrep = y_rep,
  stat = 'sd'
  )

grid.arrange(p1, p2,
             nrow = 1)
```

Considering the statistics computed for each year, something unexpected appears. Looking at the plots for the standard deviation, we can see that in 1983 and 1986 the standard deviation in the data is much higher than expected, while in other years, such as 1982, 1987 and 1988, it is lower than expected. This means that the model does not properly describe the variance within years, and the homoscedasticity assumption does not hold.

```{r, bayes_best_stats_grouped_mean, fig.width=10, fig.height=4, message = F, cache = T}
ppc_stat_grouped(
  y = mort$tot_mort_prob[1:(nrow(mort)-3)],
  yrep = y_rep[, 1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  stat = 'mean',
  facet_args = list(nrow = 2, scales = "fixed")
)
```

```{r, bayes_best_stats_grouped_sd, fig.width=10, fig.height=4, message = F, cache = T}
ppc_stat_grouped(
  y = mort$tot_mort_prob[1:(nrow(mort)-3)],
  yrep = y_rep[, 1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  stat = 'sd',
  facet_args = list(nrow = 2, scales = "fixed")
)
```
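
What the grouped check compares can be sketched by hand on simulated data. This is not the fitted model: the replications below come from a plain homoscedastic normal distribution that stands in for `posterior_predict`, parameter uncertainty is ignored, and the group labels are made up.

```r
set.seed(4)
group <- rep(c("a", "b"), each = 200)
# "Observed" data: group b is more dispersed than a single-sd model allows
y <- c(rnorm(200, sd = 1), rnorm(200, sd = 3))

# Replicated datasets from a model with one common standard deviation
n_rep <- 500
y_rep <- matrix(rnorm(n_rep * length(y), sd = sd(y)), nrow = n_rep)

# Group-wise sd in the data and in each replication
sd_by_group <- function(v) sapply(split(v, group), sd)
sd_obs <- sd_by_group(y)
sd_rep <- t(apply(y_rep, 1, sd_by_group))

# Share of replications whose sd exceeds the observed one, by group
tail_prob <- colMeans(sd_rep > rep(sd_obs, each = n_rep))
tail_prob
```

Shares close to 0 or 1 mean that the observed standard deviation falls in the tail of the replicated ones, which is the pattern visible for 1983 and 1986 in the plots above.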

From the following plot we can see that in the years 1983 and 1986 there are some outliers with a particularly high mortality. In 1983 this phenomenon happened during summer, while in 1986 it happened during the first quarter.

```{r, bayes_best_intervals, fig.width=10, fig.height=5, cache = T}
ppc_intervals_grouped(
  y = mort$tot_mort_prob[1:(nrow(mort)-3)],
  yrep = y_rep[,1:(nrow(mort)-3)],
  x = mort$day.of.year[1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  facet_args = list(nrow = 2, scales = "fixed")
)
```


### Model for respiratory mortality rate

<!-- For modeling the number of deaths for respiratory issues, I used a Poisson distribution. -->
I used a Poisson distribution to model the number of deaths from respiratory causes.

The best predictive model for respiratory mortality rate I found is the following:

$$
\begin{align}
Y^{resp}_i &\sim \mathrm{Poisson}(\lambda_i) \\
\frac{\lambda_i}{pop_i} &= e^{\beta_0 + s(temp_i) + s(doy_i) + \beta_1 year_{1i} + \dots + \beta_J year_{Ji} + \beta_{J+1} SO2^{log}_i}
\end{align}
$$
where:

* $i$ represents a specific day;
* $Y^{resp}_i$ is the number of deaths from respiratory causes in day $i$;
* $pop_i$ is the population in day $i$;
* $temp_i$ is the average temperature in day $i$;
* $doy_i$ is the day of the year of the day $i$ (it approximately spans the calendar year from 1st January to 31st December; in the code it is computed as `day.num %% 365`, so it ranges from 0 to 364 and leap days are ignored);
* $year_{1i}, \dots, year_{Ji}$ are the dummy variables that represent the year (in our data the years are $1980, 1981, \dots, 1989$);
* $SO2^{log}_i$ is the transformed sulphur dioxide level in ambient air in day $i$.
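
Taking logarithms, the second equation reads $\log \lambda_i = \log pop_i + \beta_0 + \dots$, so the population enters the linear predictor as an offset with coefficient fixed at 1. The following sketch shows this on simulated data with a frequentist `glm` instead of `stan_gamm4`. The covariate, the exposure and the true coefficients are made up for illustration.

```r
set.seed(5)
n   <- 2000
pop <- runif(n, min = 1.3, max = 1.7)   # hypothetical exposure (millions)
x   <- rnorm(n)                         # hypothetical covariate

# Simulate from the model lambda / pop = exp(1 + 0.3 * x)
y <- rpois(n, lambda = pop * exp(1 + 0.3 * x))

# log(pop) enters the linear predictor with its coefficient fixed at 1
fit <- glm(y ~ x + offset(log(pop)), family = poisson)
coef(fit)
```

The estimated coefficients describe the rate per unit of exposure, not the raw count, which is why they can be compared with the true values 1 and 0.3 used in the simulation.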

The estimates are shown in the following chunk:

```{r, bayes_resp_best_fit, echo = T, message = F, cache = T}
fit_resp_gam_bayes_best <- stan_gamm4(resp.mort ~ offset(log(pop_m)) + s(mean.temp) + SO2_log + s(day.of.year) + factor(year),
                                      family = poisson, data = mort,
                                      chains = CHAINS, iter = ITER, seed = SEED, cores = CORES)
```

```{r, bayes_resp_best_coefPlot, fig.width=10, fig.height=6, message = F, cache = T}
p1 <- plot(fit_resp_gam_bayes_best, pars = c("SO2_log", str_c("factor(year)", 1:9)),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-.5, .5))

p2 <- plot(fit_resp_gam_bayes_best, pars = str_c("s(mean.temp).", 1:9),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-6, 6))

p3 <- plot(fit_resp_gam_bayes_best, pars = str_c("s(day.of.year).", 1:9),
           prob = c(.95,.99)) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(limits = c(-6, 6))

grid.arrange(p1, p2, p3,
             ncol = 1)
```

```{r, bayes_resp_best_coefTab, message = F, cache = T}
# Table with estimated coefficients
coef_est <- as.matrix(fit_resp_gam_bayes_best) %>% 
  apply(FUN = mean, MARGIN = 2)

coeff_int1 <- posterior_interval(fit_resp_gam_bayes_best, prob = .90)
coeff_int2 <- posterior_interval(fit_resp_gam_bayes_best, prob = .95)
coeff_int3 <- posterior_interval(fit_resp_gam_bayes_best, prob = .99)
coeff_int4 <- posterior_interval(fit_resp_gam_bayes_best, prob = .999)


coeff_int_tbb <- as_tibble(coeff_int1) %>%
  cbind(as_tibble(coeff_int2)) %>% 
  cbind(as_tibble(coeff_int3)) %>%
  cbind(as_tibble(coeff_int4)) %>%
  mutate(coeff = rownames(coeff_int1)) %>% 
  select(coeff,
         `0.05%`, `0.5%`, `2.5%`, `5%`,
         everything()) %>% 
  # `95%`, `97.5%`, `99.5%`) %>% 
  # select(coeff, everything()) %>% 
  mutate(signif_90 = as.logical((sign(`5%`)*sign(`95%`) + 1) / 2 ),
         signif_95 = as.logical((sign(`2.5%`)*sign(`97.5%`) + 1) / 2 ),
         signif_99 = as.logical((sign(`0.5%`)*sign(`99.5%`) + 1) / 2 ),
         signif_999 = as.logical((sign(`0.05%`)*sign(`99.95%`) + 1) / 2 )) %>% 
  mutate(mean = coef_est)


coeff_int_tbb %>% 
  mutate(signif = ifelse(signif_999, ".***",
                         ifelse(signif_99, ".**",
                                ifelse(signif_95, ".*",
                                       ifelse(signif_90, ".",
                                              ""))))) %>% 
  # select(-signif_90, -signif_95, -signif_99) %>% 
  select(coeff, `2.5%`, mean, `97.5%`, signif) %>% 
  knitr::kable(digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)
```

The following plot shows the marginal effect of `day.of.year` and `mean.temp`.

```{r, bayes_resp_best_nonLinear, fig.width=10, fig.height=3, cache = T}
plot_nonlinear(fit_resp_gam_bayes_best) +
  theme_bw()
```


#### Model checking

The following plots check whether the model assumptions are consistent with the data.

As for the previous model, from the overlay plot and the statistics computed on the whole dataset, the model looks fine.


```{r, bayes_resp_best_overlay, fig.width=10, fig.height=4, cache = T}
# Overlay plot
y_rep <- posterior_predict(fit_resp_gam_bayes_best, draws = ITER/2)

ppc_dens_overlay(mort$resp.mort, y_rep)
```

```{r, bayes_resp_best_stats, fig.width=10, fig.height=4, message = F, cache = T}
# Mean
p1 <- ppc_stat(
  y = mort$resp.mort,
  yrep = y_rep,
  stat = 'mean'
)

# Standard deviation
p2 <- ppc_stat(
  y = mort$resp.mort,
  yrep = y_rep,
  stat = 'sd'
)

grid.arrange(p1, p2,
             nrow = 1)
```

As in the previous model, the standard deviation of the response variable observed in 1983 and 1986 is higher than expected.

```{r, bayes_resp_best_stats_grouped_mean, fig.width=10, fig.height=4, message = F, cache = T}
ppc_stat_grouped(
  y = mort$resp.mort[1:(nrow(mort)-3)],
  yrep = y_rep[, 1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  stat = 'mean',
  facet_args = list(nrow = 2, scales = "fixed")
)
```

```{r, bayes_resp_best_stats_grouped_sd, fig.width=10, fig.height=4, message = F, cache = T}
ppc_stat_grouped(
  y = mort$resp.mort[1:(nrow(mort)-3)],
  yrep = y_rep[, 1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  stat = 'sd',
  facet_args = list(nrow = 2, scales = "fixed")
)
```


As we saw for mortality in general, the years 1983 and 1986 contain some outliers with particularly high respiratory mortality. In 1983 this happened during the summer, while in 1986 it happened during the first quarter.

```{r, bayes_resp_best_intervals, fig.width=10, fig.height=5, cache = T}
ppc_intervals_grouped(
  y = mort$resp.mort[1:(nrow(mort)-3)],
  yrep = y_rep[, 1:(nrow(mort)-3)],
  x = mort$day.of.year[1:(nrow(mort)-3)],
  group = mort$year[1:(nrow(mort)-3)],
  facet_args = list(nrow = 2, scales = "fixed")
)
```


## Model interpretations

The variables that have a significant effect on the response are:

* for mortality in general:
  - `mean.temp`
  - `day.of.year`
  - `year`
  - `weekend`
* for respiratory mortality:
  - `mean.temp`
  - `day.of.year`
  - `year`
  - `SO2`

This means that, even with time included as an explanatory variable, `mean.temp` has a significant effect. Looking at the marginal effect plots for both models (in the previous chapter), we can see that mortality is higher when it is very cold and when it is very hot. By contrast, `rel.humid` does not have a significant effect on mortality.

As for the pollution variables (`SO2` and `TSP`), `TSP` does not have a significant effect on mortality, while `SO2` has a significant effect only when we consider deaths from respiratory issues. Furthermore, the `SO2` coefficient is positive, so the data support the hypothesis that the concentration of sulphur dioxide has an impact on respiratory deaths. However, since respiratory issues account for just a small share of all deaths, the `SO2` effect is not significant for mortality in general.

The `weekend` variable has a significant effect on mortality in general, while it does not affect respiratory mortality. The coefficient of `weekend` is negative, so mortality is lower at weekends. A plausible explanation is that at weekends people work less and there is less traffic, so there are fewer accidents at work and on the road, and people are less stressed.
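
To read the size of such an effect, and assuming the models use a Poisson likelihood with a log link, a coefficient $\beta$ corresponds to a multiplicative factor $e^{\beta}$ on the expected number of deaths. The sketch below illustrates this on simulated data; the value `-0.05`, the intercept and all variable names are hypothetical and are not estimates from the models above.

```r
set.seed(42)
n <- 5000
weekend <- rbinom(n, 1, 2 / 7)   # hypothetical weekend indicator
beta_true <- -0.05               # made-up effect, for illustration only
y <- rpois(n, lambda = exp(3 + beta_true * weekend))

fit <- glm(y ~ weekend, family = poisson(link = "log"))

rate_ratio <- exp(coef(fit)[["weekend"]])   # multiplicative effect on the expected count
pct_change <- 100 * (rate_ratio - 1)        # the same effect as a percentage change
```

A negative coefficient gives a `rate_ratio` below 1, that is, a negative `pct_change`.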


# Possible improvements

A great improvement for the model would be obtained by **using new variables**, such as the cause of death and the age at death. With the cause of death we could build a model for each cause and then combine them into a more accurate model. With the age at death we could build models for age-specific mortality rates and take the age distribution into account.

To improve the estimates without collecting more data, we could model the effect of `day.of.year` not just as a spline on the values between 1 and 365, but as a **periodic function** with period 365. This would yield a function with $s(0) = s(365)$, which would be more reasonable for our problem.
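
One way to obtain such a periodic smooth is the cyclic cubic regression spline available in `mgcv` (`bs = "cc"`). The following is a minimal sketch on simulated counts, fitted with the frequentist `gam` for speed; the data, the basis dimension `k = 10` and the object names are assumptions made for illustration.

```r
library(mgcv)

set.seed(1)
doy <- rep(1:365, times = 3)   # three simulated years
y <- rpois(length(doy), exp(3 + 0.2 * cos(2 * pi * doy / 365)))   # made-up seasonal counts

# bs = "cc" requests a cyclic cubic regression spline; the two knots supplied
# are taken as the end points of the period, so day 0 and day 365 coincide.
fit_cc <- gam(y ~ s(doy, bs = "cc", k = 10),
              knots = list(doy = c(0, 365)),
              family = poisson)

ends <- predict(fit_cc, newdata = data.frame(doy = c(0, 365)))
gap <- unname(abs(ends[1] - ends[2]))   # difference between the two ends of the period
```

By construction `gap` is zero up to numerical error, which is the constraint $s(0) = s(365)$.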

To deal with the outliers we could also consider other kinds of models, such as the **Quasi-Poisson** and the **Negative Binomial**. I tried the Negative Binomial, but I ran into problems with non-convergence of the Markov chains.
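
Before switching family, it can help to quantify how much extra-Poisson variation there is. The sketch below uses simulated data and frequentist fits, not the Bayesian models of this project; the covariate `x`, the value `size = 2` and the object names are hypothetical. It computes the Pearson dispersion statistic of a Poisson fit and then fits the two alternative families.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)
# Simulated counts with extra-Poisson variation (size = 2 is a made-up value)
y <- rnbinom(n, mu = exp(1.5 + 0.3 * x), size = 2)

fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: close to 1 when the Poisson assumption holds
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Quasi-Poisson: same coefficient estimates, standard errors scaled by the dispersion
fit_qp <- glm(y ~ x, family = quasipoisson)

# Negative Binomial: estimates the extra parameter theta
# (MASS:: prefix avoids attaching MASS, which would mask dplyr::select)
fit_nb <- MASS::glm.nb(y ~ x)
```

A `dispersion` well above 1 suggests that a Poisson likelihood understates the variability of the counts.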

Another idea would be to model the effect of the year not just as an additive factor, but as another **spline function on the day number** (which goes from $1$ to $10\times 365$). The effect would then be: `s(day.of.year) + s(day.num)`. This approach would probably fit the data better, but it would reduce the interpretability of the model.
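
A minimal sketch of this specification follows. The counts are simulated, `day.num` is built here from a sequence of dates, and the fit uses the frequentist `mgcv::gam`; all of these are assumptions for illustration and not the Bayesian fit used in this project.

```r
library(mgcv)

set.seed(3)
dates <- seq(as.Date("1980-01-01"), as.Date("1989-12-31"), by = "day")
d <- data.frame(
  day.num     = seq_along(dates),                  # 1, 2, ..., number of days
  day.of.year = as.integer(format(dates, "%j"))    # 1..365 (366 in leap years)
)

# Made-up counts: a seasonal cycle plus a slow downward trend
d$y <- rpois(nrow(d),
             exp(3 + 0.2 * cos(2 * pi * d$day.of.year / 365) - 1e-4 * d$day.num))

# Seasonal smooth plus a long-term smooth replacing the additive year effect
fit_trend <- gam(y ~ s(day.of.year) + s(day.num),
                 family = poisson, data = d)
```

With the default basis dimension, `s(day.num)` is too smooth to reproduce the yearly cycle, so the two terms capture seasonality and long-term trend separately.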

Otherwise, keeping the additive year effect, we could build a **hierarchical model** in which `year` constitutes the deeper level and `day.of.year` the higher level. This model would be useful for predicting future years, because `year` would no longer be an explanatory variable with realizations only in $1, \dots, J$.
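
One possible reading of this idea is a random intercept for `year`, so that the year effects are treated as draws from a common distribution. The sketch below implements that reading with the `bs = "re"` smooth of `mgcv` on simulated data; the data, the standard deviation `0.1` and the object names are hypothetical, and a Bayesian version would need its own specification.

```r
library(mgcv)

set.seed(4)
J <- 10
year <- factor(rep(1980:1989, each = 365))
doy <- rep(1:365, times = J)
u <- rnorm(J, sd = 0.1)   # made-up year-level effects
y <- rpois(length(doy),
           exp(3 + 0.2 * cos(2 * pi * doy / 365) + u[as.integer(year)]))
d <- data.frame(y, doy, year)

# s(year, bs = "re") treats the year effects as draws from a common normal
# distribution (a random intercept) instead of J unrelated fixed effects.
fit_re <- gam(y ~ s(doy) + s(year, bs = "re"),
              family = poisson, data = d, method = "REML")

# Prediction for a year outside the data: exclude the year term, i.e. use the
# population-level mean of the year effects. The year value supplied here is
# only a placeholder required by predict().
new_day <- data.frame(doy = 180, year = factor("1980", levels = levels(d$year)))
pred_new <- predict(fit_re, newdata = new_day,
                    exclude = "s(year)", type = "response")
```

Here `pred_new` is the expected count on day 180 of a year whose specific effect is unknown.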

Finally, a different approach that could work well with these data is an **autoregressive model**, which could deal with the seasonality by including, for each day, the correlation between mortality on that day and mortality 365 days earlier.
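
A simple way to sketch this idea is to add the count observed 365 days earlier as a covariate of a Poisson regression. This is only one of several possible autoregressive formulations; the simulated data, the `log(lag + 1)` transformation and the object names are assumptions made for illustration.

```r
set.seed(5)
n_days <- 365 * 6
day <- seq_len(n_days)
y <- rpois(n_days, exp(3 + 0.3 * cos(2 * pi * day / 365)))   # made-up seasonal counts

# Count observed 365 days earlier (undefined for the first year)
lag_365 <- c(rep(NA, 365), y[seq_len(n_days - 365)])
d <- data.frame(y = y, lag_365 = lag_365)

# Observation-driven seasonal autoregression: the log of last year's count
# enters the linear predictor (the +1 guards against log(0)).
# Rows with a missing lag are dropped by glm().
fit_ar <- glm(y ~ log(lag_365 + 1), family = poisson, data = d)

# Sample autocorrelation at lag 365 (element 1 of $acf is lag 0)
seasonal_acf <- acf(y, lag.max = 365, plot = FALSE)$acf[366]
```

A positive `seasonal_acf` and a positive coefficient on the lagged term both indicate that mortality on a given day is related to mortality one year earlier.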