```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# God Spiked the Integers
> The most common and useful generalized linear models are models for counts. Counts
are non-negative integers--0, 1, 2, and so on. They are the basis of all mathematics, the first bits that children learn. But they are also intoxicatingly complicated to model--hence the apocryphal slogan that titles this chapter. The essential problem is this: When what we wish to predict is a count, the scale of the parameters is never the same as the scale of the outcome. A count golem, like a tide prediction engine, has a whirring machinery underneath that doesn't resemble the output. Keeping the tide engine in mind, you can master these models and use them responsibly.
>
> We will engineer complete examples of the two most common types of count model. **Binomial regression** is the name we'll use for a family of related procedures that all model a binary classification--alive/dead, accept/reject, left/right--for which the total of both categories is known. This is like the marble and globe tossing examples from [Chapter 2][Small Worlds and Large Worlds]. But now you get to incorporate predictor variables. **Poisson regression** is a GLM that models a count with an unknown maximum—number of elephants in Kenya, number of applications to a PhD program, number of significance tests in an issue of [*Psychological Science*](https://www.psychologicalscience.org/publications/psychological_science). As described in [Chapter 10][Big Entropy and the Generalized Linear Model], the Poisson model is a special case of binomial. At the end, the chapter describes some other count regressions. [@mcelreathStatisticalRethinkingBayesian2020, p. 323, **emphasis** in the original]
In this chapter, we focus on the two most common types of count models: the binomial and the Poisson.
## Binomial regression
The basic binomial model follows the form
$$y \sim \operatorname{Binomial}(n, p),$$
where $y$ is some count variable, $n$ is the number of trials, and $p$ is the probability a given trial was a 1, which is sometimes termed a *success*. When $n = 1$, then $y$ is a vector of 0's and 1's. Presuming the logit link[^4], which we just covered in [Chapter 10][Linking linear models to distributions.], models of this type are commonly termed logistic regression. When $n > 1$, and still presuming the logit link, we might call our model an aggregated logistic regression model, or more generally an aggregated binomial regression model.
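To keep the logit link concrete, here's a quick sketch of the round trip between the probability scale and the log-odds scale using base-**R**'s `qlogis()` (the logit function) and `plogis()` (its inverse). The 0.75 value is arbitrary, just for illustration.

```{r}
# logit: map a probability onto the log-odds scale
qlogis(0.75)          # log(0.75 / 0.25), about 1.1

# inverse logit: map log-odds back onto the probability scale
plogis(1.1)           # about 0.75

# the round trip recovers the original probability
plogis(qlogis(0.75))  # exactly 0.75
```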
### Logistic regression: Prosocial chimpanzees.
Load the @silkChimpanzeesAreIndifferent2005 `chimpanzees` data.
```{r, message = F, warning = F}
data(chimpanzees, package = "rethinking")
d <- chimpanzees
rm(chimpanzees)
```
The data include two experimental conditions, `prosoc_left` and `condition`, each of which has two levels. This results in four combinations.
```{r, message = F, warning = F}
library(tidyverse)
library(flextable)
d %>%
  distinct(prosoc_left, condition) %>%
  mutate(description = str_c("Two food items on ", c("right and no partner",
                                                     "left and no partner",
                                                     "right and partner present",
                                                     "left and partner present"))) %>%
  flextable() %>%
  width(width = c(1, 1, 4))
```
It would be conventional to include these two variables and their interaction using dummy variables. We're going to follow McElreath and use an index variable approach, instead. If you'd like to see what this would look like using the dummy variable approach, check out my [-@kurzStatisticalRethinkingBrms2020] [translation of the corresponding section](https://bookdown.org/content/3890/counting-and-classification.html#logistic-regression-prosocial-chimpanzees.) from McElreath's first [-@mcelreathStatisticalRethinkingBayesian2015] edition. For now, make the index, which we'll be saving as a factor.
```{r}
d <-
  d %>%
  mutate(treatment = factor(1 + prosoc_left + 2 * condition)) %>%
  # this will come in handy, later
  mutate(labels = factor(treatment,
                         levels = 1:4,
                         labels = c("r/n", "l/n", "r/p", "l/p")))
```
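For comparison, a conventional dummy-variable-and-interaction version of the upcoming models might use formula syntax like the following. This is only a sketch with hypothetical priors and a hypothetical fit name, `b_dummy`; we won't be fitting it.

```{r, eval = F}
# the dummy-variable approach, for comparison only (not fit in this chapter)
b_dummy <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1 + prosoc_left + condition + prosoc_left:condition,
      prior = c(prior(normal(0, 1.5), class = Intercept),
                prior(normal(0, 0.5), class = b)),
      seed = 11)
```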
We can use the `dplyr::count()` function to get a sense of the distribution of the conditions in the data.
```{r}
d %>%
  count(condition, treatment, prosoc_left)
```
Fire up **brms**.
```{r, message = F, warning = F}
library(brms)
```
We start with the simple intercept-only logistic regression model, which follows the statistical formula
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha \\
\alpha & \sim \operatorname{Normal}(0, w),
\end{align*}
where $w$ is the hyperparameter for the prior standard deviation, the value of which we have yet to choose. To start things off, we'll set $w = 10$, fit a model in which we set `sample_prior = TRUE`, and get a sense of the prior with a plot.
In the `brm()` `formula` syntax, including a `|` bar on the left side of a formula indicates we have extra supplementary information about our criterion. In this case, that information is that each `pulled_left` value corresponds to a single trial (i.e., `trials(1)`), which itself corresponds to the $n = 1$ portion of the statistical formula, above.
```{r b11.1}
b11.1 <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1,
      prior(normal(0, 10), class = Intercept),
      seed = 11,
      sample_prior = T,
      file = "fits/b11.01")
```
Before we go any further, let's discuss the plot theme. For this chapter, we'll take our color scheme from the `"Moonrise2"` palette from the [**wesanderson** package](https://cran.r-project.org/package=wesanderson) [@R-wesanderson].
```{r, message = F, fig.width = 3, fig.height = 1}
library(wesanderson)
wes_palette("Moonrise2")
wes_palette("Moonrise2")[1:4]
```
We'll also take a few formatting cues from Edward Tufte [-@tufteVisualDisplayQuantitative2001], courtesy of the [**ggthemes package**](https://cran.r-project.org/package=ggthemes). The `theme_tufte()` function will change the default font and remove some chart junk. The `theme_set()` function, below, will make these adjustments the default for all subsequent **ggplot2** plots. To undo this, just execute `theme_set(theme_default())`.
```{r, message = F, warning = F}
library(ggthemes)
theme_set(
  theme_default() +
    theme_tufte() +
    theme(plot.background = element_rect(fill = wes_palette("Moonrise2")[3],
                                         color = wes_palette("Moonrise2")[3]))
)
```
Now we're ready to plot. We'll extract the prior draws with `prior_draws()`, convert them from the log-odds metric to the probability metric with the `brms::inv_logit_scaled()` function, and adjust the bandwidth of the density plot with the `adjust` argument within `geom_density()`.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
prior_draws(b11.1) %>%
  mutate(p = inv_logit_scaled(Intercept)) %>%
  ggplot(aes(x = p)) +
  geom_density(fill = wes_palette("Moonrise2")[4],
               size = 0, adjust = 0.1) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab("prior prob pull left")
```
At this point in the analysis, we were only able to make part of the left panel of McElreath's Figure 11.3. We'll add to it in a bit. Now update the model so that $w = 1.5$.
```{r b11.1b}
b11.1b <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1,
      prior(normal(0, 1.5), class = Intercept),
      seed = 11,
      sample_prior = T,
      file = "fits/b11.01b")
```
Now we can make the full version of the left panel of Figure 11.3.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
# wrangle
bind_rows(prior_draws(b11.1),
          prior_draws(b11.1b)) %>%
  mutate(p = inv_logit_scaled(Intercept),
         w = factor(rep(c(10, 1.5), each = n() / 2),
                    levels = c(10, 1.5))) %>%

  # plot
  ggplot(aes(x = p, fill = w)) +
  geom_density(size = 0, alpha = 3/4, adjust = 0.1) +
  scale_fill_manual(expression(italic(w)), values = wes_palette("Moonrise2")[c(4, 1)]) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(title = expression(alpha%~%Normal(0*", "*italic(w))),
       x = "prior prob pull left")
```
If we'd like to fit a model that includes an overall intercept and uses McElreath's index variable approach for the predictor variable `treatment`, we'll have to switch to the **brms** non-linear syntax. Here it is for the models using $w = 10$ and then $w = 0.5$.
```{r b11.2}
# w = 10
b11.2 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 10), nlpar = b, coef = treatment1),
                prior(normal(0, 10), nlpar = b, coef = treatment2),
                prior(normal(0, 10), nlpar = b, coef = treatment3),
                prior(normal(0, 10), nlpar = b, coef = treatment4)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      sample_prior = T,
      file = "fits/b11.02")

# w = 0.5
b11.3 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b, coef = treatment1),
                prior(normal(0, 0.5), nlpar = b, coef = treatment2),
                prior(normal(0, 0.5), nlpar = b, coef = treatment3),
                prior(normal(0, 0.5), nlpar = b, coef = treatment4)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      sample_prior = T,
      file = "fits/b11.03")
```
If all you want to do is fit the models, you wouldn't have to add a separate `prior()` statement for each level of `treatment`. You could have just included a single line, `prior(normal(0, 0.5), nlpar = b)`, that did not include a `coef` argument. The problem with this approach is we'd only get one column for `treatment` when using the `prior_draws()` function to retrieve the prior samples. To get separate columns for the prior samples of each of the levels of `treatment`, you need to take the verbose approach, above.
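For the record, here's a sketch of that more compact specification. The fit object name `b11.3_compact` is just a hypothetical; we won't fit or use it.

```{r, eval = F}
# one prior() line covers all four levels of treatment,
# but prior_draws() would then return a single prior column for b
b11.3_compact <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      sample_prior = T,
      seed = 11)
```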
Anyway, here's how to make a version of the right panel of Figure 11.3.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
# wrangle
prior <-
  bind_rows(prior_draws(b11.2),
            prior_draws(b11.3)) %>%
  mutate(w  = factor(rep(c(10, 0.5), each = n() / 2),
                     levels = c(10, 0.5)),
         p1 = inv_logit_scaled(b_a + b_b_treatment1),
         p2 = inv_logit_scaled(b_a + b_b_treatment2)) %>%
  mutate(diff = abs(p1 - p2))

# plot
prior %>%
  ggplot(aes(x = diff, fill = w)) +
  geom_density(size = 0, alpha = 3/4, adjust = 0.1) +
  scale_fill_manual(expression(italic(w)), values = wes_palette("Moonrise2")[c(4, 2)]) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(title = expression(alpha%~%Normal(0*", "*italic(w))),
       x = "prior diff between treatments")
```
Here are the averages of the two prior-predictive difference distributions.
```{r, message = F}
prior %>%
  group_by(w) %>%
  summarise(mean = mean(diff))
```
Before we move on to fit the full model, it might be useful to linger here and examine the nature of the model we just fit. Here's the parameter summary for `b11.3`.
```{r}
print(b11.3)
```
Now focus on the likelihood portion of the model formula,
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha + \beta_\text{treatment} .
\end{align*}
When you have one overall intercept $\alpha$ and then use the non-linear approach for the `treatment` index, you end up with as many $\beta$ parameters as there are levels for `treatment`. This means the formula for `treatment == 1` is $\alpha + \beta_{\text{treatment}[1]}$, the formula for `treatment == 2` is $\alpha + \beta_{\text{treatment}[2]}$, and so on. This also effectively makes $\alpha$ the grand mean. Here's the empirical grand mean.
```{r}
d %>%
  summarise(grand_mean = mean(pulled_left))
```
Now here’s the summary of $\alpha$ after transforming it back into the probability metric with the `inv_logit_scaled()` function.
```{r, warning = F, message = F}
library(tidybayes)

as_draws_df(b11.3) %>%
  transmute(alpha = inv_logit_scaled(b_a_Intercept)) %>%
  mean_qi()
```
Here are the empirical probabilities for each of the four levels of `treatment`.
```{r, message = F}
d %>%
  group_by(treatment) %>%
  summarise(mean = mean(pulled_left))
```
Here are the corresponding posteriors.
```{r, warning = F}
as_draws_df(b11.3) %>%
  pivot_longer(b_b_treatment1:b_b_treatment4) %>%
  mutate(treatment = str_remove(name, "b_b_treatment"),
         mean      = inv_logit_scaled(b_a_Intercept + value)) %>%
  group_by(treatment) %>%
  mean_qi(mean)
```
Okay, let's get back on track with the text. Now we're ready to fit the full model, which follows the form
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\color{#54635e}{\text{actor}}[i]} + \beta_{\color{#a4692f}{\text{treatment}}[i]} \\
\alpha_{\color{#54635e}j} & \sim \operatorname{Normal}(0, 1.5) \\
\beta_{\color{#a4692f}k} & \sim \operatorname{Normal}(0, 0.5).
\end{align*}
Before fitting the model, we should save `actor` as a factor.
```{r}
d <-
  d %>%
  mutate(actor = factor(actor))
```
Now fit the model.
```{r b11.4}
b11.4 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.04")
```
Inspect the parameter summary.
```{r}
print(b11.4)
```
Here's how we might make our version of McElreath's coefficient plot of the $\alpha$ parameters.
```{r, fig.width = 4.5, fig.height = 1.33, warning = F}
library(tidybayes)

post <- as_draws_df(b11.4)

post %>%
  pivot_longer(contains("actor")) %>%
  mutate(probability = inv_logit_scaled(value),
         actor       = factor(str_remove(name, "b_a_actor"),
                              levels = 7:1)) %>%

  ggplot(aes(x = probability, y = actor)) +
  geom_vline(xintercept = .5, color = wes_palette("Moonrise2")[1], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(expression(alpha[actor]), limits = 0:1) +
  ylab(NULL) +
  theme(axis.ticks.y = element_blank())
```
Here's the corresponding coefficient plot of the $\beta$ parameters.
```{r, fig.width = 3.5, fig.height = 1, warning = F}
tx <- c("R/N", "L/N", "R/P", "L/P")

post %>%
  select(contains("treatment")) %>%
  set_names("R/N","L/N","R/P","L/P") %>%
  pivot_longer(everything()) %>%
  mutate(probability = inv_logit_scaled(value),
         treatment   = factor(name, levels = tx)) %>%
  mutate(treatment = fct_rev(treatment)) %>%

  ggplot(aes(x = value, y = treatment)) +
  geom_vline(xintercept = 0, color = wes_palette("Moonrise2")[2], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  labs(x = expression(beta[treatment]),
       y = NULL) +
  theme(axis.ticks.y = element_blank())
```
Now make the coefficient plot for the primary contrasts of interest.
```{r, fig.width = 3, fig.height = .8, warning = F}
post %>%
  mutate(db13 = b_b_treatment1 - b_b_treatment3,
         db24 = b_b_treatment2 - b_b_treatment4) %>%
  pivot_longer(db13:db24) %>%
  mutate(diffs = factor(name, levels = c("db24", "db13"))) %>%

  ggplot(aes(x = value, y = diffs)) +
  geom_vline(xintercept = 0, color = wes_palette("Moonrise2")[2], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  labs(x = "difference",
       y = NULL) +
  theme(axis.ticks.y = element_blank())
```
"These are the contrasts between the no-partner/partner treatments" (p. 331). Next, we prepare for the posterior predictive check. McElreath showed how to compute empirical proportions by the levels of `actor` and `treatment` with the `by()` function. Our approach will be with a combination of `group_by()` and `summarise()`. Here's what that looks like for `actor == 1`.
```{r, message = F}
d %>%
  group_by(actor, treatment) %>%
  summarise(proportion = mean(pulled_left)) %>%
  filter(actor == 1)
```
Now we'll follow that through to make the top panel of Figure 11.4. Instead of showing the plot, we'll save it for the next code block.
```{r, message = F}
p1 <-
  d %>%
  group_by(actor, treatment) %>%
  summarise(proportion = mean(pulled_left)) %>%
  left_join(d %>% distinct(actor, treatment, labels, condition, prosoc_left),
            by = c("actor", "treatment")) %>%
  mutate(condition = factor(condition)) %>%

  ggplot(aes(x = labels, y = proportion)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = prosoc_left),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_point(aes(color = condition),
             size = 2.5, show.legend = F) +
  labs(subtitle = "observed proportions")
```
Next we use `fitted()` to get the posterior predictive distributions for each unique combination of `actor` and `treatment`, wrangle, and plot. First, we save the plot as `p2` and then we use **patchwork** syntax to combine the two subplots.
```{r, fig.width = 7, fig.height = 4.5}
nd <-
  d %>%
  distinct(actor, treatment, labels, condition, prosoc_left)

p2 <-
  fitted(b11.4,
         newdata = nd) %>%
  data.frame() %>%
  bind_cols(nd) %>%
  mutate(condition = factor(condition)) %>%

  ggplot(aes(x = labels, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = prosoc_left),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_pointrange(aes(color = condition),
                  fatten = 2.5, show.legend = F) +
  labs(subtitle = "posterior predictions")

# combine the two ggplots
library(patchwork)

(p1 / p2) &
  scale_color_manual(values = wes_palette("Moonrise2")[c(2:1)]) &
  scale_y_continuous("proportion left lever",
                     breaks = c(0, .5, 1), limits = c(0, 1)) &
  xlab(NULL) &
  theme(axis.ticks.x = element_blank(),
        panel.background = element_rect(fill = alpha("white", 1/10), size = 0)) &
  facet_wrap(~ actor, nrow = 1, labeller = label_both)
```
Let's make two more index variables.
```{r}
d <-
  d %>%
  mutate(side = factor(prosoc_left + 1),  # right 1, left 2
         cond = factor(condition + 1))    # no partner 1, partner 2
```
Now fit the model without the interaction between `prosoc_left` and `condition`.
```{r b11.5}
b11.5 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + bs + bc,
         a ~ 0 + actor,
         bs ~ 0 + side,
         bc ~ 0 + cond,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = bs),
                prior(normal(0, 0.5), nlpar = bc)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.05")
```
Compare `b11.4` and `b11.5` by the PSIS-LOO and the WAIC.
```{r, warning = F, message = F}
b11.4 <- add_criterion(b11.4, c("loo", "waic"))
b11.5 <- add_criterion(b11.5, c("loo", "waic"))
loo_compare(b11.4, b11.5, criterion = "loo") %>% print(simplify = F)
loo_compare(b11.4, b11.5, criterion = "waic") %>% print(simplify = F)
```
Here are the weights.
```{r}
model_weights(b11.4, b11.5, weights = "loo") %>% round(digits = 2)
model_weights(b11.4, b11.5, weights = "waic") %>% round(digits = 2)
```
Here's a quick check of the parameter summary for the non-interaction model, `b11.5`.
```{r}
print(b11.5)
```
Because it's good practice, here's the `b11.5` version of the bottom panel of Figure 11.4.
```{r, fig.width = 7, fig.height = 2.2}
nd <-
  d %>%
  distinct(actor, treatment, labels, cond, side)

fitted(b11.5,
       newdata = nd) %>%
  data.frame() %>%
  bind_cols(nd) %>%

  ggplot(aes(x = labels, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = side),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_pointrange(aes(color = cond),
                  fatten = 2.5, show.legend = F) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(2:1)]) +
  scale_y_continuous("proportion left lever",
                     breaks = c(0, .5, 1), limits = c(0, 1)) +
  labs(subtitle = "posterior predictions for b11.5",
       x = NULL) +
  theme(axis.ticks.x = element_blank(),
        panel.background = element_rect(fill = alpha("white", 1/10), size = 0)) +
  facet_wrap(~ actor, nrow = 1, labeller = label_both)
```
#### Overthinking: Adding log-probability calculations to a Stan model.
For retrieving log-probability summaries, our approach with **brms** is a little different than the one you might take with McElreath's **rethinking**. Rather than adding a `log_lik=TRUE` argument within `rethinking::ulam()`, we just use the `log_lik()` function after fitting a **brms** model. You may recall we already practiced this way back in [Section 7.2.4.1][Overthinking: Computing the lppd.]. Here's a quick example of what that looks like for `b11.5`.
```{r}
log_lik(b11.5) %>% str()
```
### Relative shark and absolute deer.
Based on the full model, `b11.4`, here's how you might compute the posterior mean and 95% intervals for the proportional odds of switching from `treatment == 2` to `treatment == 4`.
```{r}
as_draws_df(b11.4) %>%
  mutate(proportional_odds = exp(b_b_treatment4 - b_b_treatment2)) %>%
  mean_qi(proportional_odds)
```
> On average, the switch multiplies the odds of pulling the left lever by 0.92, an 8% reduction in odds. This is what is meant by proportional odds. The new odds are calculated by taking the old odds and multiplying them by the proportional odds, which is 0.92 in this example. (p. 336)
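To see how the same proportional odds play out on the absolute scale, here's a little sketch. The three baseline probabilities are made up for illustration; note how multiplying the odds by 0.92 moves a mid-range probability much more than an extreme one.

```{r}
# hypothetical baseline probabilities
tibble(p_old = c(0.01, 0.5, 0.99)) %>%
  # convert to odds, apply the proportional odds, and convert back
  mutate(odds_old = p_old / (1 - p_old)) %>%
  mutate(odds_new = 0.92 * odds_old) %>%
  mutate(p_new = odds_new / (1 + odds_new))
```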
A limitation of relative measures like proportional odds is that they ignore what you might think of as the reference or the baseline.
> Consider for example a rare disease which occurs in 1 per 10-million people. Suppose also that reading this textbook increased the odds of the disease 5-fold. That would mean approximately 4 more cases of the disease per 10-million people. So only 5-in-10-million chance now. The book is safe. (p. 336)
Here that is in code.
```{r}
tibble(disease_rate  = 1/1e7,
       fold_increase = 5) %>%
  mutate(new_disease_rate = disease_rate * fold_increase)
```
The hard part, though, is that "neither absolute nor relative risk is sufficient for all purposes" (p. 337). Each provides its own unique perspective on the data. Again, welcome to applied statistics. `r emo::ji("man_shrugging")`
### Aggregated binomial: Chimpanzees again, condensed.
With the **tidyverse**, we can use `group_by()` and `summarise()` to achieve what McElreath did with `aggregate()`.
```{r, message = F}
d_aggregated <-
  d %>%
  group_by(treatment, actor, side, cond) %>%
  summarise(left_pulls = sum(pulled_left)) %>%
  ungroup()

d_aggregated %>%
  head(n = 8)
```
To fit an aggregated binomial model with **brms**, we use the `<criterion> | trials()` syntax, where the value that goes in `trials()` is either a fixed number, as in this case, or a variable in the data indexing $n$. Either way, at least some of those trials will have an $n > 1$. Here we'll use the hard-code method, just like McElreath did in the text; we'll sketch the variable-based alternative right after the fit.
```{r b11.6}
b11.6 <-
  brm(data = d_aggregated,
      family = binomial,
      bf(left_pulls | trials(18) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.06")
```
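As promised, here's a sketch of the variable-based alternative, where `trials()` points to a column in the data rather than a fixed number. The `n_trials` column and the `b11.6_variable` name are hypothetical, and we won't fit this version.

```{r, eval = F}
# n_trials is a hypothetical column recording the number of trials in each row
b11.6_variable <-
  brm(data = d_aggregated %>% mutate(n_trials = 18),
      family = binomial,
      bf(left_pulls | trials(n_trials) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      seed = 11)
```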
Check the posterior summary for `b11.6`.
```{r}
print(b11.6)
```
It might be easiest to compare `b11.4` and `b11.6` with a coefficient plot.
```{r, fig.width = 7, fig.height = 2.5, warning = F}
# this is just for fancy annotation
text <-
  tibble(value = c(1.4, 2.6),
         name  = "b_a_actor7",
         fit   = c("b11.6", "b11.4"))

# rope in the posterior draws and wrangle
bind_rows(as_draws_df(b11.4),
          as_draws_df(b11.6)) %>%
  mutate(fit = rep(c("b11.4", "b11.6"), each = n() / 2)) %>%
  pivot_longer(b_a_actor1:b_b_treatment4) %>%

  # plot
  ggplot(aes(x = value, y = name, color = fit)) +
  stat_pointinterval(.width = .95, size = 2/3,
                     position = position_dodge(width = 0.5)) +
  scale_color_manual(values = wes_palette("Moonrise2")[2:1]) +
  geom_text(data = text,
            aes(label = fit),
            family = "Times", position = position_dodge(width = 2.25)) +
  labs(x = "posterior (log-odds scale)",
       y = NULL) +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none")
```
Did you catch our `position = position_dodge()` tricks? Try executing the plot without those parts of the code to get a sense of what they did. Now compute and save the PSIS-LOO estimates for the two models so we might compare them.
```{r, warning = F, message = F}
b11.4 <- add_criterion(b11.4, "loo")
b11.6 <- add_criterion(b11.6, "loo")
```
Here's how we might attempt the comparison.
```{r, eval = F}
loo_compare(b11.4, b11.6, criterion = "loo") %>% print(simplify = F)
```
Unlike with McElreath's `compare()` code in the text, `loo_compare()` wouldn't even give us the results. All we get is the warning message informing us that because these two models are not based on the same data, comparing them with the LOO is invalid and **brms** refuses to let us do it. We can, however, look at their LOO summaries separately.
```{r}
loo(b11.4)
loo(b11.6)
```
To understand what's going on, consider how you might describe six 1's out of nine trials in the aggregated form,
$$\text{Pr}(6|9, p) = \frac{9!}{6!(9 - 6)!} p^6 (1 - p)^{9 - 6}.$$
If we still stick with the same data, but this time re-express those as nine dichotomous data points, we now describe their joint probability as
$$\text{Pr}(1, 1, 1, 1, 1, 1, 0, 0, 0 | p) = p^6 (1 - p)^{9 - 6}.$$
Let's work this out in code.
```{r}
# deviance of aggregated 6-in-9
-2 * dbinom(6, size = 9, prob = 0.2, log = TRUE)
# deviance of dis-aggregated
-2 * sum(dbinom(c(1, 1, 1, 1, 1, 1, 0, 0, 0), size = 1, prob = 0.2, log = TRUE))
```
> But this difference is entirely meaningless. It is just a side effect of how we organized the data. The posterior distribution for the probability of success on each trial will end up the same, either way. (p. 339)
This is what our coefficient plot showed us, above. The posterior distribution was the same within simulation variance for `b11.4` and `b11.6`. Just like McElreath reported in the text, we also got a warning about high Pareto $k$ values from the aggregated binomial model, `b11.6`. To access the message and its associated table directly, we can feed the results of `loo()` into the `loo::pareto_k_table()` function.
```{r}
loo(b11.6) %>%
  loo::pareto_k_table()
```
> Before looking at the Pareto $k$ values, you might have noticed already that we didn't get a similar warning before in the disaggregated logistic models of the same data. Why not? Because when we aggregated the data by actor-treatment, we forced PSIS (and WAIC) to imagine cross-validation that leaves out all 18 observations in each actor-treatment combination. So instead of leave-one-out cross-validation, it is more like leave-eighteen-out. This makes some observations more influential, because they are really now 18 observations.
>
> What's the bottom line? If you want to calculate WAIC or PSIS, you should use a logistic regression data format, not an aggregated format. Otherwise you are implicitly assuming that only large chunks of the data are separable. (p. 340)
### Aggregated binomial: Graduate school admissions.
Load the infamous `UCBadmit` data [see @bickelSexBiasGraduate1975].
```{r, message = F, warning = F}
data(UCBadmit, package = "rethinking")
d <- UCBadmit
rm(UCBadmit)
d
```
Now compute our new index variable, `gid`. We'll also slip in a `case` variable that saves the row numbers as a factor. That'll come in handy later when we plot.
```{r}
d <-
  d %>%
  mutate(gid  = factor(applicant.gender, levels = c("male", "female")),
         case = factor(1:n()))
```
Note the difference in how we defined our `gid`. Whereas McElreath used numeral indices, we retained the text within an ordered factor. **brms** can handle either approach just fine. The advantage of the factor approach is that it will be easier to understand the output. You'll see in just a bit.
The univariable logistic model with `gid` as the sole predictor of `admit` follows the form
\begin{align*}
\text{admit}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\text{logit}(p_i) & = \alpha_{\text{gid}[i]} \\
\alpha_j & \sim \operatorname{Normal}(0, 1.5),
\end{align*}
where $n_i = \text{applications}_i$, the rows are indexed by $i$, and the two levels of $\text{gid}$ are indexed by $j$. Since we're only using our index variable `gid` to model two intercepts with no further complications, we don't need to use the verbose non-linear syntax to fit this model with **brms**.
```{r b11.7}
b11.7 <-
  brm(data = d,
      family = binomial,
      admit | trials(applications) ~ 0 + gid,
      prior(normal(0, 1.5), class = b),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.07")
```
```{r}
print(b11.7)
```
Our results are very similar to those in the text. But notice how our two rows have more informative row names than `a[1]` and `a[2]`. This is why you might consider using the ordered factor approach rather than using numeral indices.
Anyway, here we'll compute the difference score in two metrics and summarize them with a little help from `mean_qi()`.
```{r, warning = F, message = F}
as_draws_df(b11.7) %>%
  mutate(diff_a = b_gidmale - b_gidfemale,
         diff_p = inv_logit_scaled(b_gidmale) - inv_logit_scaled(b_gidfemale)) %>%
  pivot_longer(contains("diff")) %>%
  group_by(name) %>%
  mean_qi(value, .width = .89)
```
**brms** doesn't have a convenience function that works quite like `rethinking::postcheck()`. But we have options, the most handy of which in this case is probably `predict()`.
```{r, fig.height = 2.75, fig.width = 4.5, message = F}
p <-
  predict(b11.7) %>%
  data.frame() %>%
  bind_cols(d)

text <-
  d %>%
  group_by(dept) %>%
  summarise(case  = mean(as.numeric(case)),
            admit = mean(admit / applications) + .05)

p %>%
  ggplot(aes(x = case, y = admit / applications)) +
  geom_pointrange(aes(y = Estimate / applications,
                      ymin = Q2.5 / applications,
                      ymax = Q97.5 / applications),
                  color = wes_palette("Moonrise2")[1],
                  shape = 1, alpha = 1/3) +
  geom_point(color = wes_palette("Moonrise2")[2]) +
  geom_line(aes(group = dept),
            color = wes_palette("Moonrise2")[2]) +
  geom_text(data = text,
            aes(y = admit, label = dept),
            color = wes_palette("Moonrise2")[2],
            family = "serif") +
  scale_y_continuous("Proportion admitted", limits = 0:1) +
  ggtitle("Posterior validation check") +
  theme(axis.ticks.x = element_blank())
```
> Sometimes a fit this bad is the result of a coding mistake. In this case, it is not. The model did correctly answer the question we asked of it: *What are the average probabilities of admission for women and men, across all departments?* The problem in this case is that men and women did not apply to the same departments, and departments vary in their rates of admission. This makes the answer misleading....
>
> Instead of asking *"What are the average probabilities of admission for women and men across all departments?"* we want to ask *"What is the average difference in probability of admission between women and men within departments?"* (pp. 342--343, *emphasis* in the original).
The model better suited to answer that question follows the form
\begin{align*}
\text{admit}_i & \sim \operatorname{Binomial} (n_i, p_i) \\
\text{logit}(p_i) & = \alpha_{\text{gid}[i]} + \delta_{\text{dept}[i]} \\
\alpha_j & \sim \operatorname{Normal} (0, 1.5) \\
\delta_k & \sim \operatorname{Normal} (0, 1.5),
\end{align*}
where departments are indexed by $k$. To fit a model including two index variables like this in **brms**, we'll need to switch back to the non-linear syntax. Though if you'd like to see an analogous approach using conventional **brms** syntax, check out model `b10.9` in [Section 10.1.3](https://bookdown.org/content/3890/counting-and-classification.html#aggregated-binomial-graduate-school-admissions.) of my translation of McElreath's first edition.
```{r b11.8}
b11.8 <-
  brm(data = d,
      family = binomial,
      bf(admit | trials(applications) ~ a + d,
         a ~ 0 + gid,
         d ~ 0 + dept,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 1.5), nlpar = d)),
      iter = 4000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.08")
```
```{r}
print(b11.8)
```
Like with the earlier model, here we compute the difference score for $\alpha$ in two metrics.
```{r, warning = F}
as_draws_df(b11.8) %>%
  mutate(diff_a = b_a_gidmale - b_a_gidfemale,
         diff_p = inv_logit_scaled(b_a_gidmale) - inv_logit_scaled(b_a_gidfemale)) %>%
  pivot_longer(contains("diff")) %>%
  group_by(name) %>%
  mean_qi(value, .width = .89)
```
> Why did adding departments to the model change the inference about gender so much? The earlier figure gives you a hint--the rates of admission vary a lot across departments. Furthermore, women and men applied to different departments. Let's do a quick tabulation to show that: (p. 344)
Here's our **tidyverse**-style tabulation of the proportions of applicants in each department by `gid`.
```{r, warning = F, message = F}
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  select(dept, gid, proportion) %>%
  pivot_wider(names_from = dept,
              values_from = proportion) %>%
  mutate_if(is.double, round, digits = 2)
```
To make it even easier to see, we'll depict it in a tile plot.
```{r, fig.width = 3, fig.height = 1}
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  mutate(label = round(proportion, digits = 2),
         gid   = fct_rev(gid)) %>%

  ggplot(aes(x = dept, y = gid, fill = proportion, label = label)) +
  geom_tile() +
  geom_text(aes(color = proportion > .25),
            family = "serif") +
  scale_fill_gradient(low = wes_palette("Moonrise2")[4],
                      high = wes_palette("Moonrise2")[1],
                      limits = c(0, 1)) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(1, 4)]) +
  scale_x_discrete(NULL, position = "top") +
  ylab(NULL) +
  theme(axis.text.y = element_text(hjust = 0),
        axis.ticks = element_blank(),
        legend.position = "none")
```
As it turns out, "The departments with a larger proportion of women applicants are also those with lower overall admissions rates" (p. 344). If we presume gender influences both choice of department and admission rates, we might depict that in a simple DAG where $G$ is applicant gender, $D$ is department, and $A$ is acceptance into grad school.
```{r, fig.width = 3, fig.height = 1.5, warning = F, message = F}
library(ggdag)

dag_coords <-
  tibble(name = c("G", "D", "A"),
         x    = c(1, 2, 3),
         y    = c(1, 2, 1))

dagify(D ~ G,
       A ~ D + G,
       coords = dag_coords) %>%

  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```
Although our `b11.8` model did not contain a parameter corresponding to the $G \rightarrow D$ pathway, it did condition on both $G$ and $D$. If we make another version of Figure 11.5, we'll see that conditioning on both substantially improves the posterior predictive distribution.
```{r, fig.height = 2.75, fig.width = 4.5}
predict(b11.8) %>%
  data.frame() %>%
  bind_cols(d) %>%

  ggplot(aes(x = case, y = admit / applications)) +
  geom_pointrange(aes(y = Estimate / applications,
                      ymin = Q2.5 / applications,
                      ymax = Q97.5 / applications),
                  color = wes_palette("Moonrise2")[1],
                  shape = 1, alpha = 1/3) +
  geom_point(color = wes_palette("Moonrise2")[2]) +
  geom_line(aes(group = dept),
            color = wes_palette("Moonrise2")[2]) +
  geom_text(data = text,
            aes(y = admit, label = dept),
            color = wes_palette("Moonrise2")[2],
            family = "serif") +
  scale_y_continuous("Proportion admitted", limits = 0:1) +
  labs(title = "Posterior validation check",
       subtitle = "Though imperfect, this model is a big improvement") +
  theme(axis.ticks.x = element_blank())
```
Here's the DAG that proposes an unobserved confound, $U$, that might better explain the $D \rightarrow A$ pathway.
```{r, fig.width = 3, fig.height = 1.5}
dag_coords <-
  tibble(name = c("G", "D", "A", "U"),
         x    = c(1, 2, 3, 3),
         y    = c(1, 2, 1, 2))

dagify(D ~ G + U,
       A ~ D + G + U,
       coords = dag_coords) %>%

  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_point(x = 3, y = 2,
             size = 5, color = wes_palette("Moonrise2")[2]) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```
McElreath recommended we look at the `pairs()` plot to get a sense of how highly correlated the parameters in our `b11.8` model are. Why not get a little extra and use custom settings for the upper triangle, the diagonal, and the lower triangle with a `GGally::ggpairs()` plot? First we save our custom settings.
```{r}
my_upper <- function(data, mapping, ...) {
  
  # get the x and y data to use the other code
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  r  <- unname(cor.test(x, y)$estimate)
  rt <- format(r, digits = 2)[1]
  tt <- as.character(rt)
  
  # plot the cor value
  ggally_text(
    label = tt,
    mapping = aes(),
    size = 4,
    color = wes_palette("Moonrise2")[4],
    alpha = 4/5,
    family = "Times") +
    theme_void()
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(fill = wes_palette("Moonrise2")[2], size = 0) +
    theme_void()
}

my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(color = wes_palette("Moonrise2")[1],
               size = 1/10, alpha = 1/10) +
    theme_void()
}
```
To learn more about the nature of the code for the `my_upper()` function, check out [Issue #139](https://github.com/ggobi/ggally/issues/139) in the [**GGally** GitHub repository](https://github.com/ggobi/ggally). Here is the plot.
```{r, fig.height = 5, fig.width = 5.5, warning = F, message = F}
library(GGally)

as_draws_df(b11.8) %>%
  select(starts_with("b_")) %>%
  set_names(c("alpha[male]", "alpha[female]", str_c("delta[", LETTERS[1:6], "]"))) %>%
  ggpairs(upper = list(continuous = my_upper),
          diag  = list(continuous = my_diag),
          lower = list(continuous = my_lower),
          labeller = "label_parsed") +
  labs(title = "Model: b11.8",
       subtitle = "The parameters are strongly correlated.") +
  theme(strip.text = element_text(size = 11))
```
> Why might we want to over-parameterize the model? Because it makes it easier to assign priors. If we made one of the genders baseline and measured the other as a deviation from it, we would stumble into the issue of assuming that the acceptance rate for one of the genders is pre-data more uncertain than the other. This isn't to say that over-parameterizing a model is always a good idea. But it isn't a violation of any statistical principle. You can always convert the posterior, post sampling, to any alternative parameterization. The only limitation is whether the algorithm we use to approximate the posterior can handle the high correlations. In this case, it can. (p. 345)
#### Rethinking: Simpson's paradox is not a paradox.
> This empirical example is a famous one in statistical teaching. It is often used to illustrate a phenomenon known as **Simpson’s paradox**. Like most paradoxes, there is no violation of logic, just of intuition. And since different people have different intuition, Simpson's paradox means different things to different people. The poor intuition being violated in this case is that a positive association in the entire population should also hold within each department. (p. 345, **emphasis** in the original)
In my field of clinical psychology, Simpson's paradox is an important, if under-appreciated, phenomenon. If you're in the social sciences as well, I highly recommend spending more time thinking about it. To get you started, I blogged about it [here](https://solomonkurz.netlify.app/post/2019-10-09-individuals-are-not-small-groups-i-simpson-s-paradox/) and @kievitSimpsonParadoxPsychological2013 wrote a great tutorial paper called [*Simpson's paradox in psychological science: a practical guide*](https://doi.org/10.3389/fpsyg.2013.00513).
## Poisson regression
> When a binomial distribution has a very small probability of an event $p$ and a very large number of trials $N$, then it takes on a special shape. The expected value of a binomial distribution is just $Np$, and its variance is $Np(1 - p)$. But when $N$ is very large and $p$ is very small, then these are approximately the same. (p. 346)
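Here's a quick numerical check of that claim. With $N = 1{,}000$ and $p = 1/1{,}000$, the mean $Np$ and the variance $Np(1 - p)$ are nearly identical.

```{r}
# with a large N and a small p...
1000 * (1 / 1000)                   # the mean, Np
1000 * (1 / 1000) * (1 - 1 / 1000)  # the variance, Np(1 - p)
```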
Data of this kind are often called count data. Here we simulate some.
```{r}
set.seed(11)