```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# God Spiked the Integers
> The most common and useful generalized linear models are models for counts. Counts
are non-negative integers--0, 1, 2, and so on. They are the basis of all mathematics, the first bits that children learn. But they are also intoxicatingly complicated to model--hence the apocryphal slogan that titles this chapter. The essential problem is this: When what we wish to predict is a count, the scale of the parameters is never the same as the scale of the outcome. A count golem, like a tide prediction engine, has a whirring machinery underneath that doesn't resemble the output. Keeping the tide engine in mind, you can master these models and use them responsibly.
>
> We will engineer complete examples of the two most common types of count model. **Binomial regression** is the name we'll use for a family of related procedures that all model a binary classification--alive/dead, accept/reject, left/right--for which the total of both categories is known. This is like the marble and globe tossing examples from [Chapter 2][Small Worlds and Large Worlds]. But now you get to incorporate predictor variables. **Poisson regression** is a GLM that models a count with an unknown maximum—number of elephants in Kenya, number of applications to a PhD program, number of significance tests in an issue of [*Psychological Science*](https://www.psychologicalscience.org/publications/psychological_science). As described in [Chapter 10][Big Entropy and the Generalized Linear Model], the Poisson model is a special case of binomial. At the end, the chapter describes some other count regressions. [@mcelreathStatisticalRethinkingBayesian2020, p. 323, **emphasis** in the original]
In this chapter, we focus on the two most common types of count models: the binomial and the Poisson.
## Binomial regression
The basic binomial model follows the form
$$y \sim \operatorname{Binomial}(n, p),$$
where $y$ is some count variable, $n$ is the number of trials, and $p$ is the probability a given trial was a 1, which is sometimes termed a *success*. When $n = 1$, then $y$ is a vector of 0's and 1's. Presuming the logit link[^4], which we just covered in [Chapter 10][Linking linear models to distributions.], models of this type are commonly termed logistic regression. When $n > 1$, and still presuming the logit link, we might call our model an aggregated logistic regression model, or more generally an aggregated binomial regression model.
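To keep the logit link concrete, here's a quick sketch of the round trip between the probability scale and the log-odds scale using base-**R**'s `qlogis()` (the logit function) and `plogis()` (its inverse). The 0.75 value is arbitrary, just for illustration.

```{r}
# logit: map a probability onto the log-odds scale
qlogis(0.75)          # log(0.75 / 0.25), about 1.1

# inverse logit: map log-odds back onto the probability scale
plogis(1.1)           # about 0.75

# the round trip recovers the original probability
plogis(qlogis(0.75))  # exactly 0.75
```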
### Logistic regression: Prosocial chimpanzees.
Load the @silkChimpanzeesAreIndifferent2005 `chimpanzees` data.
```{r, message = F, warning = F}
data(chimpanzees, package = "rethinking")
d <- chimpanzees
rm(chimpanzees)
```
The data include two experimental conditions, `prosoc_left` and `condition`, each of which has two levels. This results in four combinations.
```{r, message = F, warning = F}
library(tidyverse)
library(flextable)
d %>%
  distinct(prosoc_left, condition) %>%
  mutate(description = str_c("Two food items on ", c("right and no partner",
                                                     "left and no partner",
                                                     "right and partner present",
                                                     "left and partner present"))) %>%
  flextable() %>%
  width(width = c(1, 1, 4))
```
It would be conventional to include these two variables and their interaction using dummy variables. We're going to follow McElreath and use an index variable approach, instead. If you'd like to see what this would look like using the dummy variable approach, check out my [-@kurzStatisticalRethinkingBrms2020] [translation of the corresponding section](https://bookdown.org/content/3890/counting-and-classification.html#logistic-regression-prosocial-chimpanzees.) from McElreath's first [-@mcelreathStatisticalRethinkingBayesian2015] edition. For now, make the index, which we'll be saving as a factor.
```{r}
d <-
  d %>%
  mutate(treatment = factor(1 + prosoc_left + 2 * condition)) %>%
  # this will come in handy, later
  mutate(labels = factor(treatment,
                         levels = 1:4,
                         labels = c("r/n", "l/n", "r/p", "l/p")))
```
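For comparison, a conventional dummy-variable-and-interaction version of the upcoming models might use formula syntax like the following. This is only a sketch with hypothetical priors and a hypothetical fit name, `b_dummy`; we won't be fitting it.

```{r, eval = F}
# the dummy-variable approach, for comparison only (not fit in this chapter)
b_dummy <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1 + prosoc_left + condition + prosoc_left:condition,
      prior = c(prior(normal(0, 1.5), class = Intercept),
                prior(normal(0, 0.5), class = b)),
      seed = 11)
```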
We can use the `dplyr::count()` function to get a sense of the distribution of the conditions in the data.
```{r}
d %>%
  count(condition, treatment, prosoc_left)
```
Fire up **brms**.
```{r, message = F, warning = F}
library(brms)
```
We start with the simple intercept-only logistic regression model, which follows the statistical formula
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha \\
\alpha & \sim \operatorname{Normal}(0, w),
\end{align*}
where $w$ is the hyperparameter for the prior standard deviation, the value of which we have yet to choose. To start things off, we'll set $w = 10$, fit a model in which we set `sample_prior = TRUE`, and get a sense of the prior with a plot.
In the `brm()` `formula` syntax, including a `|` bar on the left side of a formula indicates we have extra supplementary information about our criterion. In this case, that information is that each `pulled_left` value corresponds to a single trial (i.e., `trials(1)`), which itself corresponds to the $n = 1$ portion of the statistical formula, above.
```{r b11.1}
b11.1 <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1,
      prior(normal(0, 10), class = Intercept),
      seed = 11,
      sample_prior = T,
      file = "fits/b11.01")
```
Before we go any further, let's discuss the plot theme. For this chapter, we'll take our color scheme from the `"Moonrise2"` palette from the [**wesanderson** package](https://cran.r-project.org/package=wesanderson) [@R-wesanderson].
```{r, message = F, fig.width = 3, fig.height = 1}
library(wesanderson)
wes_palette("Moonrise2")
wes_palette("Moonrise2")[1:4]
```
We'll also take a few formatting cues from Edward Tufte [-@tufteVisualDisplayQuantitative2001], courtesy of the [**ggthemes package**](https://cran.r-project.org/package=ggthemes). The `theme_tufte()` function will change the default font and remove some chart junk. The `theme_set()` function, below, will make these adjustments the default for all subsequent **ggplot2** plots. To undo this, just execute `theme_set(theme_default())`.
```{r, message = F, warning = F}
library(ggthemes)
theme_set(
  theme_default() +
    theme_tufte() +
    theme(plot.background = element_rect(fill = wes_palette("Moonrise2")[3],
                                         color = wes_palette("Moonrise2")[3]))
)
```
Now we're ready to plot. We'll extract the prior draws with `prior_draws()`, convert them from the log-odds metric to the probability metric with the `brms::inv_logit_scaled()` function, and adjust the bandwidth of the density plot with the `adjust` argument within `geom_density()`.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
prior_draws(b11.1) %>%
  mutate(p = inv_logit_scaled(Intercept)) %>%
  ggplot(aes(x = p)) +
  geom_density(fill = wes_palette("Moonrise2")[4],
               size = 0, adjust = 0.1) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab("prior prob pull left")
```
At this point in the analysis, we were only able to make part of the left panel of McElreath's Figure 11.3. We'll add to it in a bit. Now update the model so that $w = 1.5$.
```{r b11.1b}
b11.1b <-
  brm(data = d,
      family = binomial,
      pulled_left | trials(1) ~ 1,
      prior(normal(0, 1.5), class = Intercept),
      seed = 11,
      sample_prior = T,
      file = "fits/b11.01b")
```
Now we can make the full version of the left panel of Figure 11.3.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
# wrangle
bind_rows(prior_draws(b11.1),
          prior_draws(b11.1b)) %>%
  mutate(p = inv_logit_scaled(Intercept),
         w = factor(rep(c(10, 1.5), each = n() / 2),
                    levels = c(10, 1.5))) %>%

  # plot
  ggplot(aes(x = p, fill = w)) +
  geom_density(size = 0, alpha = 3/4, adjust = 0.1) +
  scale_fill_manual(expression(italic(w)), values = wes_palette("Moonrise2")[c(4, 1)]) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(title = expression(alpha%~%Normal(0*", "*italic(w))),
       x = "prior prob pull left")
```
If we'd like to fit a model that includes an overall intercept and uses McElreath's index variable approach for the predictor variable `treatment`, we'll have to switch to the **brms** non-linear syntax. Here it is for the models using $w = 10$ and then $w = 0.5$.
```{r b11.2}
# w = 10
b11.2 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 10), nlpar = b, coef = treatment1),
                prior(normal(0, 10), nlpar = b, coef = treatment2),
                prior(normal(0, 10), nlpar = b, coef = treatment3),
                prior(normal(0, 10), nlpar = b, coef = treatment4)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      sample_prior = T,
      file = "fits/b11.02")

# w = 0.5
b11.3 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b, coef = treatment1),
                prior(normal(0, 0.5), nlpar = b, coef = treatment2),
                prior(normal(0, 0.5), nlpar = b, coef = treatment3),
                prior(normal(0, 0.5), nlpar = b, coef = treatment4)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      sample_prior = T,
      file = "fits/b11.03")
```
If all you want to do is fit the models, you wouldn't have to add a separate `prior()` statement for each level of `treatment`. You could have just included a single line, `prior(normal(0, 0.5), nlpar = b)`, that did not include a `coef` argument. The problem with this approach is we'd only get one column for `treatment` when using the `prior_draws()` function to retrieve the prior samples. To get separate columns for the prior samples of each of the levels of `treatment`, you need to take the verbose approach, above.
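For the record, here's a sketch of that more compact specification. The fit object name `b11.3_compact` is just a hypothetical; we won't fit or use it.

```{r, eval = F}
# one prior() line covers all four levels of treatment,
# but prior_draws() would then return a single prior column for b
b11.3_compact <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 1,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      sample_prior = T,
      seed = 11)
```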
Anyway, here's how to make a version of the right panel of Figure 11.3.
```{r, fig.height = 2.75, fig.width = 3, warning = F, message = F}
# wrangle
prior <-
  bind_rows(prior_draws(b11.2),
            prior_draws(b11.3)) %>%
  mutate(w  = factor(rep(c(10, 0.5), each = n() / 2),
                     levels = c(10, 0.5)),
         p1 = inv_logit_scaled(b_a + b_b_treatment1),
         p2 = inv_logit_scaled(b_a + b_b_treatment2)) %>%
  mutate(diff = abs(p1 - p2))

# plot
prior %>%
  ggplot(aes(x = diff, fill = w)) +
  geom_density(size = 0, alpha = 3/4, adjust = 0.1) +
  scale_fill_manual(expression(italic(w)), values = wes_palette("Moonrise2")[c(4, 2)]) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(title = expression(alpha%~%Normal(0*", "*italic(w))),
       x = "prior diff between treatments")
```
Here are the averages of the two prior-predictive difference distributions.
```{r, message = F}
prior %>%
  group_by(w) %>%
  summarise(mean = mean(diff))
```
Before we move on to fit the full model, it might be useful to linger here and examine the nature of the model we just fit. Here's the parameter summary for `b11.3`.
```{r}
print(b11.3)
```
Now focus on the likelihood portion of the model formula,
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha + \beta_\text{treatment} .
\end{align*}
When you have one overall intercept $\alpha$ and then use the non-linear approach for the `treatment` index, you end up with as many $\beta$ parameters as there are levels for `treatment`. This means the formula for `treatment == 1` is $\alpha + \beta_{\text{treatment}[1]}$, the formula for `treatment == 2` is $\alpha + \beta_{\text{treatment}[2]}$, and so on. This also effectively makes $\alpha$ the grand mean. Here's the empirical grand mean.
```{r}
d %>%
  summarise(grand_mean = mean(pulled_left))
```
Now here’s the summary of $\alpha$ after transforming it back into the probability metric with the `inv_logit_scaled()` function.
```{r, warning = F, message = F}
library(tidybayes)

as_draws_df(b11.3) %>%
  transmute(alpha = inv_logit_scaled(b_a_Intercept)) %>%
  mean_qi()
```
Here are the empirical probabilities for each of the four levels of `treatment`.
```{r, message = F}
d %>%
  group_by(treatment) %>%
  summarise(mean = mean(pulled_left))
```
Here are the corresponding posteriors.
```{r, warning = F}
as_draws_df(b11.3) %>%
  pivot_longer(b_b_treatment1:b_b_treatment4) %>%
  mutate(treatment = str_remove(name, "b_b_treatment"),
         mean      = inv_logit_scaled(b_a_Intercept + value)) %>%
  group_by(treatment) %>%
  mean_qi(mean)
```
Okay, let's get back on track with the text. Now we're ready to fit the full model, which follows the form
\begin{align*}
\text{pulled_left}_i & \sim \operatorname{Binomial}(1, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\color{#54635e}{\text{actor}}[i]} + \beta_{\color{#a4692f}{\text{treatment}}[i]} \\
\alpha_{\color{#54635e}j} & \sim \operatorname{Normal}(0, 1.5) \\
\beta_{\color{#a4692f}k} & \sim \operatorname{Normal}(0, 0.5).
\end{align*}
Before fitting the model, we should save `actor` as a factor.
```{r}
d <-
  d %>%
  mutate(actor = factor(actor))
```
Now fit the model.
```{r b11.4}
b11.4 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.04")
```
Inspect the parameter summary.
```{r}
print(b11.4)
```
Here's how we might make our version of McElreath's coefficient plot of the $\alpha$ parameters.
```{r, fig.width = 4.5, fig.height = 1.33, warning = F}
library(tidybayes)

post <- as_draws_df(b11.4)

post %>%
  pivot_longer(contains("actor")) %>%
  mutate(probability = inv_logit_scaled(value),
         actor       = factor(str_remove(name, "b_a_actor"),
                              levels = 7:1)) %>%

  ggplot(aes(x = probability, y = actor)) +
  geom_vline(xintercept = .5, color = wes_palette("Moonrise2")[1], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(expression(alpha[actor]), limits = 0:1) +
  ylab(NULL) +
  theme(axis.ticks.y = element_blank())
```
Here's the corresponding coefficient plot of the $\beta$ parameters.
```{r, fig.width = 3.5, fig.height = 1, warning = F}
tx <- c("R/N", "L/N", "R/P", "L/P")

post %>%
  select(contains("treatment")) %>%
  set_names("R/N","L/N","R/P","L/P") %>%
  pivot_longer(everything()) %>%
  mutate(probability = inv_logit_scaled(value),
         treatment   = factor(name, levels = tx)) %>%
  mutate(treatment = fct_rev(treatment)) %>%

  ggplot(aes(x = value, y = treatment)) +
  geom_vline(xintercept = 0, color = wes_palette("Moonrise2")[2], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  labs(x = expression(beta[treatment]),
       y = NULL) +
  theme(axis.ticks.y = element_blank())
```
Now make the coefficient plot for the primary contrasts of interest.
```{r, fig.width = 3, fig.height = .8, warning = F}
post %>%
  mutate(db13 = b_b_treatment1 - b_b_treatment3,
         db24 = b_b_treatment2 - b_b_treatment4) %>%
  pivot_longer(db13:db24) %>%
  mutate(diffs = factor(name, levels = c("db24", "db13"))) %>%

  ggplot(aes(x = value, y = diffs)) +
  geom_vline(xintercept = 0, color = wes_palette("Moonrise2")[2], linetype = 3) +
  stat_pointinterval(.width = .95, size = 1/2,
                     color = wes_palette("Moonrise2")[4]) +
  labs(x = "difference",
       y = NULL) +
  theme(axis.ticks.y = element_blank())
```
"These are the contrasts between the no-partner/partner treatments" (p. 331). Next, we prepare for the posterior predictive check. McElreath showed how to compute empirical proportions by the levels of `actor` and `treatment` with the `by()` function. Our approach will be with a combination of `group_by()` and `summarise()`. Here's what that looks like for `actor == 1`.
```{r, message = F}
d %>%
  group_by(actor, treatment) %>%
  summarise(proportion = mean(pulled_left)) %>%
  filter(actor == 1)
```
Now we'll follow that through to make the top panel of Figure 11.4. Instead of showing the plot, we'll save it for the next code block.
```{r, message = F}
p1 <-
  d %>%
  group_by(actor, treatment) %>%
  summarise(proportion = mean(pulled_left)) %>%
  left_join(d %>% distinct(actor, treatment, labels, condition, prosoc_left),
            by = c("actor", "treatment")) %>%
  mutate(condition = factor(condition)) %>%

  ggplot(aes(x = labels, y = proportion)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = prosoc_left),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_point(aes(color = condition),
             size = 2.5, show.legend = F) +
  labs(subtitle = "observed proportions")
```
Next we use `fitted()` to get the posterior predictive distributions for each unique combination of `actor` and `treatment`, wrangle, and plot. First, we save the plot as `p2` and then we use **patchwork** syntax to combine the two subplots.
```{r, fig.width = 7, fig.height = 4.5}
nd <-
  d %>%
  distinct(actor, treatment, labels, condition, prosoc_left)

p2 <-
  fitted(b11.4,
         newdata = nd) %>%
  data.frame() %>%
  bind_cols(nd) %>%
  mutate(condition = factor(condition)) %>%

  ggplot(aes(x = labels, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = prosoc_left),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_pointrange(aes(color = condition),
                  fatten = 2.5, show.legend = F) +
  labs(subtitle = "posterior predictions")

# combine the two ggplots
library(patchwork)

(p1 / p2) &
  scale_color_manual(values = wes_palette("Moonrise2")[c(2:1)]) &
  scale_y_continuous("proportion left lever",
                     breaks = c(0, .5, 1), limits = c(0, 1)) &
  xlab(NULL) &
  theme(axis.ticks.x = element_blank(),
        panel.background = element_rect(fill = alpha("white", 1/10), size = 0)) &
  facet_wrap(~ actor, nrow = 1, labeller = label_both)
```
Let's make two more index variables.
```{r}
d <-
  d %>%
  mutate(side = factor(prosoc_left + 1),  # right 1, left 2
         cond = factor(condition + 1))    # no partner 1, partner 2
```
Now fit the model without the interaction between `prosoc_left` and `condition`.
```{r b11.5}
b11.5 <-
  brm(data = d,
      family = binomial,
      bf(pulled_left | trials(1) ~ a + bs + bc,
         a ~ 0 + actor,
         bs ~ 0 + side,
         bc ~ 0 + cond,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = bs),
                prior(normal(0, 0.5), nlpar = bc)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.05")
```
Compare `b11.4` and `b11.5` by the PSIS-LOO and the WAIC.
```{r, warning = F, message = F}
b11.4 <- add_criterion(b11.4, c("loo", "waic"))
b11.5 <- add_criterion(b11.5, c("loo", "waic"))
loo_compare(b11.4, b11.5, criterion = "loo") %>% print(simplify = F)
loo_compare(b11.4, b11.5, criterion = "waic") %>% print(simplify = F)
```
Here are the weights.
```{r}
model_weights(b11.4, b11.5, weights = "loo") %>% round(digits = 2)
model_weights(b11.4, b11.5, weights = "waic") %>% round(digits = 2)
```
Here's a quick check of the parameter summary for the non-interaction model, `b11.5`.
```{r}
print(b11.5)
```
Because it's good practice, here's the `b11.5` version of the bottom panel of Figure 11.4.
```{r, fig.width = 7, fig.height = 2.2}
nd <-
  d %>%
  distinct(actor, treatment, labels, cond, side)

fitted(b11.5,
       newdata = nd) %>%
  data.frame() %>%
  bind_cols(nd) %>%

  ggplot(aes(x = labels, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
  geom_hline(yintercept = .5, color = wes_palette("Moonrise2")[3]) +
  geom_line(aes(group = side),
            size = 1/4, color = wes_palette("Moonrise2")[4]) +
  geom_pointrange(aes(color = cond),
                  fatten = 2.5, show.legend = F) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(2:1)]) +
  scale_y_continuous("proportion left lever",
                     breaks = c(0, .5, 1), limits = c(0, 1)) +
  labs(subtitle = "posterior predictions for b11.5",
       x = NULL) +
  theme(axis.ticks.x = element_blank(),
        panel.background = element_rect(fill = alpha("white", 1/10), size = 0)) +
  facet_wrap(~ actor, nrow = 1, labeller = label_both)
```
#### Overthinking: Adding log-probability calculations to a Stan model.
For retrieving log-probability summaries, our approach with **brms** is a little different than the one you might take with McElreath's **rethinking**. Rather than adding a `log_lik=TRUE` argument within `rethinking::ulam()`, we just use the `log_lik()` function after fitting a **brms** model. You may recall we already practiced this way back in [Section 7.2.4.1][Overthinking: Computing the lppd.]. Here's a quick example of what that looks like for `b11.5`.
```{r}
log_lik(b11.5) %>% str()
```
### Relative shark and absolute deer.
Based on the full model, `b11.4`, here's how you might compute the posterior mean and 95% intervals for the proportional odds of switching from `treatment == 2` to `treatment == 4`.
```{r}
as_draws_df(b11.4) %>%
  mutate(proportional_odds = exp(b_b_treatment4 - b_b_treatment2)) %>%
  mean_qi(proportional_odds)
```
> On average, the switch multiplies the odds of pulling the left lever by 0.92, an 8% reduction in odds. This is what is meant by proportional odds. The new odds are calculated by taking the old odds and multiplying them by the proportional odds, which is 0.92 in this example. (p. 336)
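To see how the same proportional odds play out on the absolute scale, here's a little sketch. The three baseline probabilities are made up for illustration; note how multiplying the odds by 0.92 moves a mid-range probability much more than an extreme one.

```{r}
# hypothetical baseline probabilities
tibble(p_old = c(0.01, 0.5, 0.99)) %>%
  # convert to odds, apply the proportional odds, and convert back
  mutate(odds_old = p_old / (1 - p_old)) %>%
  mutate(odds_new = 0.92 * odds_old) %>%
  mutate(p_new = odds_new / (1 + odds_new))
```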
A limitation of relative measures like proportional odds is that they ignore what you might think of as the reference or the baseline.
> Consider for example a rare disease which occurs in 1 per 10-million people. Suppose also that reading this textbook increased the odds of the disease 5-fold. That would mean approximately 4 more cases of the disease per 10-million people. So only 5-in-10-million chance now. The book is safe. (p. 336)
Here that is in code.
```{r}
tibble(disease_rate  = 1/1e7,
       fold_increase = 5) %>%
  mutate(new_disease_rate = disease_rate * fold_increase)
```
The hard part, though, is that "neither absolute nor relative risk is sufficient for all purposes" (p. 337). Each provides its own unique perspective on the data. Again, welcome to applied statistics. `r emo::ji("man_shrugging")`
### Aggregated binomial: Chimpanzees again, condensed.
With the **tidyverse**, we can use `group_by()` and `summarise()` to achieve what McElreath did with `aggregate()`.
```{r, message = F}
d_aggregated <-
  d %>%
  group_by(treatment, actor, side, cond) %>%
  summarise(left_pulls = sum(pulled_left)) %>%
  ungroup()

d_aggregated %>%
  head(n = 8)
```
To fit an aggregated binomial model with **brms**, we use the `<criterion> | trials()` syntax, where the value that goes in `trials()` is either a fixed number, as in this case, or a variable in the data indexing $n$. Either way, at least some of those trials will have an $n > 1$. Here we'll use the hard-code method, just like McElreath did in the text; we'll sketch the variable-based alternative right after the fit.
```{r b11.6}
b11.6 <-
  brm(data = d_aggregated,
      family = binomial,
      bf(left_pulls | trials(18) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 11,
      file = "fits/b11.06")
```
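As promised, here's a sketch of the variable-based alternative, where `trials()` points to a column in the data rather than a fixed number. The `n_trials` column and the `b11.6_variable` name are hypothetical, and we won't fit this version.

```{r, eval = F}
# n_trials is a hypothetical column recording the number of trials in each row
b11.6_variable <-
  brm(data = d_aggregated %>% mutate(n_trials = 18),
      family = binomial,
      bf(left_pulls | trials(n_trials) ~ a + b,
         a ~ 0 + actor,
         b ~ 0 + treatment,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 0.5), nlpar = b)),
      seed = 11)
```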
Check the posterior summary for `b11.6`.
```{r}
print(b11.6)
```
It might be easiest to compare `b11.4` and `b11.6` with a coefficient plot.
```{r, fig.width = 7, fig.height = 2.5, warning = F}
# this is just for fancy annotation
text <-
  tibble(value = c(1.4, 2.6),
         name  = "b_a_actor7",
         fit   = c("b11.6", "b11.4"))

# rope in the posterior draws and wrangle
bind_rows(as_draws_df(b11.4),
          as_draws_df(b11.6)) %>%
  mutate(fit = rep(c("b11.4", "b11.6"), each = n() / 2)) %>%
  pivot_longer(b_a_actor1:b_b_treatment4) %>%

  # plot
  ggplot(aes(x = value, y = name, color = fit)) +
  stat_pointinterval(.width = .95, size = 2/3,
                     position = position_dodge(width = 0.5)) +
  scale_color_manual(values = wes_palette("Moonrise2")[2:1]) +
  geom_text(data = text,
            aes(label = fit),
            family = "Times", position = position_dodge(width = 2.25)) +
  labs(x = "posterior (log-odds scale)",
       y = NULL) +
  theme(axis.ticks.y = element_blank(),
        legend.position = "none")
```
Did you catch our `position = position_dodge()` tricks? Try executing the plot without those parts of the code to get a sense of what they did. Now compute and save the PSIS-LOO estimates for the two models so we might compare them.
```{r, warning = F, message = F}
b11.4 <- add_criterion(b11.4, "loo")
b11.6 <- add_criterion(b11.6, "loo")
```
Here's how we might attempt the comparison.
```{r, eval = F}
loo_compare(b11.4, b11.6, criterion = "loo") %>% print(simplify = F)
```
Unlike with McElreath's `compare()` code in the text, `loo_compare()` wouldn't even give us the results. All we get is the warning message informing us that because these two models are not based on the same data, comparing them with the LOO is invalid and **brms** refuses to let us do it. We can, however, look at their LOO summaries separately.
```{r}
loo(b11.4)
loo(b11.6)
```
To understand what's going on, consider how you might describe six 1's out of nine trials in the aggregated form,
$$\text{Pr}(6|9, p) = \frac{9!}{6!(9 - 6)!} p^6 (1 - p)^{9 - 6}.$$
If we still stick with the same data, but this time re-express those as nine dichotomous data points, we now describe their joint probability as
$$\text{Pr}(1, 1, 1, 1, 1, 1, 0, 0, 0 | p) = p^6 (1 - p)^{9 - 6}.$$
Let's work this out in code.
```{r}
# deviance of aggregated 6-in-9
-2 * dbinom(6, size = 9, prob = 0.2, log = TRUE)
# deviance of dis-aggregated
-2 * sum(dbinom(c(1, 1, 1, 1, 1, 1, 0, 0, 0), size = 1, prob = 0.2, log = TRUE))
```
> But this difference is entirely meaningless. It is just a side effect of how we organized the data. The posterior distribution for the probability of success on each trial will end up the same, either way. (p. 339)
This is what our coefficient plot showed us, above. The posterior distribution was the same within simulation variance for `b11.4` and `b11.6`. Just like McElreath reported in the text, we also got a warning about high Pareto $k$ values from the aggregated binomial model, `b11.6`. To access the message and its associated table directly, we can feed the results of `loo()` into the `loo::pareto_k_table()` function.
```{r}
loo(b11.6) %>%
  loo::pareto_k_table()
```
> Before looking at the Pareto $k$ values, you might have noticed already that we didn't get a similar warning before in the disaggregated logistic models of the same data. Why not? Because when we aggregated the data by actor-treatment, we forced PSIS (and WAIC) to imagine cross-validation that leaves out all 18 observations in each actor-treatment combination. So instead of leave-one-out cross-validation, it is more like leave-eighteen-out. This makes some observations more influential, because they are really now 18 observations.
>
> What's the bottom line? If you want to calculate WAIC or PSIS, you should use a logistic regression data format, not an aggregated format. Otherwise you are implicitly assuming that only large chunks of the data are separable. (p. 340)
### Aggregated binomial: Graduate school admissions.
Load the infamous `UCBadmit` data [see @bickelSexBiasGraduate1975].
```{r, message = F, warning = F}
data(UCBadmit, package = "rethinking")
d <- UCBadmit
rm(UCBadmit)
d
```
Now compute our new index variable, `gid`. We'll also slip in a `case` variable that saves the row numbers as a factor. That'll come in handy later when we plot.
```{r}
d <-
  d %>%
  mutate(gid  = factor(applicant.gender, levels = c("male", "female")),
         case = factor(1:n()))
```
Note the difference in how we defined our `gid`. Whereas McElreath used numeral indices, we retained the text within an ordered factor. **brms** can handle either approach just fine. The advantage of the factor approach is that it will be easier to understand the output. You'll see in just a bit.
The univariable logistic model with `gid` as the sole predictor of `admit` follows the form
\begin{align*}
\text{admit}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\text{logit}(p_i) & = \alpha_{\text{gid}[i]} \\
\alpha_j & \sim \operatorname{Normal}(0, 1.5),
\end{align*}
where $n_i = \text{applications}_i$, the rows are indexed by $i$, and the two levels of $\text{gid}$ are indexed by $j$. Since we're only using our index variable `gid` to model two intercepts with no further complications, we don't need to use the verbose non-linear syntax to fit this model with **brms**.
```{r b11.7}
b11.7 <-
  brm(data = d,
      family = binomial,
      admit | trials(applications) ~ 0 + gid,
      prior(normal(0, 1.5), class = b),
      iter = 2000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.07")
```
```{r}
print(b11.7)
```
Our results are very similar to those in the text. But notice how our two rows have more informative row names than `a[1]` and `a[2]`. This is why you might consider using the ordered factor approach rather than using numeral indices.
Anyway, here we'll compute the difference score in two metrics and summarize them with a little help from `mean_qi()`.
```{r, warning = F, message = F}
as_draws_df(b11.7) %>%
  mutate(diff_a = b_gidmale - b_gidfemale,
         diff_p = inv_logit_scaled(b_gidmale) - inv_logit_scaled(b_gidfemale)) %>%
  pivot_longer(contains("diff")) %>%
  group_by(name) %>%
  mean_qi(value, .width = .89)
```
**brms** doesn't have a convenience function that works quite like `rethinking::postcheck()`. But we have options, the most handy of which in this case is probably `predict()`.
```{r, fig.height = 2.75, fig.width = 4.5, message = F}
p <-
  predict(b11.7) %>%
  data.frame() %>%
  bind_cols(d)

text <-
  d %>%
  group_by(dept) %>%
  summarise(case  = mean(as.numeric(case)),
            admit = mean(admit / applications) + .05)

p %>%
  ggplot(aes(x = case, y = admit / applications)) +
  geom_pointrange(aes(y = Estimate / applications,
                      ymin = Q2.5 / applications,
                      ymax = Q97.5 / applications),
                  color = wes_palette("Moonrise2")[1],
                  shape = 1, alpha = 1/3) +
  geom_point(color = wes_palette("Moonrise2")[2]) +
  geom_line(aes(group = dept),
            color = wes_palette("Moonrise2")[2]) +
  geom_text(data = text,
            aes(y = admit, label = dept),
            color = wes_palette("Moonrise2")[2],
            family = "serif") +
  scale_y_continuous("Proportion admitted", limits = 0:1) +
  ggtitle("Posterior validation check") +
  theme(axis.ticks.x = element_blank())
```
> Sometimes a fit this bad is the result of a coding mistake. In this case, it is not. The model did correctly answer the question we asked of it: *What are the average probabilities of admission for women and men, across all departments?* The problem in this case is that men and women did not apply to the same departments, and departments vary in their rates of admission. This makes the answer misleading....
>
> Instead of asking *"What are the average probabilities of admission for women and men across all departments?"* we want to ask *"What is the average difference in probability of admission between women and men within departments?"* (pp. 342--343, *emphasis* in the original).
The model better suited to answer that question follows the form
\begin{align*}
\text{admit}_i & \sim \operatorname{Binomial} (n_i, p_i) \\
\text{logit}(p_i) & = \alpha_{\text{gid}[i]} + \delta_{\text{dept}[i]} \\
\alpha_j & \sim \operatorname{Normal} (0, 1.5) \\
\delta_k & \sim \operatorname{Normal} (0, 1.5),
\end{align*}
where departments are indexed by $k$. To fit a model including two index variables like this in **brms**, we'll need to switch back to the non-linear syntax. Though if you'd like to see an analogous approach using conventional **brms** syntax, check out model `b10.9` in [Section 10.1.3](https://bookdown.org/content/3890/counting-and-classification.html#aggregated-binomial-graduate-school-admissions.) of my translation of McElreath's first edition.
```{r b11.8}
b11.8 <-
  brm(data = d,
      family = binomial,
      bf(admit | trials(applications) ~ a + d,
         a ~ 0 + gid,
         d ~ 0 + dept,
         nl = TRUE),
      prior = c(prior(normal(0, 1.5), nlpar = a),
                prior(normal(0, 1.5), nlpar = d)),
      iter = 4000, warmup = 1000, cores = 4, chains = 4,
      seed = 11,
      file = "fits/b11.08")
```
```{r}
print(b11.8)
```
Like with the earlier model, here we compute the difference score for $\alpha$ in two metrics.
```{r, warning = F}
as_draws_df(b11.8) %>%
  mutate(diff_a = b_a_gidmale - b_a_gidfemale,
         diff_p = inv_logit_scaled(b_a_gidmale) - inv_logit_scaled(b_a_gidfemale)) %>%
  pivot_longer(contains("diff")) %>%
  group_by(name) %>%
  mean_qi(value, .width = .89)
```
> Why did adding departments to the model change the inference about gender so much? The earlier figure gives you a hint--the rates of admission vary a lot across departments. Furthermore, women and men applied to different departments. Let's do a quick tabulation to show that: (p. 344)
Here's our **tidyverse**-style tabulation of the proportions of applicants in each department by `gid`.
```{r, warning = F, message = F}
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  select(dept, gid, proportion) %>%
  pivot_wider(names_from = dept,
              values_from = proportion) %>%
  mutate_if(is.double, round, digits = 2)
```
To make it even easier to see, we'll depict it in a tile plot.
```{r, fig.width = 3, fig.height = 1}
d %>%
  group_by(dept) %>%
  mutate(proportion = applications / sum(applications)) %>%
  mutate(label = round(proportion, digits = 2),
         gid   = fct_rev(gid)) %>%

  ggplot(aes(x = dept, y = gid, fill = proportion, label = label)) +
  geom_tile() +
  geom_text(aes(color = proportion > .25),
            family = "serif") +
  scale_fill_gradient(low = wes_palette("Moonrise2")[4],
                      high = wes_palette("Moonrise2")[1],
                      limits = c(0, 1)) +
  scale_color_manual(values = wes_palette("Moonrise2")[c(1, 4)]) +
  scale_x_discrete(NULL, position = "top") +
  ylab(NULL) +
  theme(axis.text.y = element_text(hjust = 0),
        axis.ticks = element_blank(),
        legend.position = "none")
```
As it turns out, "The departments with a larger proportion of women applicants are also those with lower overall admissions rates" (p. 344). If we presume gender influences both choice of department and admission rates, we might depict that in a simple DAG where $G$ is applicant gender, $D$ is department, and $A$ is acceptance into grad school.
```{r, fig.width = 3, fig.height = 1.5, warning = F, message = F}
library(ggdag)

dag_coords <-
  tibble(name = c("G", "D", "A"),
         x    = c(1, 2, 3),
         y    = c(1, 2, 1))

dagify(D ~ G,
       A ~ D + G,
       coords = dag_coords) %>%

  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```
Although our `b11.8` model did not contain a parameter corresponding to the $G \rightarrow D$ pathway, it did condition on both $G$ and $D$. If we make another version of Figure 11.5, we'll see that conditioning on both substantially improves the posterior predictive distribution.
```{r, fig.height = 2.75, fig.width = 4.5}
predict(b11.8) %>%
  data.frame() %>%
  bind_cols(d) %>%

  ggplot(aes(x = case, y = admit / applications)) +
  geom_pointrange(aes(y = Estimate / applications,
                      ymin = Q2.5 / applications,
                      ymax = Q97.5 / applications),
                  color = wes_palette("Moonrise2")[1],
                  shape = 1, alpha = 1/3) +
  geom_point(color = wes_palette("Moonrise2")[2]) +
  geom_line(aes(group = dept),
            color = wes_palette("Moonrise2")[2]) +
  geom_text(data = text,
            aes(y = admit, label = dept),
            color = wes_palette("Moonrise2")[2],
            family = "serif") +
  scale_y_continuous("Proportion admitted", limits = 0:1) +
  labs(title = "Posterior validation check",
       subtitle = "Though imperfect, this model is a big improvement") +
  theme(axis.ticks.x = element_blank())
```
Here's the DAG that proposes an unobserved confound, $U$, that might better explain the $D \rightarrow A$ pathway.
```{r, fig.width = 3, fig.height = 1.5}
dag_coords <-
  tibble(name = c("G", "D", "A", "U"),
         x    = c(1, 2, 3, 3),
         y    = c(1, 2, 1, 2))

dagify(D ~ G + U,
       A ~ D + G + U,
       coords = dag_coords) %>%

  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_point(x = 3, y = 2,
             size = 5, color = wes_palette("Moonrise2")[2]) +
  geom_dag_text(color = wes_palette("Moonrise2")[4], family = "serif") +
  geom_dag_edges(edge_color = wes_palette("Moonrise2")[4]) +
  scale_x_continuous(NULL, breaks = NULL) +
  scale_y_continuous(NULL, breaks = NULL)
```
McElreath recommended we look at the `pairs()` plot to get a sense of how highly correlated the parameters in our `b11.8` model are. Why not get a little extra and use custom settings for the upper triangle, the diagonal, and the lower triangle with a `GGally::ggpairs()` plot? First we save our custom settings.
```{r}
my_upper <- function(data, mapping, ...) {
  
  # get the x and y data to use the other code
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  r  <- unname(cor.test(x, y)$estimate)
  rt <- format(r, digits = 2)[1]
  tt <- as.character(rt)
  
  # plot the cor value
  ggally_text(
    label = tt,
    mapping = aes(),
    size = 4,
    color = wes_palette("Moonrise2")[4],
    alpha = 4/5,
    family = "Times") +
    theme_void()
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(fill = wes_palette("Moonrise2")[2], size = 0) +
    theme_void()
}

my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(color = wes_palette("Moonrise2")[1],
               size = 1/10, alpha = 1/10) +
    theme_void()
}
```
To learn more about the nature of the code for the `my_upper()` function, check out [Issue #139](https://github.com/ggobi/ggally/issues/139) in the [**GGally** GitHub repository](https://github.com/ggobi/ggally). Here is the plot.
```{r, fig.height = 5, fig.width = 5.5, warning = F, message = F}
library(GGally)

as_draws_df(b11.8) %>%
  select(starts_with("b_")) %>%
  set_names(c("alpha[male]", "alpha[female]", str_c("delta[", LETTERS[1:6], "]"))) %>%
  ggpairs(upper = list(continuous = my_upper),
          diag  = list(continuous = my_diag),
          lower = list(continuous = my_lower),
          labeller = "label_parsed") +
  labs(title = "Model: b11.8",
       subtitle = "The parameters are strongly correlated.") +
  theme(strip.text = element_text(size = 11))
```
> Why might we want to over-parameterize the model? Because it makes it easier to assign priors. If we made one of the genders baseline and measured the other as a deviation from it, we would stumble into the issue of assuming that the acceptance rate for one of the genders is pre-data more uncertain than the other. This isn't to say that over-parameterizing a model is always a good idea. But it isn't a violation of any statistical principle. You can always convert the posterior, post sampling, to any alternative parameterization. The only limitation is whether the algorithm we use to approximate the posterior can handle the high correlations. In this case, it can. (p. 345)
#### Rethinking: Simpson's paradox is not a paradox.
> This empirical example is a famous one in statistical teaching. It is often used to illustrate a phenomenon known as **Simpson’s paradox**. Like most paradoxes, there is no violation of logic, just of intuition. And since different people have different intuition, Simpson's paradox means different things to different people. The poor intuition being violated in this case is that a positive association in the entire population should also hold within each department. (p. 345, **emphasis** in the original)
In my field of clinical psychology, Simpson's paradox is an important, if under-appreciated, phenomenon. If you're in the social sciences as well, I highly recommend spending more time thinking about it. To get you started, I blogged about it [here](https://solomonkurz.netlify.app/post/2019-10-09-individuals-are-not-small-groups-i-simpson-s-paradox/) and @kievitSimpsonParadoxPsychological2013 wrote a great tutorial paper called [*Simpson's paradox in psychological science: a practical guide*](https://doi.org/10.3389/fpsyg.2013.00513).
## Poisson regression
> When a binomial distribution has a very small probability of an event $p$ and a very large number of trials $N$, then it takes on a special shape. The expected value of a binomial distribution is just $Np$, and its variance is $Np(1 - p)$. But when $N$ is very large and $p$ is very small, then these are approximately the same. (p. 346)
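Here's a quick numerical check of that claim. With $N = 1{,}000$ and $p = 1/1{,}000$, the mean $Np$ and the variance $Np(1 - p)$ are nearly identical.

```{r}
# with a large N and a small p...
1000 * (1 / 1000)                   # the mean, Np
1000 * (1 / 1000) * (1 - 1 / 1000)  # the variance, Np(1 - p)
```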
Data of this kind are often called count data. Here we simulate some.
```{r}
set.seed(11)