```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# Missing Data and Other Opportunities
For the opening example, we're playing with the conditional probability
$$
\Pr(\text{burnt down} | \text{burnt up}) = \frac{\Pr(\text{burnt up, burnt down})}{\Pr(\text{burnt up})}.
$$
Given McElreath's setup, it works out that
$$
\Pr(\text{burnt down} | \text{burnt up}) = \frac{1/3}{1/2} = \frac{2}{3}.
$$
We might express the math toward the bottom of page 489 in tibble form like this.
```{r, warning = F, message = F}
library(tidyverse)
p_pancake <- 1/3
(
d <-
tibble(pancake = c("BB", "BU", "UU"),
p_burnt = c(1, .5, 0)) %>%
mutate(p_burnt_up = p_burnt * p_pancake)
)
d %>%
summarise(`p (burnt_down | burnt_up)` = p_pancake / sum(p_burnt_up))
```
I understood McElreath's simulation (**R** code 15.1) better after breaking it apart. The first part of `sim_pancake()` takes one random draw from the integers 1, 2, and 3. It just so happens that if we set `set.seed(1)`, the code returns a 1.
```{r}
set.seed(1)
sample(x = 1:3, size = 1)
```
So here's what it looks like if we use seeds `2:11`.
```{r}
take_sample <- function(seed) {
set.seed(seed)
sample(x = 1:3, size = 1)
}
tibble(seed = 2:11) %>%
mutate(value_returned = map_dbl(seed, take_sample))
```
Each of those `value_returned` values stands for one of the three pancakes: 1 = BB, 2 = BU, and 3 = UU. In the next line, McElreath made slick use of a matrix to specify that. Here's what the matrix looks like.
```{r}
matrix(c(1, 1, 1, 0, 0, 0), nrow = 2, ncol = 3)
```
See how the three columns are identified as `[,1]`, `[,2]`, and `[,3]`? If, say, we wanted to subset the values in the second column, we'd execute
```{r}
matrix(c(1, 1, 1, 0, 0, 0), nrow = 2, ncol = 3)[, 2]
```
which returns a numeric vector.
```{r}
matrix(c(1, 1, 1, 0, 0, 0), nrow = 2, ncol = 3)[, 2] %>% str()
```
That `1 0` corresponds to the pancake with one burnt (i.e., 1) and one unburnt (i.e., 0) side. So when McElreath then executed `sample(sides)`, he randomly shuffled the order of those two values. In the case of `pancake == 2`, he randomly ordered the sides of the pancake with one burnt and one unburnt side. Had he sampled from `pancake == 1`, he would have sampled from the pancake with both sides burnt.
Going forward, let's amend McElreath's `sim_pancake()` function so it will take a `seed` argument, which will allow us to make the output reproducible.
```{r}
# simulate a `pancake` and return randomly ordered `sides`
sim_pancake <- function(seed) {
set.seed(seed)
pancake <- sample(x = 1:3, size = 1)
sides <- matrix(c(1, 1, 1, 0, 0, 0), nrow = 2, ncol = 3)[, pancake]
sample(sides)
}
```
Let's take this baby for a whirl.
```{r, echo = F}
# save(list = c("n_sim", "d"), file = "sims/15.sim_1.rda")
load("/Users/solomonkurz/Dropbox/Recoding Statistical Rethinking 2nd ed/sims/15.sim_1.rda")
```
```{r, eval = F}
# how many simulations would you like?
n_sim <- 1e4
d <-
tibble(seed = 1:n_sim) %>%
mutate(burnt = map(seed, sim_pancake)) %>%
unnest(burnt) %>%
mutate(side = rep(c("up", "down"), times = n() / 2))
```
Take a look at what we've done.
```{r}
head(d, n = 10)
```
Now we use `pivot_wider()` and `summarise()` to get the value we've been working for.
```{r}
d %>%
pivot_wider(names_from = side, values_from = burnt) %>%
summarise(`p (burnt_down | burnt_up)` = sum(up == 1 & down == 1) / (sum(up)))
```
The results are within rounding error of the ideal 2/3.
> Probability theory is not difficult mathematically. It is just counting. But it is hard to interpret and apply. Doing so often seems to require some cleverness, and authors have an incentive to solve problems in clever ways, just to show off. But we don't need that cleverness, if we ruthlessly apply conditional probability....
>
> In this chapter, [we'll] meet two commonplace applications of this assume-and-deduce strategy. The first is the incorporation of **measurement error** into our models. The second is the estimation of **missing data** through **Bayesian imputation**....
>
> In neither application do [we] have to intuit the consequences of measurement errors nor the implications of missing values in order to design the models. All [we] have to do is state [our] information about the error or about the variables with missing values. Logic does the rest. [@mcelreathStatisticalRethinkingBayesian2020, p. 490, **emphasis** in the original]
## Measurement error
Let's grab those `WaffleDivorce` data from back in [Chapter 5][Spurious associations].
```{r, message = F, warning = F}
data(WaffleDivorce, package = "rethinking")
d <- WaffleDivorce
rm(WaffleDivorce)
```
In anticipation of **R** code 15.3 and 15.5, wrangle the data a little.
```{r}
d <-
d %>%
mutate(D_obs = (Divorce - mean(Divorce)) / sd(Divorce),
D_sd = Divorce.SE / sd(Divorce),
M = (Marriage - mean(Marriage)) / sd(Marriage),
A = (MedianAgeMarriage - mean(MedianAgeMarriage)) / sd(MedianAgeMarriage),
M_obs = M,
M_sd = Marriage.SE / sd(Marriage))
```
For the plots in this chapter, we'll use the dark themes from the [**ggdark** package](https://CRAN.R-project.org/package=ggdark) [@R-ggdark]. Our primary theme will be `ggdark::dark_theme_bw()`. One way to use the `dark_theme_bw()` function is to make it part of the code for an individual plot, such as `ggplot() + geom_point() + dark_theme_bw()`. Another way is to make `dark_theme_bw()` the default setting with `ggplot2::theme_set()`. That will be our method.
```{r, message = F, warning = F}
library(ggdark)
theme_set(
dark_theme_bw() +
theme(legend.position = "none",
panel.grid = element_blank())
)
# to reset the default ggplot2 theme to its default parameters, execute both:
# ggplot2::theme_set(theme_gray())
# ggdark::invert_geom_defaults()
```
For the rest of our color palette, we'll use colors from the [**viridis** package](https://github.com/sjmgarnier/viridis) [@R-viridis], which provides a variety of colorblind-safe color palettes [see @rudisViridisColorPalettes2018].
```{r, message = F, warning = F}
library(viridis)
```
The `viridis_pal()` function returns the colors of a given palette. The colors in each palette fall on a spectrum. Within `viridis_pal()`, the `option` argument allows one to select a given spectrum, "C" in our case. The final parentheses, `()`, allow one to determine how many discrete colors to break the spectrum up into. We'll choose 7.
```{r}
viridis_pal(option = "C")(7)
```
With a little data wrangling, we can put the colors of our palette in a tibble and display them in a plot.
```{r, fig.height = 2, fig.width = 4}
tibble(factor = "a",
number = factor(1:7),
color_number = str_c(1:7, ". ", viridis_pal(option = "C")(7))) %>%
ggplot(aes(x = factor, y = number)) +
geom_tile(aes(fill = number)) +
geom_text(aes(label = color_number, color = number %in% c("5", "6", "7"))) +
scale_color_manual(values = c("black", "white")) +
scale_fill_viridis(option = "C", discrete = T, direction = -1) +
scale_x_discrete(NULL, breaks = NULL, expand = c(0, 0)) +
scale_y_discrete(NULL, breaks = NULL, expand = c(0, 0)) +
ggtitle("Behold: viridis C!")
```
Now, let's make use of our custom theme and reproduce/reimagine Figure 15.1.a.
```{r}
color <- viridis_pal(option = "C")(7)[7]
p1 <-
d %>%
ggplot(aes(x = MedianAgeMarriage,
y = Divorce,
ymin = Divorce - Divorce.SE,
ymax = Divorce + Divorce.SE)) +
geom_pointrange(shape = 20, alpha = 2/3, color = color) +
labs(x = "Median age marriage" ,
y = "Divorce rate")
```
Notice how `viridis_pal(option = "C")(7)[7]` called the seventh color in the color scheme, `"#F0F921FF"`. For Figure 15.1.b, we'll select the sixth color in the palette by coding `viridis_pal(option = "C")(7)[6]`. We'll then combine the two subplots with **patchwork**.
```{r, fig.width = 7.25, fig.height = 3.5}
color <- viridis_pal(option = "C")(7)[6]
p2 <-
d %>%
ggplot(aes(x = log(Population),
y = Divorce,
ymin = Divorce - Divorce.SE,
ymax = Divorce + Divorce.SE)) +
geom_pointrange(shape = 20, alpha = 2/3, color = color) +
scale_y_continuous(NULL, breaks = NULL) +
xlab("log population")
library(patchwork)
p1 | p2
```
Just like in the text, our plot shows states with larger populations tend to have smaller measurement error. The relation between measurement error and `MedianAgeMarriage` is less apparent.
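This isn't in the text, but a couple of quick correlations (a sketch of our own) can put rough numbers on those visual impressions.
```{r}
# not in the text: a rough numeric check of how strongly log population and
# median age at marriage track the size of the divorce-rate standard error
d %>%
  summarise(`cor(log population, Divorce.SE)` = cor(log(Population), Divorce.SE),
            `cor(median age marriage, Divorce.SE)` = cor(MedianAgeMarriage, Divorce.SE))
```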
#### Rethinking: Generative thinking, Bayesian inference.
> Bayesian models are *generative*, meaning they can be used to simulate observations just as well as they can be used to estimate parameters. One benefit of this fact is that a statistical model can be developed by thinking hard about how the data might have arisen. This includes sampling and measurement, as well as the nature of the process we are studying. Then let Bayesian updating discover the implications. (p. 491, *emphasis* in the original)
### Error on the outcome.
Now make a DAG of our data with **ggdag**.
```{r, warning = F, message = F, fig.width = 3.5, fig.height = 1.5}
library(ggdag)
dag_coords <-
tibble(name = c("A", "M", "D", "Dobs", "eD"),
x = c(1, 2, 2, 3, 4),
y = c(2, 3, 1, 1, 1))
dagify(M ~ A,
D ~ A + M,
Dobs ~ D + eD,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("D", "eD"), "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(parse = T, label = c("A", "D", "M", expression(italic(e)[D]), expression(D[obs]))) +
geom_dag_edges(edge_colour = "#FCF9F0") +
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) +
dark_theme_void()
```
Note our use of the `dark_theme_void()` function. But more to the substance of the matter,
> there's a lot going on here. But we can proceed one step at a time. The left triangle of this DAG is the same system that we worked with back in [Chapter 5][Think before you regress.]. Age at marriage ($A$) influences divorce ($D$) both directly and indirectly, passing through marriage rate ($M$). Then we have the observation model. The true divorce rate $D$ cannot be observed, so it is circled as an unobserved node. However we do get to observe $D_\text{obs}$, which is a function of both the true rate $D$ and some unobserved error $e_\text{D}$. (p. 492)
To get a better sense of what we're about to do, imagine for a moment that each state's divorce rate is normally distributed with a mean of `Divorce` and standard deviation `Divorce.SE`. Those distributions would be like this.
```{r}
d %>%
mutate(Divorce_distribution = str_c("Divorce ~ Normal(", Divorce, ", ", Divorce.SE, ")")) %>%
select(Loc, Divorce_distribution) %>%
head()
```
> Here's how to define the error distribution for each divorce rate. For each observed value $D_{\text{OBS},i}$, there will be one parameter, $D_{\text{TRUE},i}$, defined by:
>
> $$D_{\text{OBS},i} \sim \operatorname{Normal}(D_{\text{TRUE},i}, D_{\text{SE},i})$$
>
> All this does is define the measurement $D_{\text{OBS},i}$ as having the specified Gaussian distribution centered on the unknown parameter $D_{\text{TRUE},i}$. So the above defines a probability for each State $i$'s observed divorce rate, given a known measurement error. (p. 493)
Our model will follow the form
\begin{align*}
\color{#5D01A6FF}{\text{Divorce}_{\text{OBS}, i}} & \color{#5D01A6FF}\sim \color{#5D01A6FF}{\operatorname{Normal}(\text{Divorce}_{\text{TRUE}, i}, \text{Divorce}_{\text{SE}, i})} \\
\color{#5D01A6FF}{\text{Divorce}_{\text{TRUE}, i}} & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta_1 \text A_i + \beta_2 \text M_i \\
\alpha & \sim \operatorname{Normal}(0, 0.2) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\beta_2 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
Fire up **brms**.
```{r, warning = F, message = F}
library(brms)
```
With **brms**, we accommodate measurement error in the criterion using the `mi()` syntax, following the general form `<response> | mi(<se_response>)`. This follows a missing data logic, resulting in Bayesian missing data imputation for the criterion values. The `mi()` syntax is based on the missing data capabilities of **brms**, which we will cover in greater detail in the second half of this chapter.
```{r b15.1}
# put the data into a `list()`
dlist <- list(
D_obs = d$D_obs,
D_sd = d$D_sd,
M = d$M,
A = d$A)
b15.1 <-
brm(data = dlist,
family = gaussian,
D_obs | mi(D_sd) ~ 1 + A + M,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 15,
# note this line
save_pars = save_pars(latent = TRUE),
file = "fits/b15.01")
```
Check the model summary.
```{r}
print(b15.1)
```
To return the summaries for the `D_true[i]` parameters, you might execute `posterior_summary(b15.1)` or `b15.1$fit`.
```{r}
posterior_summary(b15.1) %>%
round(digits = 2) %>%
data.frame()
```
Our rows `Yl[1]` through `Yl[50]` correspond to what **rethinking** named `D_true[1]` through `D_true[50]`. Here's the code for our Figure 15.2.a.
```{r, message = F, warning = F}
library(ggrepel)
states <- c("AL", "AR", "ME", "NH", "RI", "DC", "VT", "AK", "SD", "UT", "ID", "ND", "WY")
d_est <-
posterior_summary(b15.1) %>%
data.frame() %>%
rownames_to_column("term") %>%
mutate(D_est = Estimate) %>%
select(term, D_est) %>%
filter(str_detect(term, "Yl")) %>%
bind_cols(d)
color <- viridis_pal(option = "C")(7)[5]
p1 <-
d_est %>%
ggplot(aes(x = D_sd, y = D_est - D_obs)) +
geom_hline(yintercept = 0, linetype = 2, color = "white") +
geom_point(alpha = 2/3, color = color) +
geom_text_repel(data = . %>% filter(Loc %in% states),
aes(label = Loc),
size = 3, seed = 15, color = "white")
```
We'll use a little `as_draws_df()` + `expand_grid()` magic to help with our version of Figure 15.2.b.
```{r, warning = F, message = F}
library(tidybayes)
states <- c("AR", "ME", "RI", "ID", "WY", "ND", "MN")
color <- viridis_pal(option = "C")(7)[4]
p2 <-
as_draws_df(b15.1) %>%
expand_grid(A = seq(from = -3.5, to = 3.5, length.out = 50)) %>%
mutate(fitted = b_Intercept + b_A * A) %>%
ggplot(aes(x = A)) +
stat_lineribbon(aes(y = fitted),
.width = .95, size = 1/3, color = "grey50", fill = "grey20") +
geom_segment(data = d_est,
aes(xend = A, y = D_obs, yend = D_est),
linewidth = 1/5) +
geom_point(data = d_est,
aes(y = D_obs),
color = color) +
geom_point(data = d_est,
aes(y = D_est),
shape = 1, stroke = 1/3) +
geom_text_repel(data = d %>% filter(Loc %in% states),
aes(y = D_obs, label = Loc),
size = 3, seed = 15, color = "white") +
labs(x = "median age marriage (std)",
y = "divorce rate (std)") +
coord_cartesian(xlim = range(d$A),
ylim = range(d$D_obs))
```
Now combine the two ggplots and plot.
```{r, fig.width = 7.5, fig.height = 3.5, warning = F}
p1 | p2
```
If you look closely, our plot on the left is flipped relative to the one in the text. I'm pretty sure my code is correct, which leaves me to believe McElreath accidentally flipped the ordering in his code and made his $y$-axis `D_obs - D_est`. Happily, our plot on the right matches up nicely with the one in the text.
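If you'd like to see that for yourself, here's a quick sketch (not in the text) that plots the difference the other way around, which reproduces the orientation of the corresponding panel in the text.
```{r, fig.width = 3.8, fig.height = 3.5}
# not in the text: flipping the difference to D_obs - D_est recovers the
# orientation of the left panel in the text's Figure 15.2.a
d_est %>%
  ggplot(aes(x = D_sd, y = D_obs - D_est)) +
  geom_hline(yintercept = 0, linetype = 2, color = "white") +
  geom_point(alpha = 2/3, color = viridis_pal(option = "C")(7)[5])
```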
### Error on both outcome and predictor.
Now we update the DAG to account for measurement error in the predictor.
```{r, warning = F, message = F, fig.width = 3.5, fig.height = 1.5}
dag_coords <-
tibble(name = c("A", "M", "Mobs", "eM", "D", "Dobs", "eD"),
x = c(1, 2, 3, 4, 2, 3, 4),
y = c(2, 3, 3, 3, 1, 1, 1))
dagify(M ~ A,
D ~ A + M,
Mobs ~ M + eM,
Dobs ~ D + eD,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("A", "Mobs", "Dobs"), "b", "a")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(parse = T, label = c("A", "D", "M",
expression(italic(e)[D]), expression(italic(e)[M]),
expression(D[obs]), expression(M[obs]))) +
geom_dag_edges(edge_colour = "#FCF9F0") +
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) +
dark_theme_void()
```
We will express this DAG in an augmented statistical model following the form
\begin{align*}
\text{Divorce}_{\text{OBS}, i} & \sim \operatorname{Normal}(\text{Divorce}_{\text{TRUE}, i}, \text{Divorce}_{\text{SE}, i}) \\
\text{Divorce}_{\text{TRUE}, i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta_1 \text A_i + \beta_2 \color{#5D01A6FF}{\text{Marriage}_{\text{TRUE}, i}} \\
\color{#5D01A6FF}{\text{Marriage}_{\text{OBS}, i}} & \color{#5D01A6FF}\sim \color{#5D01A6FF}{\operatorname{Normal}(\text{Marriage}_{\text{TRUE}, i}, \text{Marriage}_{\text{SE}, i})} \\
\color{#5D01A6FF}{\text{Marriage}_{\text{TRUE}, i}} & \color{#5D01A6FF}\sim \color{#5D01A6FF}{\operatorname{Normal}(0, 1)} \\
\alpha & \sim \operatorname{Normal}(0, 0.2) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\beta_2 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
The current version of **brms** allows users to specify error on predictors with an `me()` statement of the form `me(predictor, sd_predictor)`, where `sd_predictor` is a vector in the data denoting the size of the measurement error, presumed to be in a standard-deviation metric.
```{r b15.2}
# put the data into a `list()`
dlist <- list(
D_obs = d$D_obs,
D_sd = d$D_sd,
M_obs = d$M_obs,
M_sd = d$M_sd,
A = d$A)
b15.2 <-
brm(data = dlist,
family = gaussian,
D_obs | mi(D_sd) ~ 1 + A + me(M_obs, M_sd),
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(normal(0, 1), class = meanme),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 15,
# note this line
save_pars = save_pars(latent = TRUE),
file = "fits/b15.02")
```
```{r, eval = F, echo = F}
library(rethinking)
dlist <- list(
D_obs = standardize( d$Divorce ),
D_sd = d$Divorce.SE / sd( d$Divorce ), M_obs = standardize( d$Marriage ),
M_sd = d$Marriage.SE / sd( d$Marriage ), A = standardize( d$MedianAgeMarriage ), N = nrow(d)
)
m15.2 <- ulam( alist(
D_obs ~ dnorm( D_true , D_sd ), vector[N]:D_true ~ dnorm( mu , sigma ), mu <- a + bA*A + bM*M_true[i],
M_obs ~ dnorm( M_true , M_sd ), vector[N]:M_true ~ dnorm( 0 , 1 ),
a ~ dnorm(0,0.2),
bA ~ dnorm(0,0.5),
bM ~ dnorm(0,0.5),
sigma ~ dexp( 1 )
) , data=dlist , chains=4 , cores=4 )
# summary
precis(m15.2, depth = 2)
```
We'll use `posterior_summary()`, again, to get a sense of those `depth=2` summaries.
```{r, results = 'hide'}
posterior_summary(b15.2) %>%
round(digits = 2)
```
Due to space concerns, I'm not going to show the results here. You can do that on your own. Basically, in addition to the posterior summaries for the `Yl[i]` parameters (what McElreath called $D_{\text{TRUE}, i}$), we now get posterior summaries for the `Xme_meM_obs[i]` parameters (what McElreath called $M_{\text{TRUE}, i}$). Note that you'll need to specify `save_pars = save_pars(latent = TRUE)` in the `brm()` function in order to save the posterior samples of the error-adjusted variables obtained by using the `me()` syntax. Without doing so, functions like `predict()` may give you trouble.
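As a quick peek (not in the text), here's one way to confirm those latent draws really are in the posterior; the `starts_with()` call is just a convenience for grabbing every `Xme_` column.
```{r}
# not in the text: because we set save_pars = save_pars(latent = TRUE), the
# posterior draws contain one Xme_meM_obs[i] column per state
as_draws_df(b15.2) %>%
  select(starts_with("Xme_")) %>%
  glimpse()
```
Here's our version of Figure 15.3.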
```{r, fig.width = 3.8, fig.height = 3.675}
color_y <- viridis_pal(option = "C")(7)[7]
color_p <- viridis_pal(option = "C")(7)[2]
# wrangle
full_join(
# D
tibble(Loc = d %>% pull(Loc),
D_obs = d %>% pull(D_obs),
D_est = posterior_summary(b15.2) %>%
data.frame() %>%
rownames_to_column("term") %>%
filter(str_detect(term, "Yl")) %>%
pull(Estimate)) %>%
pivot_longer(-Loc, values_to = "d") %>%
mutate(name = if_else(name == "D_obs", "observed", "posterior")),
# M
tibble(Loc = d %>% pull(Loc),
M_obs = d %>% pull(M_obs),
M_est = posterior_summary(b15.2) %>%
data.frame() %>%
rownames_to_column("term") %>%
filter(str_detect(term, "Xme_")) %>%
pull(Estimate)) %>%
pivot_longer(-Loc, values_to = "m") %>%
mutate(name = if_else(name == "M_obs", "observed", "posterior")),
by = c("Loc", "name")
) %>%
# plot!
ggplot(aes(x = m, y = d)) +
geom_line(aes(group = Loc),
linewidth = 1/4) +
geom_point(aes(color = name)) +
scale_color_manual(values = c(color_p, color_y)) +
labs(subtitle = "Shrinkage of both divorce rate and marriage rate",
x = "Marriage rate (std)" ,
y = "Divorce rate (std)")
```
The yellow points are model-implied; the purple ones are the original data. It turns out our **brms** model regularized just a little more aggressively than McElreath's **rethinking** model.
Anyway,
> The big take home point for this section is that when you have a distribution of values, don't reduce it down to a single value to use in a regression. Instead, use the entire distribution. Anytime we use an average value, discarding the uncertainty around that average, we risk overconfidence and spurious inference. This doesn't only apply to measurement error, but also to cases in which data are averaged before analysis. (p. 497)
### Measurement terrors.
McElreath invited us to consider a few more DAGs. The first is an instance where both sources of measurement error have a common cause, $P$.
```{r, warning = F, message = F, fig.width = 3.75, fig.height = 1.5}
dag_coords <-
tibble(name = c("A", "M", "Mobs", "eM", "D", "Dobs", "eD", "P"),
x = c(1, 2, 3, 4, 2, 3, 4, 5),
y = c(2, 3, 3, 3, 1, 1, 1, 2))
dagify(M ~ A,
D ~ A + M,
Mobs ~ M + eM,
Dobs ~ D + eD,
eM ~ P,
eD ~ P,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("A", "Mobs", "Dobs", "P"), "b", "a")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(parse = T, label = c("A", "D", "M", "P",
expression(italic(e)[D]), expression(italic(e)[M]),
expression(D[obs]), expression(M[obs]))) +
geom_dag_edges(edge_colour = "#FCF9F0") +
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) +
dark_theme_void()
```
The second instance is when the true marriage rate $M$ has a causal effect on the measurement error for Divorce, $e_\text{D}$.
```{r, warning = F, message = F, fig.width = 3.5, fig.height = 1.5}
dag_coords <-
tibble(name = c("A", "M", "Mobs", "eM", "D", "Dobs", "eD"),
x = c(1, 2, 3, 4, 2, 3, 4),
y = c(2, 3, 3, 3, 1, 1, 1))
dagify(M ~ A,
D ~ A + M,
Mobs ~ M + eM,
Dobs ~ D + eD,
eD ~ M,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("A", "Mobs", "Dobs"), "b", "a")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(parse = T, label = c("A", "D", "M",
expression(italic(e)[D]), expression(italic(e)[M]),
expression(D[obs]), expression(M[obs]))) +
geom_dag_edges(edge_colour = "#FCF9F0") +
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) +
dark_theme_void()
```
The final example is when we have negligible measurement error for $M$ and $D$, but known nonignorable measurement error for the causal variable $A$.
```{r, warning = F, message = F, fig.width = 3.5, fig.height = 1.5}
dag_coords <-
tibble(name = c("eA", "Aobs", "A", "M", "D"),
x = c(1, 2, 3, 4, 4),
y = c(2, 2, 2, 3, 1))
dagify(Aobs ~ A + eA,
M ~ A,
D ~ A,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("A", "eA"), "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(parse = T, label = c("A", expression(italic(e)[A]), expression(A[obs]), "D", "M")) +
geom_dag_edges(edge_colour = "#FCF9F0") +
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) +
dark_theme_void()
```
On page 498, we read:
> In this circumstance, it can happen that a naive regression of $D$ on $A_\text{obs}$ and $M$ will strongly suggest that $M$ influences $D$. The reason is that $M$ contains information about the true $A$. And $M$ is measured more precisely than $A$ is. It's like a proxy $A$. Here's a small simulation you can toy with that will produce such a frustration:
```{r}
n <- 500
set.seed(15)
dat <-
tibble(A = rnorm(n, mean = 0, sd = 1)) %>%
mutate(M = rnorm(n, mean = -A, sd = 1),
D = rnorm(n, mean = A, sd = 1),
A_obs = rnorm(n, mean = A, sd = 1))
```
To get a sense of the havoc ignoring measurement error can cause, we'll fit two models. These aren't in the text, but, you know, let's live a little. The first model will include `A`, the true predictor of `D`. The second model will include `A_obs` instead, the version of `A` with measurement error added in.
```{r b15.2b}
# the model with A containing no measurement error
b15.2b <-
brm(data = dat,
family = gaussian,
D ~ 1 + A + M,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 15,
# note this line
save_pars = save_pars(latent = TRUE),
file = "fits/b15.02b")
# The model where A has measurement error, but we ignore it
b15.2c <-
brm(data = dat,
family = gaussian,
D ~ 1 + A_obs + M,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, cores = 4, chains = 4,
seed = 15,
# note this line
save_pars = save_pars(latent = TRUE),
file = "fits/b15.02c")
```
Check the summaries.
```{r}
print(b15.2b)
print(b15.2c)
```
Model `b15.2b`, where `A` contains no measurement error, comes close to reproducing the data-generating parameters. The second model, `b15.2c`, which used `A` infused with measurement error (i.e., `A_obs`), is a disaster. A coefficient plot might help the comparison.
```{r, fig.width = 6, fig.height = 3}
# for annotation
text <-
tibble(fit = "b15.2b",
term = "beta[0]",
Estimate = fixef(b15.2b, probs = .99)["Intercept", 3],
label = "In this plot, we like the yellow posteriors.")
# wrangle
bind_rows(posterior_summary(b15.2b) %>% data.frame() %>% rownames_to_column("term"),
posterior_summary(b15.2c) %>% data.frame() %>% rownames_to_column("term")) %>%
filter(term != "lp__") %>%
filter(term != "lprior") %>%
mutate(term = rep(c(str_c("beta[", 0:2, "]"), "sigma"), times = 2),
fit = rep(c("b15.2b", "b15.2c"), each = n() / 2)) %>%
# plot!
ggplot(aes(x = Estimate, y = fit)) +
geom_vline(xintercept = 0, linetype = 3, alpha = 1/2) +
geom_pointrange(aes(xmin = Q2.5, xmax = Q97.5, color = fit)) +
geom_text(data = text,
aes(label = label),
hjust = 0, color = color_y) +
scale_color_manual(values = c(color_y, "white")) +
labs(x = "marginal posterior",
y = NULL) +
theme(axis.ticks.y = element_blank(),
strip.background = element_rect(color = "transparent", fill = "transparent")) +
facet_wrap(~ term, labeller = label_parsed, ncol = 1)
```
## Missing data
> With measurement error, the insight is to realize that any uncertain piece of data can be replaced by a distribution that reflects uncertainty. But sometimes data are simply missing--no measurement is available at all. At first, this seems like a lost cause. What can be done when there is no measurement at all, not even one with error?...
>
> So what can we do instead? We can think causally about missingness, and we can use the model to **impute** missing values. A generative model tells you whether the process that produced the missing values will also prevent the identification of causal effects. (p. 499, **emphasis** in the original)
Starting with [version 2.2.0](https://cran.r-project.org/package=brms/news/news.html), **brms** supports Bayesian missing data imputation using adaptations of the [multivariate syntax](https://cran.r-project.org/package=brms/vignettes/brms_multivariate.html) [@Bürkner2022Multivariate]. Bürkner's [-@Bürkner2022HandleMissingValues] vignette, [*Handle missing values with brms*](https://cran.r-project.org/package=brms/vignettes/brms_missings.html), can provide a nice overview.
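To give a sense of what that syntax looks like, here's a minimal sketch patterned after that vignette. The variables `y`, `x`, and `z` and the data frame `my_data` are hypothetical; the point is just that `mi()` appears both on the left-hand side (to flag a response with missing values) and on the right-hand side (to use the imputed values as a predictor).
```{r, eval = F}
# a minimal sketch of the multivariate mi() syntax; `y`, `x`, `z`, and
# `my_data` are hypothetical, and `x` is the variable containing NAs
bform <-
  bf(y | mi() ~ mi(x) + z) +
  bf(x | mi() ~ z) +
  set_rescor(FALSE)

fit <- brm(bform, data = my_data)
```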
#### Rethinking: Missing data are meaningful data.
> The fact that a variable has an unobserved value is still an observation. It is data, just with a very special value. The meaning of this value depends upon the context. Consider for example a questionnaire on personal income. If some people refuse to fill in their income, this may be associated with low (or high) income. Therefore a model that tries to predict the missing values can be enlightening. (p. 499)
### DAG ate my homework.
We'll start this section off with our version of Figure 15.4. It's going to take a bit of effort on our part to make a nice representation of those four DAGs. Here we make panels a, b, and d.
```{r}
# panel a
dag_coords <-
tibble(name = c("S", "H", "Hs", "D"),
x = c(1, 2, 2, 1),
y = c(2, 2, 1, 1))
p1 <-
dagify(H ~ S,
Hs ~ H + D,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name == "H", "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(label = c("D", "H", "S", "H*")) +
geom_dag_edges(edge_colour = "#FCF9F0")
# panel b
p2 <-
dagify(H ~ S,
Hs ~ H + D,
D ~ S,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name == "H", "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(label = c("D", "H", "S", "H*")) +
geom_dag_edges(edge_colour = "#FCF9F0")
# panel d
p4 <-
dagify(H ~ S,
Hs ~ H + D,
D ~ H,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name == "H", "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(label = c("D", "H", "S", "H*")) +
geom_dag_edges(edge_colour = "#FCF9F0")
```
Make panel c.
```{r}
dag_coords <-
tibble(name = c("S", "H", "Hs", "D", "X"),
x = c(1, 2, 2, 1, 1.5),
y = c(2, 2, 1, 1, 1.5))
p3 <-
dagify(H ~ S + X,
Hs ~ H + D,
D ~ X,
coords = dag_coords) %>%
tidy_dagitty() %>%
mutate(color = ifelse(name %in% c("H", "X"), "a", "b")) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = color),
size = 7, show.legend = F) +
geom_dag_text(label = c("D", "H", "S", "X", "H*")) +
geom_dag_edges(edge_colour = "#FCF9F0")
```
Now combine, adjust a little, and plot.
```{r, warning = F, message = F, fig.width = 4.5, fig.height = 4}
(p1 + p2 + p3 + p4) +
plot_annotation(tag_levels = "a", tag_prefix = "(", tag_suffix = ")") &
scale_color_manual(values = c(viridis_pal(option = "C")(7)[2], "black")) &
dark_theme_void() +
theme(panel.background = element_rect(fill = "grey8"),
plot.margin = margin(0.15, 0.15, 0.15, 0.15, "in"))
```
On page 500, we read:
> Consider a sample of students, all of whom own dogs. The students produce homework ($H$). This homework varies in quality, influenced by how much each student studies ($S$). We could simulate 100 students, their attributes, and their homework like this:
```{r}
n <- 100
set.seed(15)
d <-
tibble(s = rnorm(n, mean = 0, sd = 1)) %>%
mutate(h = rbinom(n, size = 10, inv_logit_scaled(s)),
d_a = rbinom(n, size = 1, prob = .5),
d_b = ifelse(s > 0, 1, 0)) %>%
mutate(hm_a = ifelse(d_a == 1, NA, h),
hm_b = ifelse(d_b == 1, NA, h))
d
```
In that code block, we simulated the data corresponding to McElreath's **R** code 15.8 through 15.10. We have two `d` and `hm` variables. `d_a` and `hm_a` correspond to McElreath's **R** code 15.9 and the DAG in panel a. `d_b` and `hm_b` correspond to McElreath's **R** code 15.10 and the DAG in panel b.
This wasn't in the text, but here we'll plot `h`, `hm_a`, and `hm_b` to get a sense of how the first two missing data examples compare to the original data.
```{r, fig.width = 8, fig.height = 2.75, warning = F}
p1 <-
d %>%
ggplot(aes(x = s, y = h)) +
geom_point(color = viridis_pal(option = "C")(7)[7], alpha = 2/3) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "true distribution")
p2 <-
d %>%
ggplot(aes(x = s, y = hm_a)) +
geom_point(color = viridis_pal(option = "C")(7)[6], alpha = 2/3) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "missing completely at random")
p3 <-
d %>%
ggplot(aes(x = s, y = hm_b)) +
geom_point(color = viridis_pal(option = "C")(7)[6], alpha = 2/3) +
scale_y_continuous(breaks = 1:10, limits = c(1, 10)) +
labs(subtitle = "missing conditional on s")
p1 + p2 + p3
```
The left panel is the ideal situation, letting us learn what we want to know: the effect of studying on the grade you'll get on your homework ($S \rightarrow H$). Once we enter in a missing data process (i.e., dogs $D$ eating homework), we end up with $H^*$, the homework left over after the dogs. Thus the homework outcomes we collect are a combination of the full set of homework and the hungry dogs. The middle panel depicts the scenario where the dogs eat the homework completely at random, $H \rightarrow H^* \leftarrow D$. In the right panel, we consider a scenario where the dogs only and always eat the homework on the occasions the students studied more than average, $H \rightarrow H^* \leftarrow D \leftarrow S$.
The situation in the third DAG is more complicated. Now homework is conditional on both studying and how noisy it is in a student's home, $X$. Our new variable $X$ isn't measured, and whether the dogs eat the homework is also conditional on that unmeasured $X$. Here's the new data simulation.
```{r}
n <- 1000
set.seed(501)
d <-
tibble(x = rnorm(n, mean = 0, sd = 1),
s = rnorm(n, mean = 0, sd = 1)) %>%
mutate(h = rbinom(n, size = 10, inv_logit_scaled(2 + s - 2 * x)),
d = ifelse(x > 1, 1, 0)) %>%
mutate(hm = ifelse(d == 1, NA, h))
d
```
Those data look like this.
```{r, fig.width = 5.5, fig.height = 2.75, warning = F}
p1 <-
d %>%
ggplot(aes(x = s, y = h)) +
geom_point(color = viridis_pal(option = "C")(7)[7], alpha = 1/4) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "true distribution")
p2 <-
d %>%
ggplot(aes(x = s, y = hm)) +
geom_point(color = viridis_pal(option = "C")(7)[6], alpha = 1/4) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "missing conditional on x")
p1 + p2
```
Fit the model using the data with no missingness.
```{r b15.3}
b15.3 <-
brm(data = d,
family = binomial,
h | trials(10) ~ 1 + s,
prior = c(prior(normal(0, 1), class = Intercept),
prior(normal(0, 0.5), class = b)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 15,
file = "fits/b15.03")
```
Check the results.
```{r}
print(b15.3)
```
Since this is not the data-generating model, we shouldn't be all that surprised the coefficient for `s` is off (it should be 1). Because this is an example where we didn't collect data on $X$, we can think of our incorrect results as a case of **omitted variable bias**.
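This isn't in the text, but as a quick check we might refit the model with the unmeasured $X$ included. The model name `b15.3x` and its `file` name are our own. Since the data were simulated with `inv_logit_scaled(2 + s - 2 * x)`, the coefficients for `s` and `x` should land much closer to 1 and -2, respectively.
```{r b15.3x}
# not in the text: the data-generating model, which includes the unmeasured x
b15.3x <-
  brm(data = d,
      family = binomial,
      h | trials(10) ~ 1 + s + x,
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(normal(0, 0.5), class = b)),
      iter = 2000, warmup = 1000, chains = 4, cores = 4,
      seed = 15,
      file = "fits/b15.03x")

print(b15.3x)
```
Here's what happens when we run the model on `hm`, the homework variable after the hungry dogs got to it.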
```{r b15.4}
b15.4 <-
brm(data = d %>% filter(d == 0),
family = binomial,
h | trials(10) ~ 1 + s,
prior = c(prior(normal(0, 1), class = Intercept),
prior(normal(0, 0.5), class = b)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 15,
file = "fits/b15.04")
```
Check the results.
```{r}
print(b15.4)
```
Interestingly, both the intercept and the coefficient for `s` are now less biased. Because both $H$ and $D$ are conditional on $X$, omitting cases based on $X$ resulted in a model that conditioned on $X$, even though $X$ wasn't directly in the statistical model. This won't always be the case. Consider what happens when we have a different missing data mechanism.
```{r}
d <-
d %>%
mutate(d = ifelse(abs(x) < 1, 1, 0)) %>%
mutate(hm = ifelse(d == 1, NA, h))
d
```
Here's what the updated data look like.
```{r, fig.width = 5.5, fig.height = 2.75, warning = F}
p1 <-
d %>%
ggplot(aes(x = s, y = h)) +
geom_point(color = viridis_pal(option = "C")(7)[7], alpha = 1/4) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "true distribution")
p2 <-
d %>%
ggplot(aes(x = s, y = hm)) +
geom_point(color = viridis_pal(option = "C")(7)[6], alpha = 1/4) +
scale_y_continuous(breaks = 1:10) +
labs(subtitle = "missing conditional on x")
p1 + p2
```
McElreath didn't fit this model in the text, but he encouraged us to do so on our own (p. 503). Here it is.
```{r b15.4b}
b15.4b <-
brm(data = d %>% filter(d == 0),
family = binomial,
h | trials(10) ~ 1 + s,