```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# The Many Variables & The Spurious Waffles
> Correlation in general is not surprising. In large data sets, every pair of variables has a statistically discernible non-zero correlation. But since most correlations do not indicate causal relationships, we need tools for distinguishing mere association from evidence of causation. This is why so much effort is devoted to **multiple regression**, using more than one predictor variable to simultaneously model an outcome. [@mcelreathStatisticalRethinkingBayesian2020, p. 123, **emphasis** in the original]
In his endnote #80 (p. 562), McElreath wrote: "See @meehlWhySummariesResearch1990, in particular the 'crud factor' described on page 204." For a fun look at some dubious correlations, check out the examples at [https://www.tylervigen.com/spurious-correlations](https://www.tylervigen.com/spurious-correlations).
But back to the text, McElreath's listed reasons for multivariable regression include:
* statistical control for confounds
* multiple/complex causation
* interactions
We'll approach the first two in this chapter. Interactions are reserved for [Chapter 7][Ulysses' Compass].
#### Rethinking: Causal inference.
"Despite its central importance, there is no unified approach to causal inference yet in the sciences" (p. 124). To dip into the topic, you might check out the recent blog post by Finn Lattimore and David Rohde, [*Causal inference with Bayes rule*](https://gradientinstitute.org/blog/6/); McElreath blog series on causal inference, starting with [*Regression, Fire, and Dangerous Things (1/3)*](https://elevanth.org/blog/2021/06/15/regression-fire-and-dangerous-things-1-3/); or McElreath's epic 3-hour introductory lecture on causal inference called [*Science Before Statistics: Causal Inference*](https://youtu.be/KNPYUVmY3NM).
## Spurious associations
Load the [Waffle House](https://www.snopes.com/fact-check/fema-waffle-house-index/) data.
```{r, message = F, warning = F}
library(tidyverse)
data(WaffleDivorce, package = "rethinking")
d <- WaffleDivorce
```
Did you notice how we used the `package` argument within the `data()` function, there? That allowed us to load the `WaffleDivorce` data without actually loading the **rethinking** package. Since we generally don't want to have both **rethinking** and **brms** loaded up at the same time, using the `package` argument will save us a line of code.
Now standardize the focal variables with the `rethinking::standardize()` function.
```{r}
d <-
d %>%
mutate(d = rethinking::standardize(Divorce),
m = rethinking::standardize(Marriage),
a = rethinking::standardize(MedianAgeMarriage))
```
Because we avoided directly loading the **rethinking** package, we did not have immediate access to McElreath's handy `standardize()` function. If you want to use a function from a package without loading that package, you can use the double colon operator `::`. You can learn more about the double colon operator [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ns-dblcolon.html). Now load **brms**.
```{r, message = F, warning = F}
rm(WaffleDivorce)
library(brms)
```
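Before we move on, if you're curious what `rethinking::standardize()` did under the hood, here's a minimal hand-rolled check. The `standardize_by_hand()` helper is our own name for illustration, not a function from either package.

```{r}
# a hand-rolled equivalent of rethinking::standardize()
standardize_by_hand <- function(x) (x - mean(x)) / sd(x)

# confirm it reproduces the standardized `a` column we made above
all.equal(standardize_by_hand(d$MedianAgeMarriage), d$a, check.attributes = FALSE)
```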
I'm not going to show the output, but you might go ahead and investigate the data with the typical functions. E.g.,
```{r, results = "hide"}
head(d)
glimpse(d)
```
Now we have our data, we can reproduce Figure 5.1. One convenient way to get the handful of state labels into the plot is with the `geom_text_repel()` function from the [**ggrepel** package](https://CRAN.R-project.org/package=ggrepel) [@R-ggrepel]. But first, we spent the last few chapters warming up with **ggplot2**. Going forward, each chapter will have its own plot theme. In this chapter, we'll characterize the plots with `theme_bw() + theme(panel.grid = element_blank())` and coloring based on `"firebrick"`.
```{r, fig.width = 3, fig.height = 3, message = F}
library(ggrepel)
d %>%
ggplot(aes(x = WaffleHouses/Population, y = Divorce)) +
stat_smooth(method = "lm", fullrange = T, size = 1/2,
color = "firebrick4", fill = "firebrick", alpha = 1/5) +
geom_point(size = 1.5, color = "firebrick4", alpha = 1/2) +
geom_text_repel(data = d %>% filter(Loc %in% c("ME", "OK", "AR", "AL", "GA", "SC", "NJ")),
aes(label = Loc),
size = 3, seed = 1042) + # this makes it reproducible
scale_x_continuous("Waffle Houses per million", limits = c(0, 55)) +
ylab("Divorce rate") +
coord_cartesian(xlim = c(0, 50), ylim = c(5, 15)) +
theme_bw() +
theme(panel.grid = element_blank())
```
Since these are geographically-based data, we might plot our three major variables in a map format. The [**tigris** package](https://github.com/walkerke/tigris) [@R-tigris] provides functions for retrieving latitude and longitude data for the 50 states, and we can plot them with the `ggplot2::geom_sf()` function. We'll use the `right_join()` function to combine those data with our primary data `d`[^2].
```{r, warning = F, message = F, results = "hide"}
library(tigris)
# get the map data
d_states <- states(cb = TRUE, resolution = "20m") %>%
shift_geometry() %>%
# add the primary data
right_join(d %>%
mutate(NAME = Location %>% as.character()) %>%
select(d:a, NAME),
by = "NAME") %>%
# convert to the long format for faceting
pivot_longer(cols = c("d", "m", "a"), names_to = "variable")
```
Now plot.
```{r, fig.width = 8, fig.height = 2, warning = F, message = F}
d_states %>%
ggplot() +
geom_sf(aes(fill = value, geometry = geometry),
size = 0) +
scale_fill_gradient(low = "#f8eaea", high = "firebrick4") +
theme_void() +
theme(legend.position = "none",
strip.text = element_text(margin = margin(0, 0, .5, 0))) +
facet_wrap(~ variable, labeller = label_both)
```
One of the advantages of this visualization method is it makes it easy to see that Nevada is missing from the `WaffleDivorce` data. Execute `d %>% distinct(Location)` to see for yourself and click [here](https://github.com/rmcelreath/rethinking/issues/62) to find out why it's missing. Those missing data should motivate the skills we'll cover in [Chapter 15][Missing Data and Other Opportunities].
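If you'd like a quicker confirmation, here's one way to list which state is absent. This little check is our own aside, and it assumes the `Location` column holds full state names.

```{r}
# which of the 50 state names never appears in `Location`?
setdiff(state.name, as.character(d$Location))
```

But let's get back on track.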
Here's the standard deviation for `MedianAgeMarriage` in its current metric.
```{r}
sd(d$MedianAgeMarriage)
```
```{r, echo = F, eval = F}
# Here we'll officially standardize the predictor, `MedianAgeMarriage`, and our criterion, `Divorce`. Before we jump in, we should address a technicality. If we just use base R `scale()` within our tidyverse framework, [it will cause down-the-road problems](https://stackoverflow.com/questions/35775696/trying-to-use-dplyr-to-group-by-and-apply-scale). In short, `scale()` expects a matrix and we're decidedly working within a data frame framework. We have a few options. One, you can always standardize by hand (e.g., `MedianAgeMarriage_s = (MedianAgeMarriage - mean(MedianAgeMarriage)) / sd(MedianAgeMarriage)`). Two, you can make your own custom scaling function. Three, you can use `scale()` and then convert the result with `as.vector()`. Anticipating the next model on page 127, here we'll showcase all three.
scale_this <- function(x, center = TRUE, scale = TRUE) {
as.vector(scale(x))
}
d <-
d %>%
mutate(MedianAgeMarriage_s = (MedianAgeMarriage - mean(MedianAgeMarriage)) / sd(MedianAgeMarriage),
Divorce_s = scale_this(Divorce),
Marriage_s = scale(Marriage) %>% as.vector())
```
Our first statistical model follows the form
\begin{align*}
\text{divorce_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta_1 \text{median_age_at_marriage_std}_i \\
\alpha & \sim \operatorname{Normal}(0, 0.2) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1),
\end{align*}
where the `_std` suffix indicates the variables are standardized (i.e., zero centered, with a standard deviation of one). Let's fit the first univariable model.
```{r b5.1}
b5.1 <-
brm(data = d,
family = gaussian,
d ~ 1 + a,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
sample_prior = T,
file = "fits/b05.01")
```
Did you notice the `sample_prior = T` line? This told **brms** to take draws from both the posterior distribution (as usual) and from the prior predictive distribution. If you look at McElreath's **R** code 5.4, you'll see he plotted 50 draws from the prior predictive distribution of his `m5.1`. For our **brms** workflow, our first step is to extract our prior draws with the well-named `prior_draws()` function.
```{r}
prior <- prior_draws(b5.1)
prior %>% glimpse()
```
We ended up with 4,000 draws from the prior predictive distribution, much like `as_draws_df()` would return 4,000 draws from the posterior. Next we'll use `slice_sample()` to take a random sample from our `prior` object. After just a little more wrangling, we'll be in good shape to plot our version of Figure 5.3.
```{r, fig.width = 3, fig.height = 3}
set.seed(5)
prior %>%
slice_sample(n = 50) %>%
rownames_to_column("draw") %>%
expand(nesting(draw, Intercept, b),
a = c(-2, 2)) %>%
mutate(d = Intercept + b * a) %>%
ggplot(aes(x = a, y = d)) +
geom_line(aes(group = draw),
color = "firebrick", alpha = .4) +
labs(x = "Median age marriage (std)",
y = "Divorce rate (std)") +
coord_cartesian(ylim = c(-2, 2)) +
theme_bw() +
theme(panel.grid = element_blank())
```
To get the posterior predictions from our **brms** model, we'll use `fitted()` in place of `link()`.
```{r, fig.width = 3, fig.height = 3}
# determine the range of `a` values we'd like to feed into `fitted()`
nd <- tibble(a = seq(from = -3, to = 3.2, length.out = 30))
# now use `fitted()` to get the model-implied trajectories
fitted(b5.1,
newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
# plot
ggplot(aes(x = a)) +
geom_smooth(aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "firebrick", color = "firebrick4", alpha = 1/5, size = 1/4) +
geom_point(data = d,
aes(y = d),
size = 2, color = "firebrick4") +
labs(x = "Median age marriage (std)",
y = "Divorce rate (std)") +
coord_cartesian(xlim = range(d$a),
ylim = range(d$d)) +
theme_bw() +
theme(panel.grid = element_blank())
```
That'll serve as our version of the right panel of Figure 5.2. To paraphrase McElreath, "if you inspect the [`print()`] output, you'll see that the posterior for $[\beta_\text{a}]$ is reliably negative" (p. 127). Let's see.
```{r}
print(b5.1)
```
On the standardized scale, -0.57 95% CI [-0.79, -0.34] is pretty negative, indeed.
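If you'd rather not scan the full `print()` output, the `fixef()` function can isolate the population-level summaries. Pulling just the slope this way is our own little aside, not something McElreath does here.

```{r}
# the posterior summary for the `a` slope only
fixef(b5.1)["a", ] %>% round(digits = 2)
```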
We're ready to fit our second univariable model.
```{r b5.2}
b5.2 <-
brm(data = d,
family = gaussian,
d ~ 1 + m,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.02")
```
The summary suggests $\beta_\text{m}$ is of a smaller magnitude.
```{r}
print(b5.2)
```
Now we'll wrangle and plot our version of the left panel in Figure 5.2.
```{r, fig.width = 3, fig.height = 3}
nd <- tibble(m = seq(from = -2.5, to = 3.5, length.out = 30))
fitted(b5.2, newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = m)) +
geom_smooth(aes(y = Estimate, ymin = Q2.5, ymax = Q97.5),
stat = "identity",
fill = "firebrick", color = "firebrick4", alpha = 1/5, size = 1/4) +
geom_point(data = d,
aes(y = d),
size = 2, color = "firebrick4") +
labs(x = "Marriage rate (std)",
y = "Divorce rate (std)") +
coord_cartesian(xlim = range(d$m),
ylim = range(d$d)) +
theme_bw() +
theme(panel.grid = element_blank())
```
> But merely comparing parameter means between different bivariate regressions is no way to decide which predictor is better. Both of these predictors could provide independent value, or they could be redundant, or one could eliminate the value of the other.
>
> To make sense of this, we're going to have to think causally. And then, only after we've done some thinking, a bigger regression model that includes both age at marriage and marriage rate will help us. (pp. 127--128)
### Think before you regress.
> It is helpful to introduce a particular type of causal graph known as a **DAG**, short for **directed acyclic graph**. *Graph* means it is nodes and connections. *Directed* means the connections have arrows that indicate directions of causal influence. And *acyclic* means that causes do not eventually flow back on themselves. A DAG is a way of describing qualitative causal relationships among variables. It isn't as detailed as a full model description, but it contains information that a purely statistical model does not. Unlike a statistical model, a DAG will tell you the consequences of intervening to change a variable. But only if the DAG is correct. There is no inference without assumption. (p. 128, **emphasis** in the original)
If you're interested in making directed acyclic graphs (DAG) in **R**, the [**dagitty**](https://CRAN.R-project.org/package=dagitty) [@R-dagitty; @dagitty2016] and [**ggdag**](https://CRAN.R-project.org/package=ggdag) [@R-ggdag] packages are handy. Our approach will focus on **ggdag**.
```{r, warning = F, message = F}
# library(dagitty)
library(ggdag)
```
If all you want is a quick and dirty DAG for our three variables, you might execute something like this.
```{r, fig.width = 3, fig.height = 1.75}
set.seed(5)
dagify(M ~ A,
D ~ A + M) %>%
ggdag(node_size = 8)
```
We can pretty it up a little, too.
```{r, fig.width = 3, fig.height = 1.5}
dag_coords <-
tibble(name = c("A", "M", "D"),
x = c(1, 3, 2),
y = c(2, 2, 1))
p1 <-
dagify(M ~ A,
D ~ A + M,
coords = dag_coords) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "firebrick", alpha = 1/4, size = 10) +
geom_dag_text(color = "firebrick") +
geom_dag_edges(edge_color = "firebrick") +
scale_x_continuous(NULL, breaks = NULL, expand = c(0.1, 0.1)) +
scale_y_continuous(NULL, breaks = NULL, expand = c(0.2, 0.2)) +
theme_bw() +
theme(panel.grid = element_blank())
p1
```
We could have left out the `coords` argument and let the `dagify()` function set the layout of the nodes on its own. But since we were picky and wanted to ape McElreath, we first specified our coordinates in a tibble and then included that tibble in the `coords` argument. For more on the topic, check out Barrett's [-@barrettAnIntroduction2022] vignette, [*An introduction to ggdag*](https://CRAN.R-project.org/package=ggdag/vignettes/intro-to-ggdag.html).
But anyway, our DAG
> represents a heuristic causal model. Like other models, it is an analytical assumption. The symbols $A$, $M$, and $D$ are our observed variables. The arrows show directions of influence. What this DAG says is:
>
> 1. $A$ directly influences $D$
> 2. $M$ directly influences $D$
> 3. $A$ directly influences $M$
>
> These statements can then have further implications. In this case, age of marriage influences divorce in two ways. First it has a direct effect, $A \rightarrow D$. Perhaps a direct effect would arise because younger people change faster than older people and are therefore more likely to grow incompatible with a partner. Second, it has an indirect effect by influencing the marriage rate, which then influences divorce, $A \rightarrow M \rightarrow D$. If people get married earlier, then the marriage rate may rise, because there are more young people. (p. 128)
Considering alternative models, "It could be that the association between $M$ and $D$ arises entirely from $A$'s influence on both $M$ and $D$. Like this:" (p. 129)
```{r, fig.width = 3, fig.height = 1.5}
p2 <-
dagify(M ~ A,
D ~ A,
coords = dag_coords) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "firebrick", alpha = 1/4, size = 10) +
geom_dag_text(color = "firebrick") +
geom_dag_edges(edge_color = "firebrick") +
scale_x_continuous(NULL, breaks = NULL, expand = c(0.1, 0.1)) +
scale_y_continuous(NULL, breaks = NULL, expand = c(0.2, 0.2)) +
theme_bw() +
theme(panel.grid = element_blank())
p2
```
> This DAG is also consistent with the posterior distributions of models [`b5.1`] and [`b5.2`]. Why? Because both $M$ and $D$ "listen" to $A$. They have information from $A$. So when you inspect the association between $D$ and $M$, you pick up that common information that they both got from listening to $A$. You’ll see a more formal way to deduce this, in the next chapter.
>
> So which is it? Is there a direct effect of marriage rate, or rather is age at marriage just driving both, creating a spurious correlation between marriage rate and divorce rate? To find out, we need to consider carefully what each DAG implies. That's what's next. (p. 129)
#### Rethinking: What's a cause?
> Questions of causation can become bogged down in philosophical debates. These debates are worth having. But they don’t usually intersect with statistical concerns. Knowing a cause in statistics means being able to correctly predict the consequences of an intervention. There are contexts in which even this is complicated. (p. 129)
### Testable implications.
So far, we have entertained two DAGs. Here we use **patchwork** to combine them into one plot.
```{r, fig.width = 5.5, fig.height = 1.5}
library(patchwork)
p1 | p2
```
McElreath encouraged us to examine the correlations among these three variables with `cor()`.
```{r}
d %>%
select(d:a) %>%
cor()
```
If you just want the lower triangle, you can use the `lowerCor()` function from the [**psych** package](https://CRAN.R-project.org/package=psych) [@R-psych].
```{r, warning = F, message = F}
library(psych)
d %>%
select(d:a) %>%
lowerCor(digits = 3)
```
Our second DAG, above, suggests "that $D$ is independent of $M$, conditional on $A$" (p. 130). We can use the `dagitty::impliedConditionalIndependencies()` function to express that conditional independence in formal notation.
```{r, warning = F, message = F}
library(dagitty)
dagitty('dag{ D <- A -> M }') %>%
impliedConditionalIndependencies()
```
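As a quick, non-Bayesian sanity check of that implication (this aside is ours, not McElreath's), we can residualize both `d` and `m` on `a` with simple `lm()` fits and correlate the results. A value near zero is consistent with the second DAG.

```{r}
# the sample partial correlation of `d` and `m`, conditioning on `a`
cor(residuals(lm(d ~ a, data = d)),
    residuals(lm(m ~ a, data = d))) %>%
  round(digits = 3)
```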
The lack of conditional independencies in the first DAG may be expressed this way.
```{r}
dagitty('dag{D <- A -> M -> D}') %>%
impliedConditionalIndependencies()
```
Okay, that was a bit of a tease. "There are no conditional independencies, so there is no output to display" (p. 131). To close out this section,
> once you fit a multiple regression to predict divorce using both marriage rate and age at marriage, the model addresses the questions:
>
> 1. After I already know marriage rate, what additional value is there in also knowing age at marriage?
> 2. After I already know age at marriage, what additional value is there in also knowing marriage rate?
>
> The parameter estimates corresponding to each predictor are the (often opaque) answers to these questions. The questions above are descriptive, and the answers are also descriptive. It is only the derivation of the testable implications above that gives these descriptive results a causal meaning. But that meaning is still dependent upon believing the DAG. (p. 131)
### Multiple regression notation.
We can write the statistical formula for our first multivariable model as
\begin{align*}
\text{Divorce_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta_1 \text{Marriage_std}_i + \beta_2 \text{MedianAgeMarriage_std}_i \\
\alpha & \sim \operatorname{Normal}(0, 0.2) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\beta_2 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
### Approximating the posterior.
Much like we used the `+` operator to add single predictors to the intercept, we just use more `+` operators in the `formula` argument to add more predictors. Also notice we're using the same prior `prior(normal(0, 0.5), class = b)` for both predictors. Within the **brms** framework, they are both of `class = b`. But if we wanted their priors to differ, we'd make two `prior()` statements and differentiate them with the `coef` argument. You'll see examples of that later on.
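Just to make that concrete, here's a hypothetical sketch of what coefficient-specific priors would look like; the particular values are made up for illustration and we won't use them in this chapter.

```{r, eval = F}
# hypothetical: different priors for the two slopes via the `coef` argument
c(prior(normal(0, 0.2), class = Intercept),
  prior(normal(0, 0.5), class = b, coef = m),
  prior(normal(0, 0.25), class = b, coef = a),
  prior(exponential(1), class = sigma))
```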
```{r b5.3}
b5.3 <-
brm(data = d,
family = gaussian,
d ~ 1 + m + a,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.03")
```
Behold the summary.
```{r}
print(b5.3)
```
The **brms** package doesn't have a convenience function like `rethinking::coeftab()`. However, we can make something similar with a little deft wrangling and **ggplot2** code.
```{r, fig.width = 4, fig.height = 2.25, warning = F, message = F}
# first, extract and rename the necessary posterior parameters
bind_cols(
as_draws_df(b5.1) %>%
transmute(`b5.1_beta[A]` = b_a),
as_draws_df(b5.2) %>%
transmute(`b5.2_beta[M]` = b_m),
as_draws_df(b5.3) %>%
transmute(`b5.3_beta[M]` = b_m,
`b5.3_beta[A]` = b_a)
) %>%
# convert them to the long format, group, and get the posterior summaries
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value),
ll = quantile(value, prob = .025),
ul = quantile(value, prob = .975)) %>%
# since the `key` variable is really two variables in one, here we split them up
separate(col = name, into = c("fit", "parameter"), sep = "_") %>%
# plot!
ggplot(aes(x = mean, xmin = ll, xmax = ul, y = fit)) +
geom_vline(xintercept = 0, color = "firebrick", alpha = 1/5) +
geom_pointrange(color = "firebrick") +
labs(x = "posterior", y = NULL) +
theme_bw() +
theme(panel.grid = element_blank(),
strip.background = element_rect(fill = "transparent", color = "transparent")) +
facet_wrap(~ parameter, ncol = 1, labeller = label_parsed)
```
Don't worry, coefficient plots won't always be this complicated. We'll walk through simpler ones toward the end of the chapter.
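If all you want is a quick default, **brms** also comes with a built-in `mcmc_plot()` method that plots parameter intervals in one line. It won't match McElreath's formatting, and we won't rely on it here, but it's handy for a fast look.

```{r, fig.width = 5, fig.height = 2.5}
# a one-line interval plot of b5.3's parameters, using the brms defaults
mcmc_plot(b5.3)
```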
The substantive interpretation of all those coefficients is: "*Once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State*" (p. 134, *emphasis* in the original). This coheres well with one of our `impliedConditionalIndependencies()` statements, from above.
```{r}
dagitty('dag{ D <- A -> M }') %>%
impliedConditionalIndependencies()
```
#### Overthinking: Simulating the divorce example.
Okay, let's simulate our divorce data in a **tidyverse** sort of way.
```{r}
# how many states would you like?
n <- 50
set.seed(5)
sim_d <-
tibble(age = rnorm(n, mean = 0, sd = 1)) %>% # sim A
mutate(mar = rnorm(n, mean = -age, sd = 1), # sim A -> M
div = rnorm(n, mean = age, sd = 1)) # sim A -> D
head(sim_d)
```
We simulated those data based on this formulation.
```{r}
dagitty('dag{divorce <- age -> marriage}') %>%
impliedConditionalIndependencies()
```
Here are the quick `pairs()` plots.
```{r, fig.width = 4.5, fig.height = 4.5}
pairs(sim_d, col = "firebrick4")
```
If we use the `update()` function, we can refit the last models in haste.
```{r b5.1_sim}
b5.1_sim <-
update(b5.1,
newdata = sim_d,
formula = div ~ 1 + age,
seed = 5,
file = "fits/b05.01_sim")
b5.2_sim <-
update(b5.2,
newdata = sim_d,
formula = div ~ 1 + mar,
seed = 5,
file = "fits/b05.02_sim")
b5.3_sim <-
update(b5.3,
newdata = sim_d,
formula = div ~ 1 + mar + age,
seed = 5,
file = "fits/b05.03_sim")
```
The steps for our homemade coefficient plot are basically the same. Just switch out some of the names.
```{r, fig.width = 4, fig.height = 2.25, warning = F, message = F}
bind_cols(
as_draws_df(b5.1_sim) %>%
transmute(`b5.1_beta[A]` = b_age),
as_draws_df(b5.2_sim) %>%
transmute(`b5.2_beta[M]` = b_mar),
as_draws_df(b5.3_sim) %>%
transmute(`b5.3_beta[M]` = b_mar,
`b5.3_beta[A]` = b_age)
) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value),
ll = quantile(value, prob = .025),
ul = quantile(value, prob = .975)) %>%
# since the `key` variable is really two variables in one, here we split them up
separate(name, into = c("fit", "parameter"), sep = "_") %>%
# plot!
ggplot(aes(x = mean, xmin = ll, xmax = ul, y = fit)) +
geom_vline(xintercept = 0, color = "firebrick", alpha = 1/5) +
geom_pointrange(color = "firebrick") +
labs(x = "posterior", y = NULL) +
theme_bw() +
theme(panel.grid = element_blank(),
strip.background = element_blank()) +
facet_wrap(~ parameter, ncol = 1, labeller = label_parsed)
```
Well, okay. This is the same basic pattern, but with the signs switched and with a little simulation variability thrown in. But you get the picture.
### Plotting multivariate posteriors.
"Let's pause for a moment, before moving on. There are a lot of moving parts here: three variables, some strange DAGs, and three models. If you feel at all confused, it is only because you are paying attention" (p. 133).
Preach, brother.
Down a little further, McElreath gave us this deflationary delight: "There is a huge literature detailing a variety of plotting techniques that all attempt to help one understand multiple linear regression. None of these techniques is suitable for all jobs, and most do not generalize beyond linear regression" (pp. 134--135). Now that you're inspired, let's learn three:
* predictor residual plots
* posterior prediction plots
* counterfactual plots
#### Predictor residual plots.
To get ready to make our residual plots, we'll predict one predictor, `m`, with another one, `a`.
```{r b5.4}
b5.4 <-
brm(data = d,
family = gaussian,
m ~ 1 + a,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.04")
```
```{r}
print(b5.4)
```
With `fitted()`, we compute the expected values for each state (with the exception of Nevada). Since the `a` values for each state are in the data we used to fit the model, we'll omit the `newdata` argument.
```{r}
f <-
fitted(b5.4) %>%
data.frame() %>%
bind_cols(d)
glimpse(f)
```
After a little data processing, we can make the upper left panel of Figure 5.4.
```{r, fig.width = 3, fig.height = 3}
p1 <-
f %>%
ggplot(aes(x = a, y = m)) +
geom_point(size = 2, shape = 1, color = "firebrick4") +
geom_segment(aes(xend = a, yend = Estimate),
size = 1/4) +
geom_line(aes(y = Estimate),
color = "firebrick4") +
geom_text_repel(data = . %>% filter(Loc %in% c("WY", "ND", "ME", "HI", "DC")),
aes(label = Loc),
size = 3, seed = 14) +
labs(x = "Age at marriage (std)",
y = "Marriage rate (std)") +
coord_cartesian(ylim = range(d$m)) +
theme_bw() +
theme(panel.grid = element_blank())
p1
```
We get the residuals with the well-named `residuals()` function. Much like with `brms::fitted()`, `brms::residuals()` returns a four-column matrix with the number of rows equal to the number of observations in the original data (by default, anyway). The columns have the familiar names: `Estimate`, `Est.Error`, `Q2.5`, and `Q97.5`. See the [**brms** reference manual](https://CRAN.R-project.org/package=brms/brms.pdf) [@brms2022RM] for details.
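Here's a quick peek at the first few rows, just to confirm that structure. This check isn't in the text.

```{r}
# the first few rows of the residuals for b5.4
residuals(b5.4) %>% head() %>% round(digits = 2)
```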
With our residuals in hand, we just need a little more data processing to make the lower left panel of Figure 5.4.
```{r, fig.width = 3, fig.height = 3, message = F}
r <-
residuals(b5.4) %>%
# to use this in ggplot2, we need to make it a tibble or data frame
data.frame() %>%
bind_cols(d)
p3 <-
r %>%
ggplot(aes(x = Estimate, y = d)) +
stat_smooth(method = "lm", fullrange = T,
color = "firebrick4", fill = "firebrick4",
alpha = 1/5, size = 1/2) +
geom_vline(xintercept = 0, linetype = 2, color = "grey50") +
geom_point(size = 2, color = "firebrick4", alpha = 2/3) +
geom_text_repel(data = . %>% filter(Loc %in% c("WY", "ND", "ME", "HI", "DC")),
aes(label = Loc),
size = 3, seed = 5) +
scale_x_continuous(limits = c(-2, 2)) +
coord_cartesian(xlim = range(r$Estimate)) +
labs(x = "Marriage rate residuals",
y = "Divorce rate (std)") +
theme_bw() +
theme(panel.grid = element_blank())
p3
```
To get the residuals for `a` (median age at marriage, standardized), we have to fit the corresponding model where `m` predicts `a`.
```{r b5.4b}
b5.4b <-
brm(data = d,
family = gaussian,
a ~ 1 + m,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.04b")
```
With `b5.4b` in hand, we're ready to make the upper right panel of Figure 5.4.
```{r, fig.width = 3, fig.height = 3}
p2 <-
fitted(b5.4b) %>%
data.frame() %>%
bind_cols(d) %>%
ggplot(aes(x = m, y = a)) +
geom_point(size = 2, shape = 1, color = "firebrick4") +
geom_segment(aes(xend = m, yend = Estimate),
size = 1/4) +
geom_line(aes(y = Estimate),
color = "firebrick4") +
geom_text_repel(data = . %>% filter(Loc %in% c("DC", "HI", "ID")),
aes(label = Loc),
size = 3, seed = 5) +
labs(x = "Marriage rate (std)",
y = "Age at marriage (std)") +
coord_cartesian(ylim = range(d$a)) +
theme_bw() +
theme(panel.grid = element_blank())
p2
```
And now we'll get the new batch of residuals, do a little data processing, and make a plot corresponding to the final panel of Figure 5.4.
```{r, fig.width = 3, fig.height = 3, message = F}
r <-
residuals(b5.4b) %>%
data.frame() %>%
bind_cols(d)
p4 <-
r %>%
ggplot(aes(x = Estimate, y = d)) +
stat_smooth(method = "lm", fullrange = T,
color = "firebrick4", fill = "firebrick4",
alpha = 1/5, size = 1/2) +
geom_vline(xintercept = 0, linetype = 2, color = "grey50") +
geom_point(size = 2, color = "firebrick4", alpha = 2/3) +
geom_text_repel(data = . %>% filter(Loc %in% c("ID", "HI", "DC")),
aes(label = Loc),
size = 3, seed = 5) +
scale_x_continuous(limits = c(-2, 3)) +
coord_cartesian(xlim = range(r$Estimate),
ylim = range(d$d)) +
labs(x = "Age at marriage residuals",
y = "Divorce rate (std)") +
theme_bw() +
theme(panel.grid = element_blank())
p4
```
Here we close out the section by combining our four subplots into one glorious whole with a little **patchwork** syntax.
```{r, fig.width = 6, fig.height = 6, message = F}
p1 + p2 + p3 + p4 + plot_annotation(title = "Understanding multiple regression through residuals")
```
##### Rethinking: Residuals are parameters, not data.
> There is a tradition, especially in parts of biology, of using residuals from one model as data in another model. For example, a biologist might regress brain size on body size and then use the brain size residuals as data in another model. This procedure is always a mistake. Residuals are not known. They are parameters, variables with unobserved values. Treating them as known values throws away uncertainty. (p. 137)
Let's hammer this point home. Recall how `brms::residuals()` returns four columns: `Estimate`, `Est.Error`, `Q2.5`, and `Q97.5`.
```{r, warning = F}
r %>%
glimpse()
```
In the residual plots from the lower two panels of Figure 5.4, we focused on the means of the residuals (i.e., `Estimate`). However, we can express the uncertainty in the residuals by including error bars for the 95% intervals. Here's what that might look like with a slight reworking of the lower right panel of Figure 5.4.
```{r, fig.width = 3, fig.height = 3, message = F}
r %>%
ggplot(aes(x = Estimate, y = d)) +
stat_smooth(method = "lm", fullrange = T,
color = "firebrick4", fill = "firebrick4",
alpha = 1/5, size = 1/2) +
geom_vline(xintercept = 0, linetype = 2, color = "grey50") +
# the only change is here
geom_pointrange(aes(xmin = Q2.5, xmax = Q97.5),
color = "firebrick4", alpha = 2/3) +
geom_text_repel(data = . %>% filter(Loc %in% c("ID", "HI", "DC")),
aes(label = Loc),
size = 3, seed = 5) +
scale_x_continuous(limits = c(-2, 3)) +
coord_cartesian(xlim = range(r$Estimate),
ylim = range(d$d)) +
labs(x = "Age at marriage residuals",
y = "Divorce rate (std)") +
theme_bw() +
theme(panel.grid = element_blank())
```
Look at that. If you were to fit a follow-up model based on only the point estimates (posterior means) of those residuals, you'd be ignoring a lot of uncertainty. For more on the topic of residuals, see @freckleton2002misuse, [*On the misuse of residuals in ecology: regression of residuals vs. multiple regression*](https://doi.org/10.1046/j.1365-2656.2002.00618.x).
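To make the warning concrete, here's a sketch of the ill-advised two-step approach: regressing the outcome on the residual point estimates as if they were data. It's shown only for illustration; we don't fit or recommend it, and the object names are our own.

```{r, eval = F}
# step 1: treat the residual point estimates as if they were data
d_two_step <-
  residuals(b5.4b) %>%
  data.frame() %>%
  transmute(a_residual = Estimate) %>%
  bind_cols(d)

# step 2: a follow-up regression that ignores the uncertainty in those residuals
lm(d ~ a_residual, data = d_two_step)
```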
#### Posterior prediction plots.
"It's important to check the model's implied predictions against the observed data" (p. 137). For more on the topic, check out Gabry and colleagues' [-@gabry2019visualization] [*Visualization in Bayesian workflow*](https://arxiv.org/abs/1709.01449) or Simpson's related blog post, [*Touch me, I want to feel your data*](https://statmodeling.stat.columbia.edu/2017/09/07/touch-want-feel-data/).
The code below will make our version of Figure 5.5.
```{r, fig.width = 3, fig.height = 3}
fitted(b5.3) %>%
data.frame() %>%
# un-standardize the model predictions
mutate_all(~. * sd(d$Divorce) + mean(d$Divorce)) %>%
bind_cols(d) %>%
ggplot(aes(x = Divorce, y = Estimate)) +
geom_abline(linetype = 2, color = "grey50", size = .5) +
geom_point(size = 1.5, color = "firebrick4", alpha = 3/4) +
geom_linerange(aes(ymin = Q2.5, ymax = Q97.5),
size = 1/4, color = "firebrick4") +
geom_text(data = . %>% filter(Loc %in% c("ID", "UT", "RI", "ME")),
aes(label = Loc),
hjust = 1, nudge_x = - 0.25) +
labs(x = "Observed divorce", y = "Predicted divorce") +
theme_bw() +
theme(panel.grid = element_blank())
```
> It's easy to see from this arrangement of the simulations that the model under-predicts for States with very high divorce rates while it over-predicts for States with very low divorce rates. That's normal. This is what regression does--it is skeptical of extreme values, so it expects regression towards the mean. But beyond this general regression to the mean, some States are very frustrating to the model, lying very far from the diagonal. (p. 139)
##### Rethinking: Stats, huh, yeah what is it good for?
> Often people want statistical modeling to do things that statistical modeling cannot do. For example, we'd like to know whether an effect is "real" or rather spurious. Unfortunately, modeling merely quantifies uncertainty in the precise way that the model understands the problem. Usually answers to large world questions about truth and causation depend upon information not included in the model. For example, any observed correlation between an outcome and predictor could be eliminated or reversed once another predictor is added to the model. But if we cannot think of the right variable, we might never notice. Therefore all statistical models are vulnerable to and demand critique, regardless of the precision of their estimates and apparent accuracy of their predictions. (p. 139)
##### Overthinking: Simulating spurious association.
```{r}
n <- 100 # number of cases
set.seed(5) # setting the seed makes the results reproducible
d_spur <-
tibble(x_real = rnorm(n), # x_real as Gaussian with mean 0 and SD 1 (i.e., the defaults)
x_spur = rnorm(n, x_real), # x_spur as Gaussian with mean = x_real
y = rnorm(n, x_real)) # y as Gaussian with mean = x_real
```
Here are the quick `pairs()` plots.
```{r, fig.width = 4.5, fig.height = 4.5}
pairs(d_spur, col = "firebrick4")
```
We may as well fit and evaluate a model.
```{r b5.0_spur}
b5.0_spur <-
brm(data = d_spur,
family = gaussian,
y ~ 1 + x_real + x_spur,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.00_spur")
```
```{r}
fixef(b5.0_spur) %>%
round(digits = 2)
```
If we let "r" stand for `x_rel` and "s" stand for `x_spur`, here's how we might depict that our simulation in a DAG.
```{r, fig.width = 3, fig.height = 1.5}
dag_coords <-
tibble(name = c("r", "s", "y"),
x = c(1, 3, 2),
y = c(2, 2, 1))
dagify(s ~ r,
y ~ r,
coords = dag_coords) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "firebrick", alpha = 1/4, size = 10) +
geom_dag_text(color = "firebrick") +
geom_dag_edges(edge_color = "firebrick") +
scale_x_continuous(NULL, breaks = NULL, expand = c(0.1, 0.1)) +
scale_y_continuous(NULL, breaks = NULL, expand = c(0.2, 0.2)) +
theme_bw() +
theme(panel.grid = element_blank())
```
#### Counterfactual plots.
> A second sort of inferential plot displays the causal implications of the model. I call these plots **counterfactual**, because they can be produced for any values of the predictor variables you like, even unobserved combinations like very high median age of marriage and very high marriage rate. There are no States with this combination, but in a counterfactual plot, you can ask the model for a prediction for such a State. (p. 140, **emphasis** in the original)
Take another look at one of the DAGs from back in Section 5.1.2.
```{r, fig.width = 3, fig.height = 1.5}
dag_coords <-
tibble(name = c("A", "M", "D"),
x = c(1, 3, 2),
y = c(2, 2, 1))
dagify(M ~ A,
D ~ A + M,
coords = dag_coords) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "firebrick", alpha = 1/4, size = 10) +
geom_dag_text(color = "firebrick") +
geom_dag_edges(edge_color = "firebrick") +
scale_x_continuous(NULL, breaks = NULL, expand = c(0.1, 0.1)) +
scale_y_continuous(NULL, breaks = NULL, expand = c(0.2, 0.2)) +
theme_bw() +
theme(panel.grid = element_blank())
```
The full statistical model implied in this DAG requires we have two criterion variables, $D$ and $M$. To simultaneously model the effects of $A$ and $M$ on $D$ AND the effect of $A$ on $M$ with **brms**, we'll need to invoke the multivariate syntax. There are several ways to do this with **brms**, which Bürkner outlines in his [-@Bürkner2022Multivariate] vignette, [*Estimating multivariate models with brms*](https://CRAN.R-project.org/package=brms/vignettes/brms_multivariate.html). At this point, it's important to recognize we have two regression models. As a first step, we might specify each model separately in a `bf()` function and save them as objects.
```{r}
d_model <- bf(d ~ 1 + a + m)
m_model <- bf(m ~ 1 + a)
```
Next we will combine our `bf()` objects with the `+` operator within the `brm()` function. For a model like this, we also specify `set_rescor(FALSE)` to prevent **brms** from adding a residual correlation between `d` and `m`. Also, notice how each prior statement includes a `resp` argument. This clarifies which sub-model the prior refers to.
```{r b5.3_A}
b5.3_A <-
brm(data = d,
family = gaussian,
d_model + m_model + set_rescor(FALSE),
prior = c(prior(normal(0, 0.2), class = Intercept, resp = d),
prior(normal(0, 0.5), class = b, resp = d),
prior(exponential(1), class = sigma, resp = d),
prior(normal(0, 0.2), class = Intercept, resp = m),
prior(normal(0, 0.5), class = b, resp = m),
prior(exponential(1), class = sigma, resp = m)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 5,
file = "fits/b05.03_A")
```
Look at the summary.
```{r}
print(b5.3_A)
```
Note our parameters now all have either a `d_` or an `m_` prefix to help clarify which sub-model they belong to. The `m_a` row shows how strongly and negatively associated `a` is with `m`. Those prefixes also carry over to the rownames of `fixef()`, which makes it easy to grab the population-level summaries for one sub-model at a time (a quick aside of our own).
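```{r}
# the d_ and m_ prefixes show up in the fixef() rownames, too
fixef(b5.3_A) %>% round(digits = 2)
```

Here's how we might use `predict()` to make our version of the counterfactual plot in the left panel of Figure 5.6.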
```{r, fig.width = 3, fig.height = 3}
nd <- tibble(a = seq(from = -2, to = 2, length.out = 30),
m = 0)
p1 <-
predict(b5.3_A,
resp = "d",
newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = a, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
geom_smooth(stat = "identity",
fill = "firebrick", color = "firebrick4", alpha = 1/5, size = 1/4) +
labs(subtitle = "Total counterfactual effect of A on D",
x = "manipulated A",
y = "counterfactual D") +
coord_cartesian(ylim = c(-2, 2)) +
theme_bw() +
theme(panel.grid = element_blank())
p1
```
Because the plot is based on a multivariate model, we used the `resp` argument within `predict()` to tell **brms** which of our two criterion variables (`d` or `m`) we were interested in. Unlike McElreath's **R** code 5.20, we included predictor values for both `a` and `m`. This is because **brms** requires we provide values for all predictors in a model when using `predict()`. Even though we set all the `m` values to 0 for the counterfactual, it was necessary to tell `predict()` that's exactly what we wanted.
Let's do that all again, this time making the counterfactual for `m`. While we're at it, we'll combine this subplot with the last one to make the full version of Figure 5.6.
```{r, fig.width = 6, fig.height = 3.5}
nd <- tibble(a = seq(from = -2, to = 2, length.out = 30))
p2 <-
predict(b5.3_A,
resp = "m",
newdata = nd) %>%
data.frame() %>%
bind_cols(nd) %>%
ggplot(aes(x = a, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
geom_smooth(stat = "identity",
fill = "firebrick", color = "firebrick4", alpha = 1/5, size = 1/4) +
labs(subtitle = "Counterfactual effect of A on M",
x = "manipulated A",
y = "counterfactual M") +
coord_cartesian(ylim = c(-2, 2)) +
theme_bw() +
theme(panel.grid = element_blank())
p1 + p2 + plot_annotation(title = "Counterfactual plots for the multivariate divorce model")
```
With our **brms** + **tidyverse** paradigm, we might compute "the expected causal effect of increasing median age at marriage from 20 to 30" (p. 142) like this.
```{r}