```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 100)
```
# The Haunted DAG & The Causal Terror
Read this opening and cry:
> It seems like the most newsworthy scientific studies are the least trustworthy. The more likely it is to kill you, if true, the less likely it is to be true. The more boring the topic, the more rigorous the results. How could this widely believed negative correlation exist? There doesn't seem to be any reason for studies of topics that people care about to produce less reliable results. Maybe popular topics attract more and worse researchers, like flies drawn to the smell of honey?
>
> Actually all that is necessary for such a negative correlation to arise is that peer reviewers care about both newsworthiness and trustworthiness. Whether it is grant review or journal review, if editors and reviewers care about both, then the act of selection itself is enough to make the most newsworthy studies the least trustworthy....
>
> Strong selection induces a negative correlation among the criteria used in selection. Why? If the only way to cross the threshold is to score high, it is more common to score high on one item than on both. Therefore among funded proposals, the most newsworthy studies can actually have less than average trustworthiness (less than 0 in the figure). Similarly the most trustworthy studies can be less newsworthy than average.
>
> This general phenomenon has been recognized for a long time. It is sometimes called **Berkson's paradox**. But it is easier to remember if we call it the *selection-distortion effect*. Once you appreciate this effect, you'll see it everywhere....
>
> The selection-distortion effect can happen inside of a multiple regression, because the act of adding a predictor induces statistical selection within the model, a phenomenon that goes by the unhelpful name **collider bias**. This can mislead us into believing, for example, that there is a negative association between newsworthiness and trustworthiness in general, when in fact it is just a consequence of conditioning on some variable. This is both a deeply confusing fact and one that is important to understand in order to regress responsibly.
>
> This chapter and the next are both about terrible things that can happen when we simply add variables to a regression, without a clear idea of a causal model. [@mcelreathStatisticalRethinkingBayesian2020, pp. 161--162, **emphasis** in the original]
#### Overthinking: Simulated science distortion.
First let's run the simulation.
```{r, warning = F, message = F}
library(tidyverse)
set.seed(1914)
n <- 200 # number of grant proposals
p <- 0.1 # proportion to select
d <-
# uncorrelated newsworthiness and trustworthiness
tibble(newsworthiness = rnorm(n, mean = 0, sd = 1),
trustworthiness = rnorm(n, mean = 0, sd = 1)) %>%
# total_score
mutate(total_score = newsworthiness + trustworthiness) %>%
# select top 10% of combined scores
mutate(selected = ifelse(total_score >= quantile(total_score, 1 - p), TRUE, FALSE))
head(d)
```
Here's the correlation among those cases for which `selected == TRUE`.
```{r}
d %>%
filter(selected == TRUE) %>%
select(newsworthiness, trustworthiness) %>%
cor()
```
For the plots in this chapter, we'll take some aesthetic cues from Aki Vehtari's great GitHub repo, [*Bayesian Data Analysis R Demos*](https://github.com/avehtari/BDA_R_demos).
```{r}
theme_set(theme_minimal())
```
Okay, let's make Figure 6.1.
```{r, fig.width = 3.5, fig.height = 3.25, message = F}
# we'll need this for the annotation
text <-
tibble(newsworthiness = c(2, 1),
trustworthiness = c(2.25, -2.5),
selected = c(TRUE, FALSE),
label = c("selected", "rejected"))
d %>%
ggplot(aes(x = newsworthiness, y = trustworthiness, color = selected)) +
geom_point(aes(shape = selected), alpha = 3/4) +
geom_text(data = text,
aes(label = label)) +
geom_smooth(data = . %>% filter(selected == TRUE),
method = "lm", fullrange = T,
color = "lightblue", se = F, linewidth = 1/2) +
scale_color_manual(values = c("black", "lightblue")) +
scale_shape_manual(values = c(1, 19)) +
scale_x_continuous(limits = c(-3, 3.9), expand = c(0, 0)) +
coord_cartesian(ylim = range(d$trustworthiness)) +
theme(legend.position = "none")
```
Why might "the most newsworthy studies might be the least trustworthy?" The selection-distortion effect.
## Multicollinearity
> Multicollinearity means a very strong association between two or more predictor variables. *The raw correlation isn't what matters. Rather what matters is the association, conditional on the other variables in the model.* The consequence of multicollinearity is that the posterior distribution will seem to suggest that none of the variables is reliably associated with the outcome, even if all of the variables are in reality strongly associated with the outcome.
> This frustrating phenomenon arises from the details of how multiple regression works. In fact, there is nothing wrong with multicollinearity. The model will work fine for prediction. You will just be frustrated trying to understand it. The hope is that once you understand multicollinearity, you will better understand regression models in general. (p. 163, *emphasis* added)
For more on this topic, check out [Jan VanHove](https://janhove.github.io/)'s interesting blog post, [*Collinearity isn't a disease that needs curing*](https://janhove.github.io/analysis/2019/09/11/collinearity).
### Multicollinear legs.
Let's simulate some leg data.
```{r}
n <- 100
set.seed(909)
d <-
tibble(height = rnorm(n, mean = 10, sd = 2),
leg_prop = runif(n, min = 0.4, max = 0.5)) %>%
mutate(leg_left = leg_prop * height + rnorm(n, mean = 0, sd = 0.02),
leg_right = leg_prop * height + rnorm(n, mean = 0, sd = 0.02))
```
As you might expect in real-life data, the `leg_left` and `leg_right` columns are **highly** correlated.
```{r}
d %>%
summarise(r = cor(leg_left, leg_right) %>% round(digits = 4))
```
Have you ever even seen a $\rho = .9997$ correlation before? Here it is in a plot.
```{r, fig.width = 3, fig.height = 3}
d %>%
ggplot(aes(x = leg_left, y = leg_right)) +
geom_point(alpha = 1/2, color = "forestgreen")
```
Load **brms**.
```{r, message = F, warning = F}
library(brms)
```
Here's our attempt to predict `height` with both legs.
```{r b6.1}
b6.1 <-
brm(data = d,
family = gaussian,
height ~ 1 + leg_left + leg_right,
prior = c(prior(normal(10, 100), class = Intercept),
prior(normal(2, 10), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.01")
```
Let's inspect the damage.
```{r}
print(b6.1)
```
That 'Est.Error' column isn't looking too good. But it's easy to miss that, which is why McElreath suggested "a graphical view of the [output] is more useful because it displays the posterior means and [intervals] in a way that allows us with a glance to see that something has gone wrong here" (p. 164).
Here's our coefficient plot using `brms::mcmc_plot()` with a little help from `bayesplot::color_scheme_set()`.
```{r, message = F, warning = F, fig.width = 6.5, fig.height = 1.75}
library(bayesplot)
color_scheme_set("orange")
mcmc_plot(b6.1,
type = "intervals",
prob = .5,
prob_outer = .95,
point_est = "mean") +
labs(title = "The coefficient plot for the two-leg model",
subtitle = "Holy smokes; look at the widths of those betas!") +
theme(axis.text.y = element_text(hjust = 0),
panel.grid.minor = element_blank(),
strip.text = element_text(hjust = 0))
```
Now you can use the `brms::mcmc_plot()` function without explicitly loading the **bayesplot** package. But loading **bayesplot** allows you to set the color scheme with `color_scheme_set()`.
In the middle of page 164, McElreath suggested we might try this again with different seeds. This might be a good place to introduce iteration. Our first step will be to make a custom function that will simulate new data of the same form as above and then immediately fit a model based on `b6.1` to those new data. To speed up the process, we'll use the `update()` function to avoid recompiling the model. Our custom function, `sim_and_fit()`, will take two arguments. The `seed` argument will allow us to keep the results reproducible by setting a seed for the data simulation. The `n` argument will allow us, should we wish, to change the sample size.
```{r}
sim_and_fit <- function(seed, n = 100) {
# set up the parameters
n <- n
set.seed(seed)
# simulate the new data
d <-
tibble(height = rnorm(n, mean = 10, sd = 2),
leg_prop = runif(n, min = 0.4, max = 0.5)) %>%
mutate(leg_left = leg_prop * height + rnorm(n, mean = 0, sd = 0.02),
leg_right = leg_prop * height + rnorm(n, mean = 0, sd = 0.02))
# update b6.1 to the new data
fit <- update(b6.1, newdata = d, seed = 6)
}
```
Now use `sim_and_fit()` to make our simulations which correspond to seed values `1:4`. By nesting the `seed` values and the `sim_and_fit()` function within `purrr::map()`, the results will be saved within our tibble, `sim`.
```{r, eval = F}
sim <-
tibble(seed = 1:4) %>%
mutate(post = map(seed, ~ sim_and_fit(.) %>%
as_draws_df()))
```
```{r, echo = F}
# save(sim, file = "sims/06.sim_1.1.rda")
# rm(sim)
load("sims/06.sim_1.1.rda")
```
Take a look at what we did.
```{r}
head(sim)
```
We have a tibble with four rows and two columns. Hopefully it's unsurprising the first column shows our four `seed` values. The second column, `post`, might look odd. If you look back up at the code we used to make `sim`, you'll notice we fed the results from `sim_and_fit()` into the `brms::as_draws_df()` function. Each model object contained 4,000 posterior draws. There are four parameters in the model and, by **brms** default, the `as_draws_df()` function also returns a few other columns, such as a `.draw` index. Anyway, that's why you see each of the four rows of `post` containing `<S3: draws_df>`. When you have data frames nested within data frames, this is called a nested data structure. To unpack the contents of those four nested data frames, you simply call `unnest(sim, post)`. In the next code block, we'll do that within the context of a larger work flow designed to plot the results of each in a faceted coefficient plot. For data of this kind of structure, the `tidybayes::stat_pointinterval()` function will be particularly useful.
```{r, fig.width = 6.5, fig.height = 5, warning = F, message = F}
library(tidybayes)
sim %>%
unnest(post) %>%
pivot_longer(b_Intercept:sigma) %>%
mutate(seed = str_c("seed ", seed)) %>%
ggplot(aes(x = value, y = name)) +
stat_pointinterval(.width = .95, color = "forestgreen") +
labs(x = "posterior", y = NULL) +
theme(axis.text.y = element_text(hjust = 0),
panel.border = element_rect(color = "black", fill = "transparent"),
panel.grid.minor = element_blank(),
strip.text = element_text(hjust = 0)) +
facet_wrap(~ seed, ncol = 1)
```
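Before we interpret those results, a quick sanity check might reassure you that the `unnest()` step behaved as described: four models with 4,000 post-warmup draws apiece should yield 16,000 rows.
```{r, warning = F, message = F}
# 4 seeds x 4,000 draws = 16,000 rows after unnesting
sim %>% 
  unnest(post) %>% 
  nrow()
```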
Though the results varied across iterations, the overall pattern was massive uncertainty in the two $\beta$ parameters. You may be wondering,
> Did the posterior approximation work correctly?
>
> It did work correctly, and the posterior distribution here is the right answer to the question we asked. The problem is the question. Recall that a multiple linear regression answers the question: *What is the value of knowing each predictor, after already knowing all of the other predictors?* So in this case, the question becomes: *What is the value of knowing each leg's length, after already knowing the other leg's length?*
>
> The answer to this weird question is equally weird, but perfectly logical. (p. 164, *emphasis* in the original)
To further answer this weird question, we might start making the panels from Figure 6.2. Starting with the left panel, this is perhaps the simplest way to plot the bivariate posterior of our two predictor coefficients.
```{r, fig.width = 3, fig.height = 3}
pairs(b6.1, variable = variables(b6.1)[2:3])
```
If you'd like a nicer and more focused attempt, you might have to revert to the `as_draws_df()` function and a little **ggplot2** code.
```{r, fig.width = 3, fig.height = 3}
post <- as_draws_df(b6.1)
post %>%
ggplot(aes(x = b_leg_left, y = b_leg_right)) +
geom_point(color = "forestgreen", alpha = 1/10, size = 1/2)
```
While we're at it, you can make a similar plot with the `mcmc_scatter()` function [see @gabryPlottingMCMCDraws2022, [*Plotting MCMC draws using the bayesplot package*](https://CRAN.R-project.org/package=bayesplot/vignettes/plotting-mcmc-draws.html)].
```{r, fig.width = 3, fig.height = 3}
color_scheme_set("green")
post %>%
mcmc_scatter(pars = c("b_leg_left", "b_leg_right"),
size = 1/2,
alpha = 1/10)
```
But wow, those coefficients look about as highly correlated as the predictors, just with the reversed sign.
```{r, warning = F}
post %>%
summarise(rho = cor(b_leg_left, b_leg_right))
```
On page 165, McElreath clarified that from the perspective of **brms**, this model may as well be
\begin{align*}
y_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + (\beta_1 + \beta_2) x_i.
\end{align*}
Accordingly, here's the posterior of the sum of the two regression coefficients, Figure 6.2.b. We'll use `tidybayes::stat_halfeye()` to both plot the density and mark off the posterior median and percentile-based 95% probability intervals at its base.
```{r, fig.width = 3, fig.height = 3, warning = F, message = F}
post %>%
ggplot(aes(x = b_leg_left + b_leg_right, y = 0)) +
stat_halfeye(point_interval = median_qi,
fill = "steelblue", .width = .95) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Sum the multicollinear coefficients",
subtitle = "Marked by the median and 95% PIs")
```
Now we fit the revised model after ditching one of the leg lengths.
```{r b6.2}
b6.2 <-
brm(data = d,
family = gaussian,
height ~ 1 + leg_left,
prior = c(prior(normal(10, 100), class = Intercept),
prior(normal(2, 10), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.02")
```
```{r}
print(b6.2)
```
That posterior $\textit{SD}$ for `leg_left` looks much better. Compare this density to the one in Figure 6.2.b.
```{r, fig.width = 3, fig.height = 3}
as_draws_df(b6.2) %>%
ggplot(aes(x = b_leg_left, y = 0)) +
stat_halfeye(point_interval = median_qi,
fill = "steelblue", .width = .95) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Just one coefficient needed",
subtitle = "Marked by the median and 95% PIs",
x = "only b_leg_left, this time")
```
> The basic lesson is only this: When two predictor variables are very strongly correlated (conditional on other variables in the model), including both in a model may lead to confusion. The posterior distribution isn't wrong, in such cases. It's telling you that the question you asked cannot be answered with these data. And that's a great thing for a model to say, that it cannot answer your question. And if you are just interested in prediction, you'll find that this leg model makes fine predictions. It just doesn't make any claims about which leg is more important. (p. 166)
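To see that last point about prediction for ourselves, here's a quick check that isn't in the text, just a sketch with some ad hoc column names, comparing the fitted values from the two- and one-leg models with the `fitted()` method.
```{r}
# if prediction is all we care about, the two models are nearly interchangeable
bind_cols(
  fitted(b6.1) %>% data.frame() %>% transmute(both_legs = Estimate),
  fitted(b6.2) %>% data.frame() %>% transmute(one_leg = Estimate)
) %>% 
  summarise(rho = cor(both_legs, one_leg))
```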
### Multicollinear `milk`.
Multicollinearity arises in real data, too. Load the `milk` data.
```{r, message = F}
data(milk, package = "rethinking")
d <- milk
rm(milk)
```
Now use the handy `rethinking::standardize()` function to standardize our focal variables.
```{r}
d <-
d %>%
mutate(k = rethinking::standardize(kcal.per.g),
f = rethinking::standardize(perc.fat),
l = rethinking::standardize(perc.lactose))
```
We'll follow the text and fit the two univariable models, first. Note our use of the `update()` function.
```{r b6.3}
# k regressed on f
b6.3 <-
brm(data = d,
family = gaussian,
k ~ 1 + f,
prior = c(prior(normal(0, 0.2), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.03")
# k regressed on l
b6.4 <-
update(b6.3,
newdata = d,
formula = k ~ 1 + l,
seed = 6,
file = "fits/b06.04")
```
```{r}
posterior_summary(b6.3)[1:3, ] %>% round(digits = 2)
posterior_summary(b6.4)[1:3, ] %>% round(digits = 2)
```
Now "watch what happens when we place both predictor variables in the same regression model" (p. 167).
```{r b6.5}
b6.5 <-
update(b6.4,
newdata = d,
formula = k ~ 1 + f + l,
seed = 6,
file = "fits/b06.05")
```
Now the $\beta$'s are smaller and less certain.
```{r}
posterior_summary(b6.5)[1:3, ] %>% round(digits = 2)
```
Here's a quick `pairs()` plot.
```{r, fig.width = 4, fig.height = 4}
d %>%
select(kcal.per.g, perc.fat, perc.lactose) %>%
pairs(col = "forestgreen")
```
Now here's the custom **GGally** version.
```{r, fig.width = 4, fig.height = 4, message = F}
library(GGally)
# define a couple custom functions
my_diag <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density(fill = "steelblue", color = "black")
}
my_lower <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_smooth(method = "lm", color = "orange", se = F) +
geom_point(alpha = .8, size = 1/3, color = "blue")
}
# plug those custom functions into `ggpairs()`
ggpairs(data = d, columns = c(3:4, 6),
upper = list(continuous = wrap("cor", family = "sans", color = "black")),
diag = list(continuous = my_diag),
lower = list(continuous = my_lower))
```
Our two predictor "variables are negatively correlated, and so strongly so that they are nearly redundant. Either helps in predicting `kcal.per.g`, but neither helps much *once you already know the other*" (p. 169, *emphasis* in the original). You can really see that on the lower two scatter plots. You'll note the `ggpairs()` plot also returned the Pearson's correlation coefficients with their $p$-value stars ($^\star$). Later on in the ebook, we'll practice setting custom correlation functions which avoid displaying that dirty $p$-value stuff.
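If you'd like the exact number behind that "nearly redundant" claim, the raw correlation between the two predictors is easy to compute.
```{r}
# the near-redundancy, expressed as a single correlation
d %>% 
  summarise(rho = cor(perc.fat, perc.lactose))
```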
Anyway, making a DAG might help us make sense of this.
```{r, fig.width = 3, fig.height = 1.5, message = F, warning = F}
library(ggdag)
dag_coords <-
tibble(name = c("L", "D", "F", "K"),
x = c(1, 2, 3, 2),
y = c(2, 2, 2, 1))
dagify(L ~ D,
F ~ D,
K ~ L + F,
coords = dag_coords) %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = name == "D"),
alpha = 1/2, size = 6.5, show.legend = F) +
geom_point(x = 2, y = 2,
size = 6.5, shape = 1, stroke = 1, color = "orange") +
geom_dag_text(color = "black") +
geom_dag_edges() +
scale_color_manual(values = c("steelblue", "orange")) +
scale_x_continuous(NULL, breaks = NULL, expand = c(.1, .1)) +
scale_y_continuous(NULL, breaks = NULL, expand = c(.1, .1))
```
> The central tradeoff decides how dense, $D$, the milk needs to be. We haven't observed this variable, so it's shown circled. Then fat, $F$, and lactose, $L$, are determined. Finally, the composition of $F$ and $L$ determines the kilocalories, $K$. If we could measure $D$, or had an evolutionary and economic model to predict it based upon other aspects of a species, that would be better than stumbling through regressions. (p. 167)
#### Rethinking: Identification guaranteed; comprehension up to you.
If you come from a frequentist background, this may jar you, but
> technically speaking, *identifiability* is not a concern for Bayesian models. The reason is that as long as the posterior distribution is proper--which just means that it integrates to 1--then all of the parameters are identified. But this technical fact doesn't also mean that you can make sense of the posterior distribution. So it's probably better to speak of *weakly identified* parameters in a Bayesian context. (pp. 169--170, *emphasis* in the original)
#### Overthinking: Simulating collinearity.
First we'll get the data and define the functions. You'll note I've defined my `sim_coll()` a little differently from `sim.coll()` in the text. I've omitted `rep.sim.coll()` as an independent function altogether, but computed similar summary information with the `summarise()` code at the bottom of the block.
```{r, warning = F, message = F}
# define a custom function
sim_coll <- function(seed, rho) {
# simulate the data
set.seed(seed)
d <-
d %>%
mutate(x = rnorm(n(),
mean = perc.fat * rho,
sd = sqrt((1 - rho^2) * var(perc.fat))))
# fit an OLS model
m <- lm(kcal.per.g ~ perc.fat + x, data = d)
# extract the parameter SD
sqrt(diag(vcov(m)))[2]
}
# how many simulations per `rho`-value would you like?
n_seed <- 100
# how many `rho`-values from 0 to .99 would you like to evaluate the process over?
n_rho <- 30
d <-
crossing(seed = 1:n_seed,
rho = seq(from = 0, to = .99, length.out = n_rho)) %>%
mutate(parameter_sd = purrr::map2_dbl(seed, rho, sim_coll)) %>%
group_by(rho) %>%
# we'll `summarise()` our output by the mean and 95% intervals
summarise(mean = mean(parameter_sd),
ll = quantile(parameter_sd, prob = .025),
ul = quantile(parameter_sd, prob = .975))
```
We've added 95% interval bands to our version of Figure 5.10.
```{r, fig.width = 3.25, fig.height = 2.75}
d %>%
ggplot(aes(x = rho, y = mean)) +
geom_smooth(aes(ymin = ll, ymax = ul),
stat = "identity",
fill = "orange", color = "orange", alpha = 1/3, linewidth = 2/3) +
labs(x = expression(rho),
y = "parameter SD") +
coord_cartesian(ylim = c(0, .0072))
```
## Post-treatment bias
To understand the next example, it helped me to map out the sequence of events McElreath described in the second paragraph:
* seed and sprout plants
* measure heights
* apply different antifungal soil treatments (i.e., the experimental manipulation)
* measure (a) the heights and (b) the presence of fungus
Based on the design, let's simulate our data.
```{r}
# how many plants would you like?
n <- 100
set.seed(71)
d <-
tibble(h0 = rnorm(n, mean = 10, sd = 2),
treatment = rep(0:1, each = n / 2),
fungus = rbinom(n, size = 1, prob = .5 - treatment * 0.4),
h1 = h0 + rnorm(n, mean = 5 - 3 * fungus, sd = 1))
```
We'll use `head()` to peek at the data.
```{r}
d %>%
head()
```
And here's a quick summary with `tidybayes::mean_qi()`.
```{r}
d %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mean_qi(.width = .89) %>%
mutate_if(is.double, round, digits = 2)
```
If you want to make those cute mini histograms, go back and check out our custom `histospark()` code from [Chapter 4][Geocentric Models].
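In case you'd rather not flip back, here's a minimal sketch along the lines of that Chapter 4 code (the version there may differ in its details).
```{r}
# a small spark-histogram helper in the spirit of `histospark()`
sparks <- c("\u2581", "\u2582", "\u2583", "\u2585", "\u2587")

histospark <- function(x, width = 10) {
  # bin the data without plotting
  bins <- graphics::hist(x, breaks = width, plot = FALSE)
  # map the relative bin counts onto the spark characters
  factor <- cut(bins$counts / max(bins$counts),
                breaks = seq(0, 1, length = length(sparks) + 1),
                labels = sparks,
                include.lowest = TRUE)
  paste0(factor, collapse = "")
}

# e.g., histospark(d$h1)
```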
### A prior is born.
Let's take a look at the $p \sim \operatorname{Log-Normal}(0, 0.25)$ prior distribution.
```{r, fig.width = 6, fig.height = 3.25}
set.seed(6)
# simulate
sim_p <-
tibble(sim_p = rlnorm(1e4, meanlog = 0, sdlog = 0.25))
# wrangle
sim_p %>%
mutate(`exp(sim_p)` = exp(sim_p)) %>%
gather() %>%
# plot
ggplot(aes(x = value)) +
geom_density(fill = "steelblue") +
scale_x_continuous(breaks = c(0, .5, 1, 1.5, 2, 3, 5)) +
scale_y_continuous(NULL, breaks = NULL) +
coord_cartesian(xlim = c(0, 6)) +
theme(panel.grid.minor.x = element_blank()) +
facet_wrap(~ key, scale = "free_y", ncol = 1)
```
Summarize.
```{r}
sim_p %>%
mutate(`exp(sim_p)` = exp(sim_p)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mean_qi(.width = .89) %>%
mutate_if(is.double, round, digits = 2)
```
"This prior expects anything from 40% shrinkage up to 50% growth" (p. 172). So then, our initial statistical model will follow the form
\begin{align*}
h_{1i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = h_{0i} \times p \\
p & \sim \operatorname{Log-Normal}(0, 0.25) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
Let's fit that model.
```{r b6.6}
b6.6 <-
brm(data = d,
family = gaussian,
h1 ~ 0 + h0,
prior = c(prior(lognormal(0, 0.25), class = b, lb = 0),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.06")
```
Note our use of the `lb` argument to set the lower bound for the prior on $p$. Anyway, behold the summary.
```{r}
print(b6.6)
```
So then, the expectation is an increase of about `r round((fixef(b6.6)[1] - 1) * 100, 0)` percent relative to $h_0$. But this isn't the best model. We're leaving important predictors on the table. Our updated model follows the form
\begin{align*}
h_{1i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = h_{0,i} \times p \\
p & = \alpha + \beta_1 \text{treatment}_i + \beta_2 \text{fungus}_i \\
\alpha & \sim \operatorname{Log-Normal}(0, 0.25) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\beta_2 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
That is, the "proportion of growth $p$ is now a function of the predictor variables" (p. 172). Although we will fit the equivalent of McElreath's model in **brms**, I'm not aware that we can translate it directly into conventional **brms** syntax. But take a look at the critical two lines from above:
\begin{align*}
\mu_i & = h_{0,i} \times p \\
p & = \alpha + \beta_1 \text{treatment}_i + \beta_2 \text{fungus}_i. \\
\end{align*}
With just a little algebra, we can re-express that as
$$\mu_i = h_{0i} \times (\alpha + \beta_1 \text{treatment}_i + \beta_2 \text{fungus}_i).$$
With **brms**, we fit models like that using the [non-linear syntax](https://CRAN.R-project.org/package=brms/vignettes/brms_nonlinear.html) [@Bürkner2022Non_linear], which we briefly introduced in [Section 4.4.2.1][Overthinking: Logs and exps, oh my.] and [Section 5.3.2][Many categories.]. Yes friends, it's now time we discuss the non-linear **brms** syntax in detail. But first, here's what it looks like for `b6.7`.
```{r b6.7}
b6.7 <-
brm(data = d,
family = gaussian,
bf(h1 ~ h0 * (a + t * treatment + f * fungus),
a + t + f ~ 1,
nl = TRUE),
prior = c(prior(lognormal(0, 0.2), nlpar = a, lb = 0),
prior(normal(0, 0.5), nlpar = t),
prior(normal(0, 0.5), nlpar = f),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.07")
```
To explain what's going on with our `formula` syntax, it's probably best to quote Bürkner's [-@Bürkner2022Non_linear] [vignette](https://CRAN.R-project.org/package=brms/vignettes/brms_nonlinear.html) at length:
> When looking at the above code, the first thing that becomes obvious is that we changed the `formula` syntax to display the non-linear formula including predictors (i.e., [`h0`, `treatment`, and `fungus`]) and parameters (i.e., [`a`, `t`, and `f`]) wrapped in a call to [the `bf()` function]. This stands in contrast to classical **R** formulas, where only predictors are given and parameters are implicit. The argument [`a + t + f ~ 1`] serves two purposes. First, it provides information, which variables in `formula` are parameters, and second, it specifies the linear predictor terms for each parameter. In fact, we should think of non-linear parameters as placeholders for linear predictor terms rather than as parameters themselves (see also the following examples). In the present case, we have no further variables to predict [`a`, `t`, and `f`] and thus we just fit intercepts that represent our estimates of [$\alpha$, $t$, and $f$]. The formula [`a + t + f ~ 1`] is a short form of [`a ~ 1, t ~ 1, f ~ 1`] that can be used if multiple non-linear parameters share the same formula. Setting `nl = TRUE` tells **brms** that the formula should be treated as non-linear.
>
> In contrast to generalized linear models, priors on population-level parameters (i.e., 'fixed effects') are often mandatory to identify a non-linear model. Thus, **brms** requires the user to explicitly specify these priors. In the present example, we used a [`lognormal(0, 0.2)` prior on (the population-level intercept of) `a`, while we used a `normal(0, 0.5)` prior on both (population-level intercepts of) `t` and `f`]. Setting priors is a non-trivial task in all kinds of models, especially in non-linear models, so you should always invest some time to think of appropriate priors. Quite often, you may be forced to change your priors after fitting a non-linear model for the first time, when you observe different MCMC chains converging to different posterior regions. This is a clear sign of an identification problem and one solution is to set stronger (i.e., more narrow) priors. (**emphasis** in the original)
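To make that shorthand concrete, here's the same `b6.7` formula with one intercept-only formula per non-linear parameter. We won't fit it again; this just shows the equivalence.
```{r}
# the long form of the `b6.7` formula; `a + t + f ~ 1` collapses the three middle lines
bf(h1 ~ h0 * (a + t * treatment + f * fungus),
   a ~ 1,
   t ~ 1,
   f ~ 1,
   nl = TRUE)
```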
Let's see what we've done.
```{r}
print(b6.7)
```
All in all, it looks like we did a good job matching up McElreath's results. The posterior doesn't, however, match up well with the way we generated the data...
### Blocked by consequence.
To measure the treatment effect properly, we should omit `fungus` from the model. This leaves us with the equation
\begin{align*}
h_{1i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\
\mu_i & = h_{0i} \times (\alpha + \beta_1 \text{treatment}_i) \\
\alpha & \sim \operatorname{Log-Normal}(0, 0.25) \\
\beta_1 & \sim \operatorname{Normal}(0, 0.5) \\
\sigma & \sim \operatorname{Exponential}(1).
\end{align*}
Fit this model with the non-linear syntax, too.
```{r b6.8}
b6.8 <-
brm(data = d,
family = gaussian,
bf(h1 ~ h0 * (a + t * treatment),
a + t ~ 1,
nl = TRUE),
prior = c(prior(lognormal(0, 0.2), nlpar = a, lb = 0),
prior(normal(0, 0.5), nlpar = t),
prior(exponential(1), class = sigma)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 6,
file = "fits/b06.08")
```
Did we do better?
```{r}
print(b6.8)
```
Yes, now we have a positive treatment effect.
### Fungus and $d$-separation.
Let's make a DAG.
```{r, fig.width = 4, fig.height = 1.5, message = F, warning = F}
# define our coordinates
dag_coords <-
tibble(name = c("H0", "T", "F", "H1"),
x = c(1, 5, 4, 3),
y = c(2, 2, 1.5, 1))
# save our DAG
dag <-
dagify(F ~ T,
H1 ~ H0 + F,
coords = dag_coords)
# plot
dag %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "steelblue", alpha = 1/2, size = 6.5) +
geom_dag_text(color = "black") +
geom_dag_edges() +
theme_dag()
```
We'll be making a lot of simple DAGs following this format in this chapter. To streamline our plotting code, let's make a custom plotting function. I'll call it `gg_simple_dag()`.
```{r, fig.width = 4, fig.height = 1.5, message = F, warning = F}
gg_simple_dag <- function(d) {
d %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(color = "steelblue", alpha = 1/2, size = 6.5) +
geom_dag_text(color = "black") +
geom_dag_edges() +
theme_dag()
}
# try it out!
dag %>%
gg_simple_dag()
```
Anyway, the DAG clarifies
> that learning the treatment tells us nothing about the outcome, once we know the fungus status.
>
> An even more DAG way to say this is that conditioning on $F$ induces **d-separation**. The "d" stands for *directional*. D-separation means that some variables on a directed graph are independent of others. There is no path connecting them. In this case, $H_1$ is d-separated from $T$, but only when we condition on $F$. Conditioning on $F$ effectively blocks the directed path $T \rightarrow F \rightarrow H_1$, making $T$ and $H_1$ independent (d-separated). (p. 174, **emphasis** in the original)
Note that our **ggdag** object, `dag`, will also work with the `dagitty::dseparated()` function.
```{r, message = F, warning = F}
library(dagitty)
dag %>%
dseparated("T", "H1")
dag %>%
dseparated("T", "H1", "F")
```
The descriptively named `dagitty::impliedConditionalIndependencies()` function will work, too.
```{r}
impliedConditionalIndependencies(dag)
```
Now consider a DAG of a different kind of causal structure.
```{r, fig.width = 4, fig.height = 1.5, message = F, warning = F}
# define our coordinates
dag_coords <-
tibble(name = c("H0", "H1", "M", "F", "T"),
x = c(1, 2, 2.5, 3, 4),
y = c(2, 2, 1, 2, 2))
# save our DAG
dag <-
dagify(F ~ M + T,
H1 ~ H0 + M,
coords = dag_coords)
# plot
dag %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = name == "M"),
alpha = 1/2, size = 6.5, show.legend = F) +
geom_point(x = 2.5, y = 1,
size = 6.5, shape = 1, stroke = 1, color = "orange") +
geom_dag_text(color = "black") +
geom_dag_edges() +
scale_color_manual(values = c("steelblue", "orange")) +
theme_dag()
```
Our custom `gg_simple_dag()` was a little too brittle to accommodate DAGs that mark off unobserved variables. Since we'll be making a few more DAGs of this kind, we'll make one more custom plotting function. We'll call this one `gg_fancy_dag()`.
```{r, fig.width = 4, fig.height = 1.5, message = F, warning = F}
gg_fancy_dag <- function(d, x = 1, y = 1, circle = "U") {
d %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_point(aes(color = name == circle),
alpha = 1/2, size = 6.5, show.legend = F) +
geom_point(x = x, y = y,
size = 6.5, shape = 1, stroke = 1, color = "orange") +
geom_dag_text(color = "black") +
geom_dag_edges() +
scale_color_manual(values = c("steelblue", "orange")) +
theme_dag()
}
# check it out
dag %>%
gg_fancy_dag(x = 2.5, y = 1, circle = "M")
```
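Just like before, we can ask **dagitty** which conditional independencies this new DAG implies. Keep in mind we haven't told `dagify()` that $M$ is unobserved, so it's treated like any other node here.
```{r}
impliedConditionalIndependencies(dag)
```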
Following McElreath's **R** code 6.20, here we simulate some data based on the new DAG.
```{r}
set.seed(71)
n <- 1000
d2 <-
tibble(h0 = rnorm(n, mean = 10, sd = 2),
treatment = rep(0:1, each = n / 2),
m = rbinom(n, size = 1, prob = .5),
fungus = rbinom(n, size = 1, prob = .5 - treatment * 0.4 + 0.4 * m),
h1 = h0 + rnorm(n, mean = 5 + 3 * m, sd = 1))
head(d2)
```
Use `update()` to refit `b6.7` and `b6.8` to the new data.
```{r b6.7b, warning = F, message = F}
b6.7b <-
update(b6.7,
newdata = d2,
seed = 6,
file = "fits/b06.07b")
b6.8b <-
update(b6.8,
newdata = d2,
seed = 6,
file = "fits/b06.08b")
```
Check the results.
```{r}
posterior_summary(b6.7b)[1:4, ] %>% round(digits = 2)
posterior_summary(b6.8b)[1:3, ] %>% round(digits = 2)
```
"Including fungus again confounds inference about the treatment, this time by making it seem like it helped the plants, even though it had no effect" (p. 175).
#### Rethinking: Model selection doesn't help.
> In the next chapter, you'll learn about model selection using information criteria. Like other model comparison and selection schemes, these criteria help in contrasting and choosing model structure. But such approaches are no help in the example presented just above, since the model that includes fungus both fits the sample better and would make better out-of-sample predictions. Model [`b6.7`] misleads because it asks the wrong question, not because it would make poor predictions. As argued in Chapter 1, prediction and causal inference are just not the same task. No statistical procedure can substitute for scientific knowledge and attention to it. We need multiple models because they help us understand causal paths, not just so we can choose one or another for prediction. (p. 175)
Brutal.
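If you'd like to confirm that for yourself, here's a peek ahead at the kind of comparison we'll cover in the next chapter. It's left unevaluated here; treat it as a sketch.
```{r, eval = F}
# the LOO information criterion favors the model with fungus for prediction,
# even though that model misleads us about the treatment
b6.7 <- add_criterion(b6.7, criterion = "loo")
b6.8 <- add_criterion(b6.8, criterion = "loo")
loo_compare(b6.7, b6.8)
```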
## Collider bias
Make the collider bias DAG of the trustworthiness/newsworthiness example.
```{r, fig.width = 3, fig.height = 1}
dag_coords <-
tibble(name = c("T", "S", "N"),
x = 1:3,
y = 1)
dagify(S ~ T + N,
coords = dag_coords) %>%
gg_simple_dag()
```
> The fact that two arrows enter $S$ means it is a **collider**. The core concept is easy to understand: When you condition on a collider, it creates statistical--but not necessarily causal--associations among its causes. In this case, once you learn that a proposal has been selected ($S$), then learning its trustworthiness ($T$) also provides information about its newsworthiness ($N$). Why? Because if, for example, a selected proposal has low trustworthiness, then it must have high newsworthiness. Otherwise it wouldn’t have been funded. The same works in reverse: If a proposal has low newsworthiness, we’d infer that it must have higher than average trustworthiness. Otherwise it would not have been selected for funding. (p. 176, **emphasis** in the original)
### Collider of false sorrow.
All it takes is a single `mutate()` line in the `dagify()` function to amend our previous DAG.
```{r, fig.width = 3, fig.height = 1}
dagify(M ~ H + A,
coords = dag_coords %>%
mutate(name = c("H", "M", "A"))) %>%
gg_simple_dag()
```
In this made-up example,
> happiness ($H$) and age ($A$) both cause marriage ($M$). Marriage is therefore a collider. Even though there is no causal association between happiness and age, if we condition on marriage--which means here, if we include it as a predictor in a regression--then it will induce a statistical association between age and happiness. And this can mislead us to think that happiness changes with age, when in fact it is constant.
>
> To convince you of this, let's do another simulation. (pp. 176--177)
```{r, eval = F, echo = F}
rethinking::sim_happiness
```
McElreath simulated the data for this section using his custom `rethinking::sim_happiness()` function. If you'd like to see the guts of the function, execute `rethinking::sim_happiness`. Our approach will be to simulate the data from the ground up. The workflow to follow is based on help from the great [Randall Pruim](https://github.com/rpruim); I was initially stumped and he [lent a helping hand](https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed/issues/26). The first step is to make a simple `new_borns()` function, which returns a tibble with `n` unmarried one-year-olds who have different levels of happiness. We'll set the default for `n` at `20`.
```{r}
new_borns <- function(n = 20) {
tibble(a = 1, # 1 year old
m = 0, # not married
h = seq(from = -2, to = 2, length.out = n)) # range of happiness scores
}
```
Here's how it works.
```{r}
new_borns()
```
The second step is to make another custom function, `update_population()`, which takes the input from `new_borns()`. This function will age up the simulated one-year-olds from `new_borns()`, add another cohort of `new_borns()`, and append the cohorts. As you iterate, the initial cohort of `new_borns()` will eventually hit the age of 18, which is also the age they're first eligible to marry (`aom = 18`). At that age and up, the happier people are more likely to get married than the less happy folks. You'll also see that our simulation follows McElreath's in that we remove people from the population after the age of 65. `r emo::ji("shrug")`
```{r}
update_population <- function(pop, n_births = 20, aom = 18, max_age = 65) {
pop %>%
mutate(a = a + 1, # everyone gets one year older
# some people get married
m = ifelse(m >= 1, 1, (a >= aom) * rbinom(n(), 1, rethinking::inv_logit(h - 4)))) %>%
filter(a <= max_age) %>% # old people die
bind_rows(new_borns(n_births)) # new people are born
}
```
Here's what it looks like if we start with an initial `new_borns()` and pump them into `update_population()`.
```{r}
new_borns() %>%
update_population()
```
For our final step, we run the population simulation for 1,000 years. On my 2-year-old laptop, this took less than a minute.
```{r}
# this was McElreath's seed
set.seed(1977)
# year 1
d <- new_borns(n = 20)
# years 2 through 1000
for(i in 2:1000) {
d <- update_population(d, n_births = 20, aom = 18, max_age = 65)
}
# now rename()
d <-
d %>%
rename(age = a, married = m, happiness = h)
# take a look
glimpse(d)
```
Summarize the variables.
```{r}
d %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mean_qi(value) %>%
mutate_if(is.double, round, digits = 2)
```