---
title: "`spatialRF`: Easy Spatial Regression with Random Forest"
output:
github_document:
toc: true
toc_depth: 2
pandoc_args: --webtex
always_allow_html: yes
---
<!---
[![R-CMD-check](https://github.com/BlasBenito/spatialRF/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/BlasBenito/spatialRF/actions/workflows/R-CMD-check.yaml)
-->
<!-- badges: start -->
[![Devel-version](https://img.shields.io/badge/devel%20version-1.1.3-blue.svg)](https://github.com/blasbenito/spatialRF) [![lifecycle](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![License](https://img.shields.io/badge/license-GPL--3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.en.html) [![DOI](https://zenodo.org/badge/330962704.svg)](https://zenodo.org/badge/latestdoi/330962704)[![CRAN status](https://www.r-pkg.org/badges/version/spatialRF)](https://cran.r-project.org/package=spatialRF)[![CRAN\_Download\_Badge](http://cranlogs.r-pkg.org/badges/grand-total/spatialRF)](https://CRAN.R-project.org/package=spatialRF)
<!-- badges: end -->
# Introduction
The package **spatialRF** facilitates fitting spatial regression models on regular or irregular data with Random Forest. It does so by generating *spatial predictors* that help the model "understand" the spatial structure of the training data with the end goal of minimizing the spatial autocorrelation of the model residuals and offering honest variable importance scores.
Two main methods to generate *spatial predictors* from the distance matrix of the data points are implemented in the package:
- Moran's Eigenvector Maps [(Dray, Legendre, and Peres-Neto 2006)](https://www.sciencedirect.com/science/article/abs/pii/S0304380006000925).
- Distance matrix columns as explanatory variables [(Hengl et al. 2018)](https://peerj.com/articles/5518/).
The package is designed to minimize the code required to fit a spatial model from a training dataset, the names of the response and the predictors, and a distance matrix, as shown below.
```{r, eval=FALSE}
spatial.model <- spatialRF::rf_spatial(
  data = your_dataframe,
  dependent.variable.name = "your_response_variable",
  predictor.variable.names = c("predictor1", "predictor2", ..., "predictorN"),
  distance.matrix = your_distance_matrix
)
```
**spatialRF** uses the fast and efficient `ranger` package under the hood [(Wright and Ziegler 2017)](https://arxiv.org/abs/1508.04409), so please, cite the `ranger` package when using `spatialRF`!
This package also provides tools to identify potentially interesting variable interactions, tune random forest hyperparameters, assess model performance on spatially independent data folds, and examine the resulting models via importance plots, response curves, and response surfaces.
# Development
This package is reaching its final form, and big changes are not expected at this stage. However, it has many functions, and even though all of them have been tested, only one dataset has been used in those tests, so you will likely find bugs. If you have time to report them, please do so in any of the following ways:
+ Open a new issue in the [Issues GitHub page of the package](https://github.com/BlasBenito/spatialRF/issues).
+ Send me an email explaining the issue and the error messages with enough detail at blasbenito at gmail dot com.
+ Send a direct message to [my twitter account](https://twitter.com/blasbenito) explaining the issue.
I will do my best to solve any issues ASAP!
# Applications
The goal of `spatialRF` is to help fit *explanatory spatial regressions*, where the target is to understand how a set of predictors and the spatial structure of the data influence the response variable. Therefore, the spatial analyses implemented in the package can be applied to any spatial dataset, regular or irregular, with a sample size between \~100 and \~5000 cases (the higher end will depend on the available RAM), a quantitative or binary (values 0 and 1) response variable, and a more or less large set of predictive variables.
All functions but `rf_spatial()` also work with non-spatial data if the arguments `distance.matrix` and `distance.thresholds` are not provided. In that case, the number of training cases is no longer limited by the size of the distance matrix, and models can be trained with hundreds of thousands of rows, but the spatial autocorrelation of the model's residuals is not assessed.
However, **when the focus is on fitting spatial models**, and due to the nature of the *spatial predictors* used to represent the spatial structure of the training data, **there are many things this package cannot do**:
- Predict model results over raster data.
- Predict a model result over another region with a different spatial structure.
- Work with "big data", whatever that means.
- Imputation or extrapolation (it can be done, but models based on spatial predictors are hardly transferable).
- Take temporal autocorrelation into account (but this is something that might be implemented later on).
If after considering these limitations you are still interested, read on: the sections below show how the package works.
# Citation
There is a paper in the making about this package. In the meantime, if you find it useful for your academic work, please cite the `ranger` package as well; it is the true core of `spatialRF`!
*Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01*
*Blas M. Benito (2021). spatialRF: Easy Spatial Regression with Random Forest. R package version 1.1.0. doi: 10.5281/zenodo.4745208. url: https://blasbenito.github.io/spatialRF/*
# Install
Version 1.1.3 can be installed from CRAN:
```{r, message=FALSE, error=FALSE, warning=FALSE}
install.packages("spatialRF")
```
The package can also be installed from GitHub as follows. There are several branches in the repository:
+ `main`: latest stable version (1.1.0 currently).
+ `development`: development version, usually very broken.
+ `v.1.0.9` to `v.1.1.2`: archived versions.
```{r, message = FALSE, eval = FALSE}
remotes::install_github(
  repo = "blasbenito/spatialRF",
  ref = "main",
  force = TRUE,
  quiet = TRUE
)
```
There are a few other libraries that will be useful during this tutorial.
```{r, message = FALSE}
library(spatialRF)
library(kableExtra)
library(rnaturalearth)
library(rnaturalearthdata)
library(tidyverse)
library(randomForestExplainer)
library(pdp)
```
# Data requirements
The data required to fit random forest models with `spatialRF` must fulfill several conditions:
+ **The input format is data.frame**. At the moment, tibbles are not fully supported.
+ **The number of rows must be somewhere between 100 and ~5000**, at least if your target is fitting spatial models. This limitation comes from the fact that the distance matrix grows very fast with an increasing number of training records, so for large datasets, there might not be enough RAM in your machine.
+ **The number of predictors should be larger than 3**. Fitting a Random Forest model is moot otherwise.
+ **Factors in the response or the predictors are not explicitly supported in the package**. They may work, or they won't, but in any case, I designed this package for quantitative data alone. However, binary responses with values 0 and 1 are partially supported.
+ **Must be free of `NA`**. You can check if there are NA records with `sum(apply(df, 2, is.na))`. If the result is larger than 0, then just execute `df <- na.omit(df)` to remove rows with empty cells.
+ **Columns cannot have zero variance**. This condition can be checked with `apply(df, 2, var) == 0`. Columns yielding TRUE should be removed.
+ **Columns must not yield `NaN` or `Inf` when scaled**. You can check each condition with `sum(apply(scale(df), 2, is.nan))` and `sum(apply(scale(df), 2, is.infinite))`. If either sum is higher than 0, you can find the offending columns with `sapply(as.data.frame(scale(df)), function(x)any(is.nan(x)))` and `sapply(as.data.frame(scale(df)), function(x)any(is.infinite(x)))`. Any column yielding `TRUE` will generate issues when fitting models with `spatialRF`.
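The checks above can be gathered into a small ad-hoc helper. This is just a convenience sketch for a complete numeric data frame; `check_df()` is my own name, not a `spatialRF` function.

```{r, eval=FALSE}
#ad-hoc helper (check_df is NOT a spatialRF function): runs the
#sanity checks described above on a numeric data frame and reports
#offending rows and columns
check_df <- function(df){
  scaled <- as.data.frame(scale(df))
  list(
    rows.with.na = sum(!complete.cases(df)),
    zero.variance.columns = names(df)[apply(df, 2, var) == 0],
    nan.when.scaled = names(scaled)[sapply(scaled, function(x) any(is.nan(x)))],
    inf.when.scaled = names(scaled)[sapply(scaled, function(x) any(is.infinite(x)))]
  )
}
check_df(mtcars)
```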
# Example data
The package includes an example dataset that fulfills the conditions mentioned above, named [`plant_richness_df`](https://blasbenito.github.io/spatialRF/reference/plant_richness_df.html). It is a data frame with plant species richness and predictors for 227 ecoregions in the Americas, and a distance matrix among the ecoregion edges named, well, [`distance_matrix`](https://blasbenito.github.io/spatialRF/reference/distance_matrix.html).
The package follows a convention throughout functions:
+ The argument `data` requires a training data frame.
+ The argument `dependent.variable.name` is the column name of the response variable.
+ The argument `predictor.variable.names` contains the column names of the predictors.
+ The argument `xy` takes a data frame or matrix with two columns named "x" and "y", in that order, with the case coordinates.
+ The argument `distance.matrix` requires a matrix of distances between the cases in `data`.
+ The argument `distance.thresholds` is a numeric vector of distances at which the spatial autocorrelation is to be computed.
It is therefore convenient to define these arguments at the beginning of the workflow.
```{r}
#loading training data and distance matrix from the package
data(plant_richness_df)
data(distance_matrix)
#names of the response variable and the predictors
dependent.variable.name <- "richness_species_vascular"
predictor.variable.names <- colnames(plant_richness_df)[5:21]
#coordinates of the cases
xy <- plant_richness_df[, c("x", "y")]
#distance matrix
distance.matrix <- distance_matrix
#distance thresholds (same units as distance_matrix)
distance.thresholds <- c(0, 1000, 2000, 4000, 8000)
#random seed for reproducibility
random.seed <- 1
```
The response variable of `plant_richness_df` is "richness_species_vascular", which represents the total count of vascular plant species found in each ecoregion. The figure below shows the centroids of each ecoregion along with their associated value of the response variable.
```{r, echo=TRUE, message=FALSE, warning=FALSE, fig.width=6, fig.height=5.5}
world <- rnaturalearth::ne_countries(
scale = "medium",
returnclass = "sf"
)
ggplot2::ggplot() +
ggplot2::geom_sf(
data = world,
fill = "white"
) +
ggplot2::geom_point(
data = plant_richness_df,
ggplot2::aes(
x = x,
y = y,
color = richness_species_vascular
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(
direction = -1,
option = "F"
) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Plant richness") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Plant richness of the American ecoregions") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
```
The predictors (columns 5 to 21) represent diverse factors that may influence plant richness such as sampling bias, the area of the ecoregion, climatic variables, human presence and impact, topography, geographical fragmentation, and features of the neighbors of each ecoregion. The figure below shows the scatterplots of the response variable (y axis) against each predictor (x axis).
**Note:** Every plotting function in the package now allows changing the colors of their main features via specific arguments such as `point.color`, `line.color`, or `fill.color`.
```{r, echo=TRUE, fig.width=10, fig.height=14}
spatialRF::plot_training_df(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
ncol = 3,
point.color = viridis::viridis(100, option = "F"),
line.color = "gray30"
)
```
The function [`plot_training_df_moran()`](https://blasbenito.github.io/spatialRF/reference/plot_training_df_moran.html) helps assess the spatial autocorrelation of the response variable and the predictors across different distance thresholds. A low Moran's I and a p-value equal to or larger than 0.05 indicate that there is no spatial autocorrelation for the given variable and distance threshold.
```{r, echo=TRUE, fig.width=8, fig.height=5}
spatialRF::plot_training_df_moran(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance.matrix,
distance.thresholds = distance.thresholds,
fill.color = viridis::viridis(
100,
option = "F",
direction = -1
),
point.color = "gray40"
)
```
# Reducing multicollinearity in the predictors
The functions [`auto_cor()`](https://blasbenito.github.io/spatialRF/reference/auto_cor.html) and [`auto_vif()`](https://blasbenito.github.io/spatialRF/reference/auto_vif.html) help reduce redundancy in the predictors by using different criteria (bivariate R squared vs. [variance inflation factor](https://www.statisticshowto.com/variance-inflation-factor/)), while allowing the user to define an *order of preference*, which can be based either on domain expertise or on a quantitative assessment (e.g., order of preference based on variable importance scores or model coefficients). The preference order is defined as a character vector in the `preference.order` argument of both functions, and does not need to include the names of all predictors, but just the ones the user would like to keep in the analysis.
Notice that I have set `cor.threshold` and `vif.threshold` to low values because the predictors in `plant_richness_df` already show little multicollinearity. The default values (`cor.threshold = 0.75` and `vif.threshold = 5`), combined, should work well for most other sets of predictors.
```{r}
preference.order <- c(
"climate_bio1_average_X_bias_area_km2",
"climate_aridity_index_average",
"climate_hypervolume",
"climate_bio1_average",
"climate_bio15_minimum",
"bias_area_km2"
)
predictor.variable.names <- spatialRF::auto_cor(
x = plant_richness_df[, predictor.variable.names],
cor.threshold = 0.6,
preference.order = preference.order
) %>%
spatialRF::auto_vif(
vif.threshold = 2.5,
preference.order = preference.order
)
```
The output of `auto_cor()` or `auto_vif()` has the class "variable_selection", which can be used as input in every function having the argument `predictor.variable.names`.
```{r}
names(predictor.variable.names)
```
The slot `selected.variables` contains the names of the selected predictors.
```{r}
predictor.variable.names$selected.variables
```
# Finding promising variable interactions
Random Forest already takes into account variable interactions of the form "variable `a` becomes important when `b` is higher than x". However, it can also take advantage of two other interaction types: interactions of the form `a * b`, defined across the complete ranges of the predictors as they are commonly used in regression models, and "interactions" (not the proper name, but used here for simplicity) represented by the first component of a PCA on the predictors `a` and `b`.
The function [`the_feature_engineer()`](https://blasbenito.github.io/spatialRF/reference/the_feature_engineer.html) tests all possible interactions of both types among the most important predictors, and suggests the ones that are not correlated among themselves or with the other predictors, and that induce an increase in the model's R squared (or AUC when the response is binary) on independent data via spatial cross-validation (see `rf_evaluate()`).
```{r, fig.width=13, fig.height=6}
interactions <- spatialRF::the_feature_engineer(
  data = plant_richness_df,
  dependent.variable.name = dependent.variable.name,
  predictor.variable.names = predictor.variable.names,
  xy = xy,
  importance.threshold = 0.50, #uses 50% best predictors
  cor.threshold = 0.60, #max corr between interactions and predictors
  seed = random.seed,
  repetitions = 100,
  verbose = TRUE
)
```
```
The upper panel in the plot above shows the relationship between each interaction and the response variable. It also indicates the gain in R squared (+R2), the importance, in percentage, when used in a model alongside the other predictors (Imp. (%)), and the maximum Pearson correlation of the interaction with the predictors. The violin plot shows a comparison, made via spatial cross-validation with 100 repetitions, of the model with and without the selected interaction (see [`rf_evaluate()`](https://blasbenito.github.io/spatialRF/reference/rf_evaluate.html) and [`rf_compare()`](https://blasbenito.github.io/spatialRF/reference/rf_compare.html) for further details).
The function also returns a data frame with all the interactions considered. The columns are:
+ `interaction.name`: Interactions computed via multiplication are named `a..x..b`, while interactions computed via PCA are named `a..pca..b`.
+ `interaction.importance`: Importance of the interaction expressed as a percentage. If `interaction.importance == 100`, that means that the interaction is the most important predictor in the model fitted with the interaction and the predictors named in `predictor.variable.names`.
+ `interaction.metric.gain`: Difference in R squared (or AUC for models fitting a binary response) between a model with and a model without the interaction.
+ `max.cor.with.predictors`: The maximum Pearson correlation of the interaction with the predictors named in `predictor.variable.names`. Gives an idea of the amount of multicollinearity the interaction introduces in the model.
+ `variable.a.name` and `variable.b.name`: Names of the predictors involved in the interaction.
+ `selected`: `TRUE` if the interaction fulfills the selection criteria (importance higher than a threshold, positive gain in R squared or AUC, and Pearson correlation with other predictors lower than a threshold). The selected interactions have a correlation among themselves always lower than the value of the argument `cor.threshold`.
```{r, echo=TRUE}
kableExtra::kbl(
head(interactions$screening, 10),
format = "html"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
The function returns a data frame with the response variable, the predictors, and the selected interactions, which can be used right away as a training data frame. However, the function cannot say whether an interaction *makes sense*; it is up to the user to choose wisely whether to keep it or not. In this particular case, and just for the sake of simplicity, we will use the resulting data frame as training data.
```{r}
#adding interaction column to the training data
plant_richness_df <- interactions$data
#adding interaction name to predictor.variable.names
predictor.variable.names <- interactions$predictor.variable.names
```
# Fitting a non-spatial Random Forest model with `rf()`
The function [`rf()`](https://blasbenito.github.io/spatialRF/reference/rf.html) is a convenient wrapper for `ranger::ranger()` used in every modelling function of the *spatialRF* package. It takes the training data, the names of the response and the predictors, and optionally (to assess the spatial autocorrelation of the residuals), the distance matrix, and a vector of distance thresholds (in the same units as the distances in **distance_matrix**).
These distance thresholds are the neighborhoods at which the model will check the spatial autocorrelation of the residuals. Their values may depend on the spatial scale of the data, and the ecological system under study.
Notice that here I plug the object `predictor.variable.names`, output of `auto_cor()` and `auto_vif()`, directly into the `predictor.variable.names` argument of the `rf()` function to fit a random forest model.
```{r}
model.non.spatial <- spatialRF::rf(
  data = plant_richness_df,
  dependent.variable.name = dependent.variable.name,
  predictor.variable.names = predictor.variable.names,
  distance.matrix = distance.matrix,
  distance.thresholds = distance.thresholds,
  xy = xy, #not needed by rf, but other functions read it from the model
  seed = random.seed,
  verbose = FALSE
)
```
```
The output is a list with several slots containing the information required to interpret the model. The information available in these slots can be plotted (functions named `plot_...()`), printed to screen (`print_...()`) and captured for further analyses (`get_...()`).
## Residuals
The slot **residuals** (`model.non.spatial$residuals`) stores the values of the residuals and the results of the normality and spatial autocorrelation tests, and its content can be plotted with [`plot_residuals_diagnostics()`](https://blasbenito.github.io/spatialRF/reference/plot_residuals_diagnostics.html).
```{r, fig.width=6, fig.height=7}
spatialRF::plot_residuals_diagnostics(
model.non.spatial,
verbose = FALSE
)
```
The upper panels show the results of the normality test (interpretation in the title), the middle panel shows the relationship between the residuals and the fitted values, important to understand the behavior of the residuals, and the lower panel shows the Moran's I of the residuals across distance thresholds and their respective p-values (positive for 0 and 1000 km).
## Variable importance
### Global variable importance
The slot **importance** (`model.non.spatial$variable.importance`) contains the variable importance scores. These can be plotted with [`plot_importance()`](https://blasbenito.github.io/spatialRF/reference/plot_importance.html), printed with [`print_importance()`](https://blasbenito.github.io/spatialRF/reference/print_importance.html), and the dataframe retrieved with [`get_importance()`](https://blasbenito.github.io/spatialRF/reference/get_importance.html)
```{r, fig.width = 6, fig.height=4.5}
spatialRF::plot_importance(
model.non.spatial,
verbose = FALSE
)
```
Variable importance represents the increase in mean error (computed on the out-of-bag data) across trees when a predictor is permuted. Values lower than zero would indicate that the variable performs worse than a random one.
If the argument `scaled.importance = TRUE` is used, the variable importance scores are computed from the scaled data, making the importance scores easier to compare across different models.
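To make the permutation idea concrete, here is a minimal base-R sketch using a plain linear model on `mtcars` as a stand-in. It illustrates the principle only, not the out-of-bag computation that `ranger` performs internally: permuting an important predictor should increase the prediction error more than permuting an unimportant one.

```{r, eval=FALSE}
#minimal sketch of permutation importance with lm on mtcars;
#NOT how ranger computes it internally (ranger uses out-of-bag data)
set.seed(1)
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
baseline.error <- mean((mtcars$mpg - predict(model, mtcars))^2)

importance <- sapply(
  c("wt", "hp", "qsec"),
  function(variable){
    permuted <- mtcars
    permuted[[variable]] <- sample(permuted[[variable]])
    #increase in mean squared error after permuting the variable
    mean((mtcars$mpg - predict(model, permuted))^2) - baseline.error
  }
)

sort(importance, decreasing = TRUE)
```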
The package [`randomForestExplainer`](https://github.com/ModelOriented/randomForestExplainer) offers a couple of interesting options to deepen our understanding on variable importance scores. The first one is `measure_importance()`, which analyzes the forest to find out the average minimum tree depth at which each variable can be found (`mean_min_depth`), the number of nodes in which a variable was selected to make a split (`no_of_nodes`), the number of times the variable was selected as the first one to start a tree (`times_a_root`), and the probability of a variable to be in more nodes than what it would be expected by chance (`p_value`).
```{r}
importance.df <- randomForestExplainer::measure_importance(
model.non.spatial,
measures = c("mean_min_depth", "no_of_nodes", "times_a_root", "p_value")
)
```
```{r, echo=TRUE}
kableExtra::kbl(
importance.df %>%
dplyr::arrange(mean_min_depth) %>%
dplyr::mutate(p_value = round(p_value, 4)),
format = "html"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
### Contribution of predictors to model transferability
The new function `rf_importance()` offers a way to assess to what extent each predictor contributes to model transferability (predictive ability on independent spatial folds, measured with `rf_evaluate()`; see below). It does so by comparing the performance of the full model with that of models fitted without each one of the predictors. The difference in performance between the full model and a model without a given predictor represents the contribution of that predictor to model transferability.
```{r, fig.width = 8, fig.height=4.5, message=FALSE, warning=FALSE}
model.non.spatial <- spatialRF::rf_importance(
model = model.non.spatial
)
```
The function results are added to the "importance" slot of the model.
```{r}
names(model.non.spatial$importance)
```
The data frame "per.variable" contains the columns "importance.cv" (median importance), "importance.cv.mad" (median absolute deviation), "importance.cv.percent" (median importance in percentage), and "importance.cv.percent.mad" (median absolute deviation of the importance in percent). The ggplot object "cv.per.variable.plot" contains the importance plot with the median and the median absolute deviation shown above.
The importance computed by random forest on the out-of-bag data by permuting each predictor (as computed by `rf()`) and the contribution of each predictor to model transferability (as computed by `rf_importance()`) show a moderate correlation, indicating that the two measures capture different aspects of the effect of the variables on the model results.
```{r, fig.width = 6, fig.height=4.5, message=FALSE, warning=FALSE}
model.non.spatial$importance$per.variable %>%
ggplot2::ggplot() +
ggplot2::aes(
x = importance.oob,
y = importance.cv
) +
ggplot2::geom_point(size = 3) +
ggplot2::theme_bw() +
ggplot2::xlab("Importance (out-of-bag)") +
ggplot2::ylab("Contribution to transferability") +
ggplot2::geom_smooth(method = "lm", formula = y ~ x, color = "red4")
```
### Local variable importance
Random forest also computes the average increase in error when a variable is permuted for each case. This is named "local importance", is stored in `model.non.spatial$importance$local` as a data frame, and can be retrieved with [`get_importance_local()`](https://blasbenito.github.io/spatialRF/reference/get_importance_local.html).
```{r}
local.importance <- spatialRF::get_importance_local(model.non.spatial)
```
The table below shows the first few records and columns. Larger values indicate larger average errors when estimating a case with the permuted version of the variable, so more important variables will show larger values.
```{r, echo=TRUE}
kableExtra::kbl(
round(local.importance[1:10, 1:5], 0),
format = "html"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
When case coordinates are joined with the local importance scores, it is possible to draw maps showing how variable importance changes over space.
```{r}
#adding coordinates
local.importance <- cbind(
xy,
local.importance
)
```
```{r, echo=TRUE, fig.width=8, fig.height=5}
#colors
color.low <- viridis::viridis(
3,
option = "F"
)[2]
color.high <- viridis::viridis(
3,
option = "F"
)[1]
#plot of climate_bio1_average
p1 <- ggplot2::ggplot() +
ggplot2::geom_sf(
data = world,
fill = "white"
) +
ggplot2::geom_point(
data = local.importance,
ggplot2::aes(
x = x,
y = y,
color = climate_bio1_average
)
) +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::scale_color_gradient2(
low = color.low,
high = color.high
) +
ggplot2::theme_bw() +
ggplot2::theme(legend.position = "bottom") +
ggplot2::ggtitle("climate_bio1_average") +
ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5),
legend.key.width = ggplot2::unit(1,"cm")
) +
ggplot2::labs(color = "Importance") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p2 <- ggplot2::ggplot() +
ggplot2::geom_sf(
data = world,
fill = "white"
) +
ggplot2::geom_point(
data = local.importance,
ggplot2::aes(
x = x,
y = y,
color = human_population
)
) +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::scale_color_gradient2(
low = color.low,
high = color.high
) +
ggplot2::theme_bw() +
ggplot2::theme(legend.position = "bottom") +
ggplot2::ggtitle("human_population") +
ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5),
legend.key.width = ggplot2::unit(1,"cm")
) +
ggplot2::labs(color = "Importance") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p1 + p2
```
In these maps, values lower than 0 indicate that for a given record, the permuted version of the variable led to an accuracy score even higher than the one of the non-permuted variable, so again these negative values can be interpreted as "worse than chance".
## Response curves and surfaces
The variable importance scores are also used by the function [`plot_response_curves()`](https://blasbenito.github.io/spatialRF/reference/plot_response_curves.html) to plot partial dependence curves for the predictors (by default, only those with an importance score above the median). Building the partial dependence curve of a predictor requires setting the other predictors to their quantiles (0.1, 0.5, and 0.9 by default). This helps to understand how the response curve of a variable changes when all the other variables have low, centered, or high values. The function also allows plotting the training data along with the curves.
```{r, fig.width = 9, fig.height=7}
spatialRF::plot_response_curves(
model.non.spatial,
quantiles = c(0.1, 0.5, 0.9),
line.color = viridis::viridis(
3, #same number of colors as quantiles
option = "F",
end = 0.9
),
ncol = 3,
show.data = TRUE
)
```
Setting the argument `quantiles` to 0.5 and setting `show.data` to `FALSE` (the default option) accentuates the shape of the response curves.
```{r, fig.width = 9, fig.height=7}
spatialRF::plot_response_curves(
model.non.spatial,
quantiles = 0.5,
ncol = 3
)
```
The package [`pdp`](https://bgreenwell.github.io/pdp/index.html) provides a general way to plot partial dependence plots.
```{r, fig.width=4, fig.height=3}
pdp::partial(
model.non.spatial,
train = plant_richness_df,
pred.var = "climate_bio1_average",
plot = TRUE,
grid.resolution = 200
)
```
If you need to do your own plots in a different way, the function [`get_response_curves()`](https://blasbenito.github.io/spatialRF/reference/get_response_curves.html) returns a data frame with the required data.
```{r}
reponse.curves.df <- spatialRF::get_response_curves(model.non.spatial)
```
```{r, echo=TRUE, warning = FALSE, message=FALSE}
kableExtra::kbl(
head(reponse.curves.df, n = 10),
format = "html"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
Interactions between two variables can be plotted with [`plot_response_surface()`](https://blasbenito.github.io/spatialRF/reference/plot_response_surface.html).
```{r, fig.height = 3.2, fig.width=4.5}
spatialRF::plot_response_surface(
model.non.spatial,
a = "climate_bio1_average",
b = "neighbors_count"
)
```
This can also be done with the `pdp` package, which uses a slightly different algorithm to plot interaction surfaces.
```{r, fig.height = 3.2, fig.width=4.5}
pdp::partial(
model.non.spatial,
train = plant_richness_df,
pred.var = c("climate_bio1_average", "neighbors_count"),
plot = TRUE
)
```
## Model performance
The **performance** slot (in `model.non.spatial$performance`) contains the values of several performance measures. It can be printed via the function [`print_performance()`](https://blasbenito.github.io/spatialRF/reference/print_performance.html).
```{r}
spatialRF::print_performance(model.non.spatial)
```
+ `R squared (oob)` and `RMSE (oob)` are the R squared of the model and its root mean squared error when predicting the out-of-bag data (the fraction of data not used to train each individual tree). Of all the values available in the `performance` slot, these are probably the most honest ones, as they are the closest to a performance estimate on independent data. However, out-of-bag data is not fully independent, so these scores may still be inflated, especially if the data is highly aggregated in space.
+ `R squared` and `pseudo R squared` are computed from the observations and the predictions, and indicate to what extent the model outcomes represent the input data. These values will usually be high when the data is highly aggregated in space.
+ The `RMSE` and its normalized version are computed via [`root_mean_squared_error()`](https://blasbenito.github.io/spatialRF/reference/root_mean_squared_error.html), and are linearly related to `R squared` and `pseudo R squared`.
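Since these metrics come straight from the observations and predictions, they are easy to reproduce by hand. The sketch below computes the RMSE, its normalized version, and a pseudo R squared from two toy vectors; the exact definitions used internally by the package (e.g., whether pseudo R squared is the correlation or its square, and how the RMSE is normalized) are assumptions here, so check the function documentation before relying on them.
```{r, eval = FALSE}
#toy observed and predicted vectors (hypothetical values)
observed <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

#root mean squared error
rmse <- sqrt(mean((observed - predicted)^2))

#normalized RMSE (assuming normalization by the range of the observations)
rmse.normalized <- rmse / (max(observed) - min(observed))

#pseudo R squared (assuming it is defined as the Pearson correlation
#between observations and predictions)
pseudo.r.squared <- cor(observed, predicted)
```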
## Spatial cross-validation
The function [rf_evaluate()](https://blasbenito.github.io/spatialRF/reference/rf_evaluate.html) overcomes the limitations of the performance scores explained above by providing honest performance based on *spatial cross-validation*. The function separates the data into a number of spatially independent training and testing folds. Then, it fits a model on each training fold, predicts over each testing fold, and computes statistics of performance measures across folds. Let's see how it works.
```{r}
model.non.spatial <- spatialRF::rf_evaluate(
model = model.non.spatial,
xy = xy, #data coordinates
repetitions = 30, #number of spatial folds
training.fraction = 0.75, #training data fraction on each fold
metrics = "r.squared",
seed = random.seed,
verbose = FALSE
)
```
The function generates a new slot in the model named **evaluation** (`model.non.spatial$evaluation`) with several objects that summarize the spatial cross-validation results.
```{r}
names(model.non.spatial$evaluation)
```
The slot "spatial.folds", produced by [`make_spatial_folds()`](https://blasbenito.github.io/spatialRF/reference/make_spatial_folds.html), contains the indices of the training and testing cases for each cross-validation repetition. The maps below show two sets of training and testing folds.
```{r, echo=TRUE, fig.width=10, fig.height=5}
pr <- plant_richness_df[, c("x", "y")]
pr$group.2 <- pr$group.1 <- "Training"
pr[model.non.spatial$evaluation$spatial.folds[[1]]$testing, "group.1"] <- "Testing"
pr[model.non.spatial$evaluation$spatial.folds[[25]]$testing, "group.2"] <- "Testing"
p1 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(data = pr,
ggplot2::aes(
x = x,
y = y,
color = group.1
),
size = 2
) +
ggplot2::scale_color_viridis_d(
direction = -1,
end = 0.5,
alpha = 0.8,
option = "F"
) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Group") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Spatial fold 1") +
ggplot2::theme(
legend.position = "none",
plot.title = ggplot2::element_text(hjust = 0.5)
) +
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p2 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(data = pr,
ggplot2::aes(
x = x,
y = y,
color = group.2
),
size = 2
) +
ggplot2::scale_color_viridis_d(
direction = -1,
end = 0.5,
alpha = 0.8,
option = "F"
) +
ggplot2::theme_bw() +
ggplot2::labs(color = "Group") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::theme(
plot.title = ggplot2::element_text(hjust = 0.5)
) +
ggplot2::ggtitle("Spatial fold 25") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("")
p1 | p2
```
The information available in this new slot can be accessed with the functions [`print_evaluation()`](https://blasbenito.github.io/spatialRF/reference/print_evaluation.html), [`plot_evaluation()`](https://blasbenito.github.io/spatialRF/reference/plot_evaluation.html), and [`get_evaluation()`](https://blasbenito.github.io/spatialRF/reference/get_evaluation.html).
```{r, fig.height = 2, fig.width=4.5}
spatialRF::plot_evaluation(model.non.spatial)
```
`Full` represents the R squared of the model trained on the full dataset. `Training` shows the R squared of the models fitted on the spatial folds (named `Training` in the maps above), and `Testing` shows the R squared of the same models on "unseen" data (data not used to train them, named `Testing` in the maps above). The median, median absolute deviation (MAD), minimum, and maximum R squared values on the testing folds can be printed with `print_evaluation()`.
```{r, fig.height = 2, fig.width=4.5}
spatialRF::print_evaluation(model.non.spatial)
```
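If the plot and the printed summary are not enough, `get_evaluation()` returns the underlying data frame, which can be summarized in any way. The sketch below assumes the returned data frame has a `model` column identifying the `Full`, `Training`, and `Testing` scores, and one column per metric; these column names are assumptions, so check the object with `str()` first.
```{r, eval = FALSE}
#data frame with the evaluation results across spatial folds
evaluation.df <- spatialRF::get_evaluation(model.non.spatial)
str(evaluation.df)

#median R squared on the testing folds
#(column names assumed, verify them with str() first)
median(evaluation.df[evaluation.df$model == "Testing", "r.squared"])
```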
## Other important things stored in the model
The model predictions are stored in the slot **predictions**, the arguments used to fit the model in **ranger.arguments**, and the model itself, used to predict new values (see code chunk below), is in the **forest** slot.
```{r}
predicted <- stats::predict(
object = model.non.spatial,
data = plant_richness_df,
type = "response"
)$predictions
```
# Fitting a spatial model with `rf_spatial()`
The spatial autocorrelation of the residuals of a model like `model.non.spatial`, measured with [Moran's I](https://en.wikipedia.org/wiki/Moran%27s_I), can be plotted with [`plot_moran()`](https://blasbenito.github.io/spatialRF/reference/plot_moran.html).
```{r, fig.width=4, fig.height=2.5, message=FALSE, warning=FALSE}
spatialRF::plot_moran(
model.non.spatial,
verbose = FALSE
)
```
According to the plot, the spatial autocorrelation of the residuals of `model.non.spatial` is highly positive at the 0 and 1000 km neighborhood distances, while it becomes non-significant (p-value \> 0.05) at 2000, 4000, and 8000 km. To reduce the spatial autocorrelation of the residuals as much as possible, the non-spatial model can be transformed into a *spatial model* very easily with the function [`rf_spatial()`](https://blasbenito.github.io/spatialRF/reference/rf_spatial.html). This function is the true core of the package!
```{r}
model.spatial <- spatialRF::rf_spatial(
model = model.non.spatial,
method = "mem.moran.sequential", #default method
verbose = FALSE,
seed = random.seed
)
```
The plot below shows the Moran's I of the residuals of the spatial model, and indicates that the residuals are not autocorrelated at any distance.
```{r, fig.width=4, fig.height=2.5, message=FALSE, warning=FALSE}
spatialRF::plot_moran(
model.spatial,
verbose = FALSE
)
```
If we compare the variable importance plots of both models, we can see that the spatial model has an additional set of dots under the name "spatial_predictors", and that the maximum importance of a few of these *spatial predictors* matches the importance of the most relevant non-spatial predictors.
```{r, fig.width=10, fig.height=4.5}
p1 <- spatialRF::plot_importance(
model.non.spatial,
verbose = FALSE) +
ggplot2::ggtitle("Non-spatial model")
p2 <- spatialRF::plot_importance(
model.spatial,
verbose = FALSE) +
ggplot2::ggtitle("Spatial model")
p1 | p2
```
If we look at the ten most important variables in `model.spatial` we will see that a few of them are *spatial predictors*. Spatial predictors are named `spatial_predictor_X_Y`, where `X` is the neighborhood distance at which the predictor has been generated, and `Y` is the index of the predictor.
```{r, echo=TRUE, warning = FALSE, message=FALSE}
kableExtra::kbl(
head(model.spatial$importance$per.variable, n = 10),
format = "html"
) %>%
kableExtra::kable_paper("hover", full_width = F)
```
But what are spatial predictors? Spatial predictors, as shown below, are smooth surfaces representing neighborhood relationships among records at different spatial scales. They are computed from the distance matrix in different ways. The ones below are the eigenvectors of the double-centered matrix of spatial weights (a.k.a. Moran's Eigenvector Maps). They represent the effect of spatial proximity among records, helping to account for biogeographic and spatial processes not captured by the non-spatial predictors.
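The idea behind Moran's Eigenvector Maps can be illustrated with base R: spatial weights are derived from the distance matrix, double-centered, and eigendecomposed. The weighting scheme (`1 / (1 + d)`) and the lack of distance thresholding are simplifying assumptions in this sketch; the package's internal computation handles these details differently.
```{r, eval = FALSE}
#minimal sketch of Moran's Eigenvector Maps (not the package's exact internals)
#spatial weights from the distance matrix (assumed weighting scheme)
w <- 1 / (1 + distance_matrix)
diag(w) <- 0

#double-centering: H %*% w %*% H, with H = I - 11'/n
n <- nrow(w)
h <- diag(n) - matrix(1/n, nrow = n, ncol = n)
w.centered <- h %*% w %*% h

#the eigenvectors of the double-centered weights matrix
#are the candidate spatial predictors
mems <- eigen(w.centered, symmetric = TRUE)$vectors
```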
```{r, echo=TRUE, fig.width=12, fig.height=7}
spatial.predictors <- spatialRF::get_spatial_predictors(model.spatial)
pr <- data.frame(spatial.predictors, plant_richness_df[, c("x", "y")])
p1 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(
data = pr,
ggplot2::aes(
x = x,
y = y,
color = spatial_predictor_0_2
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(option = "F") +
ggplot2::theme_bw() +
ggplot2::labs(color = "Eigenvalue") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Variable: spatial_predictor_0_2") +
ggplot2::theme(legend.position = "bottom")+
ggplot2::xlab("Longitude") +
ggplot2::ylab("Latitude")
p2 <- ggplot2::ggplot() +
ggplot2::geom_sf(data = world, fill = "white") +
ggplot2::geom_point(
data = pr,
ggplot2::aes(
x = x,
y = y,
color = spatial_predictor_0_5,
),
size = 2.5
) +
ggplot2::scale_color_viridis_c(option = "F") +
ggplot2::theme_bw() +
ggplot2::labs(color = "Eigenvalue") +
ggplot2::scale_x_continuous(limits = c(-170, -30)) +
ggplot2::scale_y_continuous(limits = c(-58, 80)) +
ggplot2::ggtitle("Variable: spatial_predictor_0_5") +
ggplot2::theme(legend.position = "bottom") +
ggplot2::xlab("Longitude") +
ggplot2::ylab("")
p1 | p2
```
The spatial predictors are included in the model one by one, in the order of their Moran's I (spatial predictors with Moran's I lower than 0 are removed). The selection procedure is performed by the function [`select_spatial_predictors_sequential()`](https://blasbenito.github.io/spatialRF/reference/select_spatial_predictors_sequential.html), which finds the smallest subset of spatial predictors that maximizes the model's R squared while minimizing the Moran's I of the residuals. This is shown in the optimization plot below (dots linked by lines represent the selected spatial predictors).
```{r, echo=TRUE, fig.width=6, fig.height=4}
p <- spatialRF::plot_optimization(model.spatial)
```
# Tuning Random Forest hyperparameters
The model fitted above was based on the default random forest hyperparameters of `ranger()`, and those might not be the most adequate ones for a given dataset. The function [`rf_tuning()`](https://blasbenito.github.io/spatialRF/reference/rf_tuning.html) helps the user to choose sensible values for three Random Forest hyperparameters that are critical to model performance:
- `num.trees`: number of regression trees in the forest.
- `mtry`: number of variables to choose from on each tree split.
- `min.node.size`: minimum number of cases on a terminal node.
These values can be modified in any model fitted with the package using the `ranger.arguments` argument. The example below shows how to fit a spatial model with a given set of hyperparameters.
```{r, eval = FALSE}
model.spatial <- spatialRF::rf_spatial(
model = model.non.spatial,
method = "mem.moran.sequential", #default method
ranger.arguments = list(
mtry = 5,
min.node.size = 20,
num.trees = 1000
),
verbose = FALSE,
seed = random.seed
)
```
The usual method for model tuning relies on a grid search exploring the results of all the combinations of hyperparameters selected by the user. In `spatialRF`, model tuning is done via spatial cross-validation, to ensure that the selected combination of hyperparameters maximizes the ability of the model to predict over data not used to train it.
**WARNING**: model tuning consumes a lot of computational resources and is very RAM-hungry; on large datasets it might freeze your computer. You can control RAM usage by setting a lower value for the argument `n.cores`.
```{r, fig.width=6, fig.height=4.5, eval = FALSE}
model.spatial <- rf_tuning(
model = model.spatial,
xy = xy,
repetitions = 30,
num.trees = c(500, 1000),
mtry = seq(
2,
length(model.spatial$ranger.arguments$predictor.variable.names), #number of predictors
by = 9),
min.node.size = c(5, 15),
seed = random.seed,
verbose = FALSE
)
```
The function returns a tuned model only if the tuning finds a solution better than the original model. Otherwise the original model is returned. The results of the tuning are stored in the model under the name "tuning".
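The contents of that slot can be inspected directly; `names()` reveals what the tuning stored, without assuming any particular internal structure.
```{r, eval = FALSE}
#inspect the tuning results stored by rf_tuning()
names(model.spatial$tuning)
```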
# Repeating a model execution
Random Forest is a stochastic algorithm that yields slightly different results on each run unless a random seed is set. This has implications for the interpretation of variable importance scores and response curves. The function [`rf_repeat()`](https://blasbenito.github.io/spatialRF/reference/rf_repeat.html) repeats a model execution and yields the distribution of importance scores of the predictors across executions. **NOTE**: this function works better when used at the end of a workflow.
```{r, fig.width = 6, fig.height=4.5}
model.spatial.repeat <- spatialRF::rf_repeat(
model = model.spatial,
repetitions = 30,
seed = random.seed,
verbose = FALSE
)
```
The importance scores of a model fitted with `rf_repeat()` are plotted as a violin plot, with the distribution of the importance scores of each predictor across repetitions.
```{r, fig.width = 6, fig.height=5}
spatialRF::plot_importance(
model.spatial.repeat,
verbose = FALSE
)
```
The response curves of models fitted with `rf_repeat()` can be plotted with `plot_response_curves()` as well. The median prediction is shown with a thicker line.
```{r, fig.width = 9, fig.height=7}
spatialRF::plot_response_curves(
model.spatial.repeat,
quantiles = 0.5,
ncol = 3
)
```
The function `print_performance()` generates a summary of the performance scores across model repetitions. Like every other function of the package involving repetitions, the provided statistics are the median and the median absolute deviation (MAD).
```{r}
spatialRF::print_performance(model.spatial.repeat)
```
# Taking advantage of the `%>%` pipe
The modeling functions of `spatialRF` are designed to facilitate using the pipe to combine them. The code below fits a spatial model, tunes its hyperparameters, evaluates it using spatial cross-validation, and repeats the execution several times, just by passing the model from one function to another. Replace `eval = FALSE` with `eval = TRUE` if you want to execute the code chunk.
```{r, eval = FALSE}
model.full <- rf_spatial(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance_matrix,
distance.thresholds = distance.thresholds,
xy = xy
) %>%
rf_tuning() %>%
rf_evaluate() %>%
rf_repeat()
```
The code structure shown above can also be used to take advantage of a custom cluster, either defined in the local host, or a Beowulf cluster.
When working with a single machine, a cluster can be defined and used as follows:
```{r, eval = FALSE}
#creating and registering the cluster
local.cluster <- parallel::makeCluster(
parallel::detectCores() - 1,
type = "PSOCK"
)
doParallel::registerDoParallel(cl = local.cluster)
#fitting, tuning, evaluating, and repeating a model
model.full <- rf_spatial(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance_matrix,
distance.thresholds = distance.thresholds,
xy = xy,
cluster = local.cluster #is passed via pipe to the other functions
) %>%
rf_tuning() %>%
rf_evaluate() %>%
rf_repeat()
#stopping the cluster
parallel::stopCluster(cl = local.cluster)
```
To facilitate working with Beowulf clusters ([just several computers connected via SSH](https://www.blasbenito.com/post/01_home_cluster/)), the package provides the function `beowulf_cluster()`, which generates the cluster definition from details such as the IPs of the machines, the number of cores to be used on each machine, the user name, and the connection port.
```{r, eval = FALSE}
#creating and registering the cluster
beowulf.cluster <- beowulf_cluster(
cluster.ips = c(
"10.42.0.1",
"10.42.0.34",
"10.42.0.104"