forked from geocompx/geocompy
-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-spatial-operations.qmd
1344 lines (1067 loc) · 79.6 KB
/
03-spatial-operations.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
jupyter: python3
---
# Spatial data operations {#sec-spatial-operations}
## Prerequisites {.unnumbered}
```{python}
#| echo: false
import book_options
```
```{python .content-visible when-format="pdf"}
#| echo: false
import book_options_pdf
```
This chapter requires importing the following packages:
```{python}
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.ndimage
import scipy.stats
import shapely
import geopandas as gpd
import rasterio
import rasterio.plot
import rasterio.merge
import rasterio.features
```
It also relies on the following data files:
```{python}
#| echo: false
from pathlib import Path
data_path = Path('data')
file_path = Path('data/landsat.tif')
if not file_path.exists():
if not data_path.is_dir():
os.mkdir(data_path)
print('Attempting to get the data')
import requests
r = requests.get('https://github.com/geocompx/geocompy/releases/download/0.1/landsat.tif')
with open(file_path, 'wb') as f:
f.write(r.content)
```
```{python}
nz = gpd.read_file('data/nz.gpkg')
nz_height = gpd.read_file('data/nz_height.gpkg')
world = gpd.read_file('data/world.gpkg')
cycle_hire = gpd.read_file('data/cycle_hire.gpkg')
cycle_hire_osm = gpd.read_file('data/cycle_hire_osm.gpkg')
src_elev = rasterio.open('output/elev.tif')
src_landsat = rasterio.open('data/landsat.tif')
src_grain = rasterio.open('output/grain.tif')
```
## Introduction
Spatial operations, including spatial joins between vector datasets and local and focal operations on raster datasets, are a vital part of geocomputation.
This chapter shows how spatial objects can be modified in a multitude of ways based on their location and shape. Many spatial operations have a non-spatial (attribute) equivalent, so concepts such as subsetting and joining datasets demonstrated in the previous chapter are applicable here.
This is especially true for vector operations: @sec-vector-attribute-manipulation on vector attribute manipulation provides the basis for understanding its spatial counterpart, namely spatial subsetting (covered in @sec-spatial-subsetting-vector).
Spatial joining (@sec-spatial-joining) and aggregation (@sec-vector-spatial-aggregation) also have non-spatial counterparts, covered in the previous chapter.
Spatial operations differ from non-spatial operations in a number of ways, however.
Spatial joins, for example, can be done in a number of ways---including matching entities that intersect with or are within a certain distance of the target dataset---while the attribution joins discussed in @sec-vector-attribute-joining in the previous chapter can only be done in one way.
Different types of spatial relationship between objects, including intersects and disjoint, are described in @sec-topological-relations.
Another unique aspect of spatial objects is distance: all spatial objects are related through space, and distance calculations can be used to explore the strength of this relationship, as described in the context of vector data in @sec-distance-relations.
Spatial operations on raster objects include subsetting---covered in @sec-spatial-subsetting-raster---and merging several raster 'tiles' into a single object, as demonstrated in @sec-merging-rasters.
Map algebra covers a range of operations that modify raster cell values, with or without reference to surrounding cell values.
The concept of map algebra, vital for many applications, is introduced in @sec-map-algebra; local, focal and zonal map algebra operations are covered in sections @sec-raster-local-operations, @sec-focal-operations, and @sec-zonal-operations, respectively.
Global map algebra operations, which generate summary statistics representing an entire raster dataset, and distance calculations on rasters, are discussed in Section @sec-global-operations-and-distances.
::: callout-note
It is important to note that spatial operations that use two spatial objects rely on both objects having the same coordinate reference system, a topic that was introduced in @sec-coordinate-reference-systems-intro and which will be covered in more depth in @sec-reproj-geo-data.
:::
## Spatial operations on vector data {#sec-spatial-vec}
This section provides an overview of spatial operations on vector geographic data represented as Simple Features using the **shapely** and **geopandas**
packages.
@sec-spatial-ras then presents spatial operations on raster datasets, using the **rasterio** and **scipy** packages.
### Spatial subsetting {#sec-spatial-subsetting-vector}
<!-- jn: have we considered switching the order of spatial subsetting with topological relations? -->
<!-- md: I don't think we considered it yet (the order as the moment is the same as in 'geocompr') -->
Spatial subsetting is the process of taking a spatial object and returning a new object containing only features that relate in space to another object.
Analogous to attribute subsetting (covered in @sec-vector-attribute-subsetting), subsets of `GeoDataFrame`s can be created with square bracket (`[`) operator using the syntax `x[y]`, where `x` is an `GeoDataFrame` from which a subset of rows/features will be returned, and `y` is a boolean `Series`.
The difference is, that, in spatial subsetting `y` is created based on another geometry and using one of the binary geometry relation methods, such as `.intersects` (see @sec-topological-relations), rather based on comparison based on ordinary columns.
To demonstrate spatial subsetting, we will use the `nz` and `nz_height` layers, which contain geographic data on the 16 main regions and 101 highest points in New Zealand, respectively (@fig-spatial-subset (a)), in a projected coordinate system.
The following expression creates a new object, `canterbury`, representing only one region --- Canterbury.
```{python}
canterbury = nz[nz['Name'] == 'Canterbury']
canterbury
```
Then, we use the `.intersects` method evaluate, for each of the `nz_height` points, whether they intersect with Canterbury.
The result `canterbury_height` is a boolean `Series` with the "answers".
```{python}
sel = nz_height.intersects(canterbury.geometry.iloc[0])
sel
```
Finally, we can subset `nz_height` using the obtained `Series`, resulting in the subset `canterbury_height` with only those points that intersect with Canterbury.
```{python}
canterbury_height = nz_height[sel]
canterbury_height
```
@fig-spatial-subset compares the original `nz_height` layer (left) with the subset `canterbury_height` (right).
<!-- jn: do we need to repeat the `base = nz.plot(color='white', edgecolor='lightgrey')` code here? -->
<!-- md: The expression to create 'base' has to be repeated. can't say I completely understand the mechanism, but I think that once a 'matplotlib' plot is drawn, the plot objects are emptied, so we have to re-create them when making a new plot. maybe there is a solution to it that I missed, but currently we're creating each plot intependently like in the code below. -->
```{python}
#| label: fig-spatial-subset
#| fig-cap: Spatial subsetting of points by intersection with polygon
#| fig-subcap:
#| - Original points (red)
#| - Spatial subset based on intersection (red), geometry used for subsetting (Canterbury) (grey)
#| layout-ncol: 2
# Original
base = nz.plot(color='white', edgecolor='lightgrey')
nz_height.plot(ax=base, color='None', edgecolor='red');
# Subset (intersects)
base = nz.plot(color='white', edgecolor='lightgrey')
canterbury.plot(ax=base, color='lightgrey', edgecolor='darkgrey')
canterbury_height.plot(ax=base, color='None', edgecolor='red');
```
Like in attribute subsetting (@sec-vector-attribute-subsetting), we are using a boolean series (`sel`), of the same length as the number of rows in the filtered table (`nz_height`), created based on a condition applied on itself.
The difference is that the condition is not a comparison of attribute values, but an evaluation of a spatial relation.
Namely, we evaluate whether each geometry of `nz_height` intersects with `canterbury` geometry, using the `.intersects` method.
Various topological relations can be used for spatial subsetting which determine the type of spatial relationship that features in the target object must have with the subsetting object to be selected.
These include touches, crosses, or within, as we will see shortly in @sec-topological-relations.
Intersects (`.intersects`), which we used in the last example, is the the most commonly used method.
This is a 'catch all' topological relation, that will return features in the target that touch, cross or are within the source 'subsetting' object.
As an example of another method, we can use `.disjoint` to obtain all points that *do not* intersect with Canterbury.
```{python}
sel = nz_height.disjoint(canterbury.geometry.iloc[0])
canterbury_height2 = nz_height[sel]
```
The results are shown in @fig-spatial-subset-disjoint, which compares the original `nz_height` layer (left) with the subset `canterbury_height2` (right).
```{python}
#| label: fig-spatial-subset-disjoint
#| fig-cap: Spatial subsetting of points disjoint from a polygon
#| fig-subcap:
#| - Original points (red)
#| - Spatial subset based on disjoint (red), geometry used for subsetting (Canterbury) (grey)
#| layout-ncol: 2
# Original
base = nz.plot(color='white', edgecolor='lightgrey')
nz_height.plot(ax=base, color='None', edgecolor='red');
# Subset (disjoint)
base = nz.plot(color='white', edgecolor='lightgrey')
canterbury.plot(ax=base, color='lightgrey', edgecolor='darkgrey')
canterbury_height2.plot(ax=base, color='None', edgecolor='red');
```
In case we need to subset according to several geometries at once, e.g., find out which points intersect with both Canterbury and Southland, we can dissolve the filtering subset, using `.unary_union`, before applying the `.intersects` (or any other) operator.
For example, here is how we can subset the `nz_height` points which intersect with Canterbury or Southland.
(Note that we are also using the `.isin` method, as demonstrated in the end of @sec-vector-attribute-subsetting.)
```{python}
canterbury_southland = nz[nz['Name'].isin(['Canterbury', 'Southland'])]
sel = nz_height.intersects(canterbury_southland.unary_union)
canterbury_southland_height = nz_height[sel]
canterbury_southland_height
```
@fig-spatial-subset2 shows the results of the spatial subsetting of `nz_height` points by intersection with Canterbury and Southland.
```{python}
#| label: fig-spatial-subset2
#| fig-cap: Spatial subsetting of points by intersection with more that one polygon
#| fig-subcap:
#| - Original points (red)
#| - Spatial subset based on intersection (red), geometry used for subsetting (Canterbury and Southland) (grey)
#| layout-ncol: 2
# Original
base = nz.plot(color='white', edgecolor='lightgrey')
nz_height.plot(ax=base, color='None', edgecolor='red');
# Subset by intersection with two polygons
base = nz.plot(color='white', edgecolor='lightgrey')
canterbury_southland.plot(ax=base, color='lightgrey', edgecolor='darkgrey')
canterbury_southland_height.plot(ax=base, color='None', edgecolor='red');
```
The next section further explores different types of spatial relation, also known as binary predicates (of which `.intersects` and `.disjoint` are two examples), that can be used to identify whether or not two features are spatially related.
### Topological relations {#sec-topological-relations}
Topological relations describe the spatial relationships between objects.
"Binary topological relationships", to give them their full name, are logical statements (in that the answer can only be `True` or `False`) about the spatial relationships between two objects defined by ordered sets of points (typically forming points, lines and polygons) in two or more dimensions [@egenhofer_mathematical_1990].
That may sound rather abstract and, indeed, the definition and classification of topological relations is based on mathematical foundations first published in book form in 1966 [@spanier_algebraic_1995], with the field of algebraic topology continuing into the 21st century [@dieck_algebraic_2008].
Despite their mathematical origins, topological relations can be understood intuitively with reference to visualizations of commonly used functions that test for common types of spatial relationships.
@fig-spatial-relations shows a variety of geometry pairs and their associated relations.
The third and fourth pairs in @fig-spatial-relations (from left to right and then down) demonstrate that, for some relations, order is important: while the relations equals, intersects, crosses, touches and overlaps are symmetrical, meaning that if `x.relation(y)` is true, `y.relation(x)` will also be true, relations in which the order of the geometries are important such as contains and within are not.
::: callout-note
Notice that each geometry pair has a ["DE-9IM"](https://en.wikipedia.org/wiki/DE-9IM) string such as `FF2F11212`.
DE-9IM strings describe the dimensionality (0=points, 1=lines, 2=polygons) of the pairwise intersections of the interior, boundary, and exterior, of two geometries (i.e., nine values of 0/1/2 encoded into a string).
This is an advanced topic beyond the scope of this book, which can be useful to understand the difference between relation types, or define custom types of relations.
See the [DE-9IM strings](https://r.geocompx.org/spatial-operations#de-9im-strings) section in Geocomputation with R [@lovelace_geocomputation_2019].
Also note that the **shapely** package contains the `.relate` and `.relate_pattern` [methods](https://shapely.readthedocs.io/en/stable/manual.html#de-9im-relationships), for derive and test for DE-9IM patterns, respectively.
:::
![Topological relations between vector geometries, inspired by Figures 1 and 2 in Egenhofer and Herring (1990). The relations for which the `x.relation(y)` is true are printed for each geometry pair, with `x` represented in pink and `y` represented in blue. The nature of the spatial relationship for each pair is described by the Dimensionally Extended 9-Intersection Model string.](images/relations-1.png){#fig-spatial-relations}
In **shapely**, methods testing for different types of topological relations are known as ["relationships"](https://shapely.readthedocs.io/en/stable/manual.html#relationships).
**geopandas** provides their wrappers (with the same method name) which can be applied on multiple geometries at once (such as `.intersects` and `.disjoint` applied on all points in `nz_height`, see @sec-spatial-subsetting-vector).
To see how topological relations work in practice, let's create a simple reproducible example, building on the relations illustrated in @fig-spatial-relations and consolidating knowledge of how vector geometries are represented from a previous chapter (@sec-geometry-columns and @sec-geometries).
```{python}
points = gpd.GeoSeries([
shapely.Point(0.2,0.1),
shapely.Point(0.7,0.2),
shapely.Point(0.4,0.8)
])
line = gpd.GeoSeries([
shapely.LineString([(0.4,0.2), (1,0.5)])
])
poly = gpd.GeoSeries([
shapely.Polygon([(0,0), (0,1), (1,1), (1,0.5), (0,0)])
])
```
The sample dataset which we created is composed of three is `GeoSeries`: named `points`, `line`, and `poly`, which are visualized in @fig-spatial-relations-geoms.
The last expression is a `for` loop used to add text labels (`1`, `2`, and `3`) to identify the points; we are going to explain the concepts of text annotations with **geopandas** `.plot` in @sec-plot-static-labels.
```{python}
#| label: fig-spatial-relations-geoms
#| fig-cap: Points, line and polygon objects arranged to illustrate topological relations
base = poly.plot(color='lightgrey', edgecolor='red')
line.plot(ax=base, color='black', linewidth=7)
points.plot(ax=base, color='none', edgecolor='black')
for i in enumerate(points):
base.annotate(
i[0], xy=(i[1].x, i[1].y),
xytext=(3, 3), textcoords='offset points', weight='bold'
)
```
A simple query is: which of the points in `points` intersect in some way with polygon `poly`?
The question can be answered by visual inspection (points 1 and 3 are touching and are within the polygon, respectively).
Alternatively, we can get the solution with the `.intersects` method, which reports whether or not each geometry in a `GeoSeries` (`points`) intersects with a single `shapely` geometry (`poly.iloc[0]`).
```{python}
points.intersects(poly.iloc[0])
```
The result shown above is a boolean `Series`.
Its contents should match our intuition: positive (`True`) results are returned for the first and third point, and a negative result (`False`) for the second.
Each value in this `Series` represents a feature in the first input (`points`).
All earlier examples in this chapter demonstrate the "many-to-one" mode of `.intersects` and analogous methods, where the relation is evaluated between each of several geometries in a `GeoSeries`/`GeoDataFrame`, and an individual `shapely` geometry.
A second mode of those methods (not demonstrated here) is when both inputs are `GeoSeries`/`GeoDataFrame` objects.
In such case, a "pairwise" evaluation takes place between geometries aligned by index (`align=True`, the default) or by position (`align=False`).
For example, the expression `nz.intersects(nz)` returns a `Series` of 16 `True` values, indicating (unsurprisingly) that each geometry in `nz` intersects with itself.
A third mode is when we are interested in a "many-to-many" evaluation, i.e., obtaining a matrix of all pairwise combinations of geometries from two `GeoSeries` objects.
At the time of writing, there is no built-in method to do this in **geopandas**.
However, the [`.apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method can be used to repeat a "many-to-one" evaluation over all geometries in the second layer, resulting in a matrix of *pairwise* results.
We will create another `GeoSeries` with two polygons, named `poly2`, to demonstrate this.
```{python}
poly2 = gpd.GeoSeries([
shapely.Polygon([(0,0), (0,1), (1,1), (1,0.5), (0,0)]),
shapely.Polygon([(0,0), (1,0.5), (1,0), (0,0)])
])
```
Our two input objects, `points` and `poly2`, are illustrated in @fig-spatial-relations-geoms2.
```{python}
#| label: fig-spatial-relations-geoms2
#| fig-cap: Inputs for demonstrating the evaluation of all pairwise intersection relations between three points (`points`) and two polygons (`poly2`)
base = poly2.plot(color='lightgrey', edgecolor='red')
points.plot(ax=base, color='none', edgecolor='black')
for i in enumerate(points):
base.annotate(
i[0], xy=(i[1].x, i[1].y),
xytext=(3, 3), textcoords='offset points', weight='bold'
)
```
Now we can use to get the intersection relations matrix.
The result is a `DataFrame`, where each row represents a `points` geometry and each column represents a `poly` geometry.
We can see that the first point intersects with both polygons, while the second and third points intersect with one of the polygons each.
```{python}
points.apply(lambda x: poly2.intersects(x))
```
::: callout-note
The [`.apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method (package **pandas**) is used to apply a function along one of the axes of a `DataFrame` (or `GeoDataFrame`).
That is, we can apply a function on all rows (`axis=1`) or all columns (`axis=0`, the default).
When the function being applied returns a single value, the output of `.apply` is a `Series` (e.g., `.apply(len)` returns the lengths of all columns, because `len` returns a single value).
When the function returns a `Series`, then `.apply` returns a `DataFrame` (such as in the above example.)
:::
::: callout-note
Since the above result, like any pairwise matrix, (1) is composed of values of the same type, and (2) has no contrasting role for rows and columns, is may be more convenient to use a plain **numpy** array to work with it.
In such case, we can use the [`.to_numpy`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) method to go from `DataFrame` to `ndarray`.
```{python}
points.apply(lambda x: poly2.intersects(x)).to_numpy()
```
:::
The `.intersects` method returns `True` even in cases where the features just touch: intersects is a 'catch-all' topological operation which identifies many types of spatial relation, as illustrated in @fig-spatial-relations.
More restrictive questions include which points lie within the polygon, and which features are on or contain a shared boundary with it?
The first question can be answered with `.within`, and the second with `.touches`.
```{python}
points.within(poly.iloc[0])
```
```{python}
points.touches(poly.iloc[0])
```
Note that although the first point touches the boundary polygon, it is not within it; the third point is within the polygon but does not touch any part of its border.
The opposite of `.intersects` is `.disjoint`, which returns only objects that do not spatially relate in any way to the selecting object.
```{python}
points.disjoint(poly.iloc[0])
```
Another useful type of relation is "within distance", where we detect features that intersect with the target buffered by particular distance.
Buffer distance determines how close target objects need to be before they are selected.
This can be done by literally buffering (@sec-geometries) the target geometry, and evaluating intersection (`.intersects`).
Another way is to calculate the distances using the `.distance` method, and then evaluate whether they are within a threshold distance.
```{python}
points.distance(poly.iloc[0]) < 0.2
```
Note that although the second point is more than `0.2` units of distance from the nearest vertex of `poly`, it is still selected when the distance is set to `0.2`.
This is because distance is measured to the nearest edge, in this case the part of the the polygon that lies directly above point 2 in Figure @fig-spatial-relations.
We can verify the actual distance between the second point and the polygon is `0.13`, as follows.
```{python}
points.iloc[1].distance(poly.iloc[0])
```
This is also a good opportunity to repeat that all distance-related calculations in **gopandas** (and **shapely**) assume planar geometry, and only take into account the coordinate values. It is up to the user to make sure that all input layers are in the same projected CRS, so that this type of calculations make sense (see @sec-geometry-operations-on-projected-and-unprojected-data and @sec-when-to-reproject).
### Spatial joining {#sec-spatial-joining}
Joining two non-spatial datasets uses a shared 'key' variable, as described in @sec-vector-attribute-joining.
Spatial data joining applies the same concept, but instead relies on spatial relations, described in the previous section.
As with attribute data, joining adds new columns to the target object (the argument `x` in joining functions), from a source object (`y`).
The following example illustrates the process: imagine you have ten points randomly distributed across the Earth's surface and you ask, for the points that are on land, which countries are they in?
Implementing this idea in a reproducible example will build your geographic data handling skills and show how spatial joins work.
The starting point is to create points that are randomly scattered over the planar surface that represents Earth's geographic coordinates, in decimal degrees (@fig-spatial-join (a)).
```{python}
np.random.seed(11) ## set seed for reproducibility
bb = world.total_bounds ## the world's bounds
x = np.random.uniform(low=bb[0], high=bb[2], size=10)
y = np.random.uniform(low=bb[1], high=bb[3], size=10)
random_points = gpd.points_from_xy(x, y, crs=4326)
random_points = gpd.GeoDataFrame({'geometry': random_points})
random_points
```
The scenario illustrated in @fig-spatial-join shows that the `random_points` object (top left) lacks attribute data, while the world (top right) has attributes, including country names shown for a sample of countries in the legend.
Before creating the joined dataset, we use spatial subsetting to create `world_random`, which contains only countries that contain random points, to verify the number of country names returned in the joined dataset should be four (see the top right panel of @fig-spatial-join (b)).
```{python}
world_random = world[world.intersects(random_points.unary_union)]
world_random
```
Spatial joins are implemented with `x.sjoin(y)`, as illustrated in the code chunk below.
The output is the `random_joined` object which is illustrated in @fig-spatial-join (c).
```{python}
random_joined = random_points.sjoin(world, how='left')
random_joined
```
@fig-spatial-join shows the input points and countries, the illustration of intersecting countries, and the join result.
```{python}
#| label: fig-spatial-join
#| fig-cap: Illustration of a spatial join
#| fig-subcap:
#| - A new attribute variable is added to random points,
#| - from source world object,
#| - resulting in points associated with country names
#| layout-ncol: 2
# Random points
base = world.plot(color='white', edgecolor='lightgrey')
random_points.plot(ax=base, color='None', edgecolor='red');
# World countries intersecting with the points
base = world.plot(color='white', edgecolor='lightgrey')
world_random.plot(ax=base, column='name_long');
# Points with joined country names
base = world.plot(color='white', edgecolor='lightgrey')
random_joined.geometry.plot(ax=base, color='grey')
random_joined.plot(ax=base, column='name_long', legend=True);
```
### Non-overlapping joins
Sometimes two geographic datasets do not touch but still have a strong geographic relationship.
The datasets `cycle_hire` and `cycle_hire_osm` provide a good example.
Plotting them reeveals that they are often closely related but they do not seem to touch, as shown in @fig-cycle-hire.
```{python}
#| label: fig-cycle-hire
#| fig-cap: The spatial distribution of cycle hire points in London based on official data (blue) and OpenStreetMap data (red).
base = cycle_hire.plot(edgecolor='blue', color='none')
cycle_hire_osm.plot(ax=base, edgecolor='red', color='none');
```
We can check if any of the points are the same by creating a pairwise boolean matrix of `.intersects` relations, then evaluating whether any of the values in it is `True`.
Note that the `.to_numpy` method is applied to go from a `DataFrame` to an `ndarray`, for which `.any` gives a global rather than a row-wise summary.
This is what we want in this case, because we are interested in whether any of the points intersect, not whether any of the points in each row intersect.
```{python}
m = cycle_hire.geometry.apply(
lambda x: cycle_hire_osm.geometry.intersects(x)
)
m.to_numpy().any()
```
Imagine that we need to join the capacity variable in `cycle_hire_osm` (`'capacity'`) onto the official 'target' data contained in `cycle_hire`, which looks as follows.
```{python}
cycle_hire
```
This is when a non-overlapping join is needed.
Spatial join (`gpd.sjoin`) along with buffered geometries (see @sec-buffers) can be used to do that, as demonstrated below using a threshold distance of 20 $m$.
Note that we transform the data to a projected CRS (`27700`) to use real buffer distances, in meters (see @sec-geometry-operations-on-projected-and-unprojected-data).
```{python}
crs = 27700
cycle_hire_buffers = cycle_hire.copy().to_crs(crs)
cycle_hire_buffers.geometry = cycle_hire_buffers.buffer(20)
cycle_hire_buffers = gpd.sjoin(cycle_hire_buffers, cycle_hire_osm.to_crs(crs))
cycle_hire_buffers
```
Note that the number of rows in the joined result is greater than the target.
This is because some cycle hire stations in `cycle_hire_buffers` have multiple matches in `cycle_hire_osm`.
To aggregate the values for the overlapping points and return the mean, we can use the aggregation methods shown in @sec-vector-attribute-aggregation, resulting in an object with the same number of rows as the target.
We also go back from buffers to points using `.centroid` method.
```{python}
cycle_hire_buffers = cycle_hire_buffers[['id', 'capacity', 'geometry']] \
.dissolve(by='id', aggfunc='mean') \
.reset_index()
cycle_hire_buffers.geometry = cycle_hire_buffers.centroid
cycle_hire_buffers
```
The capacity of nearby stations can be verified by comparing a plot of the capacity of the source `cycle_hire_osm` data, with the join results in the new object `cycle_hire_buffers` (@fig-cycle-hire-z).
```{python}
#| label: fig-cycle-hire-z
#| fig-cap: Non-overlapping join
#| fig-subcap:
#| - Input (`cycle_hire_osm`)
#| - Join result (`cycle_hire_buffers`)
#| layout-ncol: 2
# Input
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
cycle_hire_osm.plot(column='capacity', legend=True, ax=ax);
# Join result
fig, ax = plt.subplots(1, 1, figsize=(6, 3))
cycle_hire_buffers.plot(column='capacity', legend=True, ax=ax);
```
### Spatial aggregation {#sec-vector-spatial-aggregation}
As with attribute data aggregation, spatial data aggregation condenses data: aggregated outputs have fewer rows than non-aggregated inputs.
Statistical aggregating functions, such as mean, average, or sum, summarize multiple values of a variable, and return a single value per grouping variable.
@sec-vector-attribute-aggregation demonstrated how the `.groupby` method, combined with summary functions such as `.sum`, condense data based on attribute variables.
This section shows how grouping by spatial objects can be achieved using spatial joins combined with non-spatial aggregation.
Returning to the example of New Zealand, imagine you want to find out the average height of `nz_height` points in each region.
It is the geometry of the source (`nz`) that defines how values in the target object (`nz_height`) are grouped.
This can be done in three steps:
1. Figuring out which `nz` region each `nz_height` point falls in---using `gpd.sjoin`
2. Summarizing the average elevation per region---using `.groupby` and `.mean`
3. Joining the result back to `nz`---using `pd.merge`
First, we 'attach' the region classification of each point, using spatial join (@sec-spatial-joining).
Note that we are using the minimal set of columns required: the geometries (for the spatial join to work), the point elevation (to later calculate an average), and the region name (to use as key when joining the results back to `nz`).
The result tells us which `nz` region each elevation point falls in.
```{python}
nz_height2 = gpd.sjoin(
nz_height[['elevation', 'geometry']],
nz[['Name', 'geometry']],
how='left'
)
nz_height2
```
Second, we calculate the average elevation, using ordinary (non-spatial) aggregation (@sec-vector-attribute-aggregation).
This result tells us the average elevation of all `nz_height` points located within each `nz` region.
```{python}
nz_height2 = nz_height2.groupby('Name')[['elevation']].mean()
nz_height2
```
The third and final step is joining the averages back to the `nz` layer.
```{python}
nz2 = pd.merge(nz[['Name', 'geometry']], nz_height2, on='Name', how='left')
nz2
```
We now have create the `nz_height4` layer, which gives the average `nz_height` elevation value per polygon.
The result is shown in @fig-nz-avg-nz-height.
Note that the `missing_kwds` part determines the style of geometries where the symbology attribute (`elevation`) is missing, because there were no `nz_height` points overlapping with them.
The default is to omit them, which is usually not what we want, but with `{'color':'none','edgecolor':'black'}`, those polygons are shown with black outline and no fill.
```{python}
#| label: fig-nz-avg-nz-height
#| fig-cap: Average height of the top 101 high points across the regions of New Zealand
nz2.plot(
column='elevation',
legend=True,
cmap='Blues', edgecolor='black',
missing_kwds={'color': 'none', 'edgecolor': 'black'}
);
```
### Joining incongruent layers {#sec-joining-incongruent-layers}
Spatial congruence is an important concept related to spatial aggregation.
An aggregating object (which we will refer to as `y`) is congruent with the target object (`x`) if the two objects have shared borders.
Often this is the case for administrative boundary data, whereby larger units---such as Middle Layer Super Output Areas (MSOAs) in the UK, or districts in many other European countries---are composed of many smaller units.
Incongruent aggregating objects, by contrast, do not share common borders with the target [@qiu_development_2012].
This is problematic for spatial aggregation (and other spatial operations) illustrated in @fig-nz-and-grid: aggregating the centroid of each sub-zone will not return accurate results.
Areal interpolation overcomes this issue by transferring values from one set of areal units to another, using a range of algorithms including simple area weighted approaches and more sophisticated approaches such as 'pycnophylactic' methods [@tobler_smooth_1979].
To demonstrate joining incongruent layers, we will create a "synthetic" layer comprising a [regular grid](https://gis.stackexchange.com/questions/322589/rasterizing-polygon-grid-in-python-geopandas-rasterio) of rectangles of size $100\times100$ $km$, covering the extent of the `nz` layer.
This recipe can be used to create a regular grid covering any given layer (other than `nz`), at the specified resolution (`res`).
Most of the functions have been explained in previous chapters; we leave it as an exerise for the reader to explore how the code works.
```{python}
# Settings: grid extent, resolution, and CRS
bounds = nz.total_bounds
crs = nz.crs
res = 100000
# Calculating grid dimensions
xmin, ymin, xmax, ymax = bounds
cols = list(range(int(np.floor(xmin)), int(np.ceil(xmax+res)), res))
rows = list(range(int(np.floor(ymin)), int(np.ceil(ymax+res)), res))
rows.reverse()
# For each cell, create 'shapely' polygon (rectangle)
polygons = []
for x in cols:
for y in rows:
polygons.append(
shapely.Polygon([(x,y), (x+res, y), (x+res, y-res), (x, y-res)])
)
# To 'GeoDataFrame'
grid = gpd.GeoDataFrame({'geometry': polygons}, crs=crs)
# Remove rows/columns beyond the extent
sel = grid.intersects(shapely.box(*bounds))
grid = grid[sel]
# Add consecultive IDs
grid['id'] = grid.index
grid
```
@fig-nz-and-grid shows the newly created `grid` layer, along with the `nz` layer.
```{python}
#| label: fig-nz-and-grid
#| fig-cap: The `nz` layer, with population size in each region, overlaid with a regular `grid` of rectangles
base = grid.plot(color='none', edgecolor='grey')
nz.plot(
ax=base,
column='Population',
edgecolor='black',
legend=True,
cmap='viridis_r'
);
```
Our goal, now, is to "transfer" the `'Population'` attribute (@fig-nz-and-grid) to the rectangular grid polygons, which is an example of a join between incongruent layers.
To do that, we basically need to calculate--for each `grid` cell---the weighted sum of the population in `nz` polygons coinciding with that cell.
The weights in the weighted sum calculation are the ratios between the area of the coinciding "part" out of the entire `nz` polygon.
That is, we (inevitably) assume that the population in each `nz` polygon is equally distributed across space, therefore a partial `nz` polygon contains the respective partial population size.
We start by calculating the entire area of each `nz` polygon, as follows, using the `.area` method (@sec-area-length).
```{python}
nz['area'] = nz.area
nz
```
Next, we use the [`.overlay`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.overlay.html) method to calculate the pairwise intersections between `nz` and `grid`.
As a result, we now have a layer where each `nz` polygon is "split" according to the `grid` polygons, hereby named `nz_grid`.
```{python}
nz_grid = nz.overlay(grid)
nz_grid = nz_grid[['id', 'area', 'Population', 'geometry']]
nz_grid
```
@fig-nz-and-grid-overlay illustrates the effect of `.overlay`:
```{python}
#| label: fig-nz-and-grid-overlay
#| fig-cap: The pairwise intersections of `nz` and `grid`, calculated with `.overlay`
nz_grid.plot(color='none', edgecolor='black');
```
We also need to calculate the areas of the intersections, here into a new attribute `'area_sub'`.
If an `nz` polygon was completely within a single `grid` polygon, then `area_sub` is going to be equal to `area`; otherwise, it is going to be smaller.
```{python}
nz_grid['area_sub'] = nz_grid.area
nz_grid
```
The resulting layer `nz_grid`, which the `area_sub` attribute, is shown in @fig-nz-and-grid2.
```{python}
#| label: fig-nz-and-grid2
#| fig-cap: The areas of pairwise intersections in the `nz_grid` layer
base = grid.plot(color='none', edgecolor='grey')
nz_grid.plot(ax=base, column='area_sub', edgecolor='black',
legend=True, cmap='viridis_r');
```
Note that each of the "intersections" still holds the `Population` attribute of its "origin" feature of `nz`, i.e., each portion of the `nz` area is associated with the original complete population count for that area.
The real population size of each `nz_grid` feature, however, is smaller, or equal, depending on the geographic area proportion that it occupies out of the original `nz` feature.
To make the "correction", we first calculate the ratio (`area_prop`) and then multiply it by the population.
The new (lowercase) attribute `population` now has the correct estimate of population sizes in `nz_grid`:
```{python}
nz_grid['area_prop'] = nz_grid['area_sub'] / nz_grid['area']
nz_grid['population'] = nz_grid['Population'] * nz_grid['area_prop']
nz_grid
```
What is left to be done is to sum (see @sec-vector-attribute-aggregation) the population in all parts forming the same grid cell and join (see @sec-vector-attribute-joining) them back to the `grid` layer.
Note that many of the grid cells have "No Data" for population, because they have no intersection with `nz` at all (@fig-nz-and-grid).
```{python}
nz_grid = nz_grid.groupby('id')['population'].sum().reset_index()
grid = pd.merge(grid, nz_grid[['id', 'population']], on='id', how='left')
grid
```
@fig-nz-and-grid3 shows the final result `grid` with the incongruently-joined `population` attribute from `nz`.
```{python}
#| label: fig-nz-and-grid3
#| fig-cap: 'The `nz` layer and a regular grid of rectangles: final result'
base = grid.plot(column='population', edgecolor='black',
legend=True, cmap='viridis_r');
nz.plot(ax=base, color='none', edgecolor='grey', legend=True);
```
We can demonstrate that, expectedly, the summed population in `nz` and `grid` is identical, even though the geometry is different (since we created `grid` to completely cover `nz`), by comparing the `.sum` of the `population` attribute in both layers.
```{python}
nz['Population'].sum()
```
```{python}
grid['population'].sum()
```
The procedure in this section is known as an area-weighted interpolation of a spatially *extensive* (e.g., population) variable.
In extensive interpolation, we assume that the variable of interest represents counts (such as, here, inhabitants) uniformly distributed across space.
In such case, each "part" of a given polygon captures the respective proportion of counts (such as, half of a region with $N$ inhabitants contains $N/2$ ihnabitants).
Accordingly, summing the parts gives the total count of the total area.
An area-weighted interpolation of a spatially *intensive* variable (e.g., population density) is almost identical, except that we would have to calculate the weighted `.mean` rather than `.sum`, to preserve the average rather than the sum.
In intensive interpolation, we assume that the variable of interest represents counts per unit area, i.e., density.
Since density is (assumed to be) uniform, any "part" of a given polygon has exactly the same density as that of the whole polygon.
Density values are therefore computed as weighted averages, rather than sums, of the parts.
Also see the "Area-weighted interpolation" [section](https://r-spatial.org/book/05-Attributes.html#sec-area-weighted) in @pebesma_spatial_2023.
### Distance relations {#sec-distance-relations}
While topological relations are binary---a feature either intersects with another or does not---distance relations are continuous.
The distance between two objects is calculated with the [`.distance`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html) method.
The method is applied on a `GeoSeries` (or a `GeoDataFrame`), with the argument being an individual `shapely` geometry.
The result is a `Series` of pairwise distances.
::: callout-note
**geopandas** uses similar syntax and mode of operation for many of its methods and functions, including:
* Numeric calculations, such as `.distance` (this section), returning numeric values
* Topological evaluations methods, such as `.intersects` or `.disjoint` (@sec-topological-relations), returning boolean values
* Geometry generating-methods, such as `.intersection` (@sec-clipping), returning geometries
In all cases, the input is a `GeoSeries` and (or a `GeoDataFrame`) and a `shapely` geometry, and the output is a `Series` or `GeoSeries` of results, contrasting each geometry from the `GeoSeries` with the `shapely` geometry.
The examples in this book demonstrate this, so called "many-to-one", mode of the functions.
All of the above-mentioned methods also have a pairwise mode, perhaps less useful and not used in the book, where we evaluate relations between pairs of geometries in two `GeoSeries`, aligned either by index or by position.
:::
To illustrate the `.distance` method, let's take the three highest point in New Zealand with `.sort_values` and `.iloc`.
```{python}
nz_highest = nz_height \
.sort_values(by='elevation', ascending=False) \
.iloc[:3, :]
nz_highest
```
Additionally, we need the geographic centroid of the Canterbury region (`canterbury`, created in @sec-spatial-subsetting-vector).
```{python}
canterbury_centroid = canterbury.centroid.iloc[0]
```
Now we are able to apply `.distance` to calculate the distances from each of the three elevation points to the centroid of the Canterbury region.
```{python}
nz_highest.distance(canterbury_centroid)
```
To obtain a distance matrix, i.e., a pairwise set of distances between all combinations of features in objects `x` and `y`, we need to use the method (analogous to the way we created the `.intersects` boolean matrix in @sec-topological-relations).
To illustrate this, let's now take two regions in `nz`, Otago and Canterbury, represented by the object `co`.
```{python}
sel = nz['Name'].str.contains('Canter|Otag')
co = nz[sel]
co
```
The distance matrix (technically speaking, a `DataFrame`) `d` between each of the first three elevation points, and the two regions, is then obtained as follows.
In plain language, we take the geometry from each each row in `nz_height.iloc[:3,:]`, and apply the `.distance` method on `co` with its rows as the argument.
```{python}
d = nz_height.iloc[:3, :].apply(lambda x: co.distance(x.geometry), axis=1)
d
```
Note that the distance between the second and third features in `nz_height` and the second feature in `co` is zero.
This demonstrates the fact that distances between points and polygons refer to the distance to any part of the polygon: the second and third points in `nz_height` are in Otago, which can be verified by plotting them (two almost completly overlappling points in @fig-nz-height-and-otago).
```{python}
#| label: fig-nz-height-and-otago
#| fig-cap: The first three `nz_height` points, and the Otago and Canterbury regions from `nz`
base = co.plot(color='lightgrey', edgecolor='black')
nz_height.iloc[:3, :].plot(ax=base, color='none', edgecolor='black');
```
## Spatial operations on raster data {#sec-spatial-ras}
This section builds on @sec-manipulating-raster-objects, which highlights various basic methods for manipulating raster datasets, to demonstrate more advanced and explicitly spatial raster operations, and uses the `elev.tif` and `grain.tif` rasters manually created in @sec-raster-from-scratch.
### Spatial subsetting {#sec-spatial-subsetting-raster}
The previous chapter (and especially @sec-manipulating-raster-objects) demonstrated how to retrieve values associated with specific row and column combinations from a raster.
Raster values can also be extracted by location (coordinates) and other spatial objects.
To use coordinates for subsetting, we can use the [`.sample`](https://rasterio.readthedocs.io/en/stable/api/rasterio.io.html#rasterio.io.DatasetReader.sample) method of a `rasterio` file connection object, combined with a list of coordinate tuples.
The methods is demonstrated below to find the value of the cell that covers a point located at coordinates of `(0.1,0.1)` in `elev`.
The returned object is a *generator*.
The rationale for returning a generator, rather than a `list`, is memory efficiency.
The number of sampled points may be huge, in which case we would want to "generate" the values one at a time rather than all at once.
```{python}
src_elev.sample([(0.1, 0.1)])
```
::: callout-note
The technical terms *iterable*, *iterator*, and *generator* in Python may be confusing, so here is a short summary, ordered from most general to most specific:
* An *iterable* is any object that we can iterate on, such as using a `for` loop. For example, a `list` is iterable.
* An *iterator* is an object that represents a stream of data, which we can go over, each time getting the next element using `next`. Iterators are also iterable, meaning that you can over them in a loop, but they are stateful (e.g., they remember which item was obtained using `next`), meaning that you can go over them just once.
* A *generator* is a function that returns an iterator. For example, the `.sample` method in the above example is a generator. The **rasterio** package makes use of generators in some of its functions, as we will see later on (@sec-raster-to-polygons).
:::
In case we nevertheless want all values at once, such as when the number of points is small, we can force the generatrion of all values from a generator at once, using `list`.
Since there was just one point, the result is one extracted value, in this case `16`.
```{python}
list(src_elev.sample([(0.1, 0.1)]))
```
We can use the same technique to extract the values of multiple points at once.
For example, here we extract the raster values at two points, `(0.1,0.1)` and `(1.1,1.1)`.
The resulting values are `16` and `6`.
```{python}
list(src_elev.sample([(0.1, 0.1), (1.1, 1.1)]))
```
The location of the two sample points on top of the `elev.tif` raster is illustrated in @fig-elev-sample-points.
```{python}
#| label: fig-elev-sample-points
#| fig-cap: The `elev.tif` raster, and two points where we extract its values
fig, ax = plt.subplots()
rasterio.plot.show(src_elev, ax=ax)
gpd.GeoSeries([shapely.Point(0.1, 0.1)]).plot(color='black', ax=ax)
gpd.GeoSeries([shapely.Point(1.1, 1.1)]).plot(color='black', ax=ax);
```
::: callout-note
We elaborate on the plotting technique used to display the points and the raster in @sec-plot-static-layers.
We will also introduce a more user-friendly and general method to extract raster values to points, using the **rasterstats** package, in @sec-extraction-to-points.
:::
Another common use case of spatial subsetting is using a boolean mask, based on another raster with the same extent and resolution, or the original one, as illustrated in @fig-raster-subset.
To do that, we "erase" the values in the array of one raster, according to another corresponding "mask" raster.
For example, let us read (@sec-using-rasterio) the `elev.tif` raster values into an array named `elev` (@fig-raster-subset (a)).
```{python}
elev = src_elev.read(1)
elev
```
and create a corresponding random boolean mask named `mask` (@fig-raster-subset (b)), of the same shape as `elev.tif` with values randomly assigned to `True` and `False`.
```{python}
np.random.seed(1)
mask = np.random.choice([True, False], src_elev.shape)
mask
```
Next, suppose that we want to keep only those values of `elev` which are `False` in `mask` (i.e., they are *not* masked).
In other words, we want to mask `elev` with `mask`.
The result will be stored in a copy named `masked_elev` (@fig-raster-subset (c)).
In the case of `elev.tif`, to be able to store `np.nan` in the array of values, we also need to convert it to `float` (see @sec-summarizing-raster-objects).
Afterwards, masking is a matter of assigning `np.nan` into a subset defined by the mask, using the ["boolean array indexing"](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing) syntax of **numpy**.
```{python}
masked_elev = elev.copy()
masked_elev = masked_elev.astype('float64')
masked_elev[mask] = np.nan
masked_elev
```
@fig-raster-subset shows the original `elev` raster, the `mask` raster, and the resulting `masked_elev` raster.
```{python}
#| label: fig-raster-subset
#| fig-cap: Subsetting raster values using a boolean mask
#| layout-ncol: 3
#| fig-subcap:
#| - Original raster
#| - Raster mask
#| - Output masked raster
rasterio.plot.show(elev);
rasterio.plot.show(mask);
rasterio.plot.show(masked_elev);
```
The "mask" can be create from the array itself, using condition(s).
That way, we can replace some values (e.g., values assumed to be wrong) with `np.nan`, such as in the following example.
```{python}
elev2 = elev.copy()
elev2 = elev2.astype('float64')
elev2[elev2 < 20] = np.nan
elev2
```
This technique is also used to reclassify raster values (see @sec-raster-local-operations).
### Map algebra {#sec-map-algebra}
The term 'map algebra' was coined in the late 1970s to describe a "set of conventions, capabilities, and techniques" for the analysis of geographic raster and (although less prominently) vector data [@tomlin_map_1994].
In this context, we define map algebra more narrowly, as operations that modify or summarize raster cell values, with reference to surrounding cells, zones, or statistical functions that apply to every cell.
Map algebra operations tend to be fast, because raster datasets only implicitly store coordinates, hence the old adage "raster is faster but vector is corrector".
The location of cells in raster datasets can be calculated by using its matrix position and the resolution and origin of the dataset (stored in the raster metadata, @sec-using-rasterio).
For the processing, however, the geographic position of a cell is barely relevant as long as we make sure that the cell position is still the same after the processing.
Additionally, if two or more raster datasets share the same extent, projection and resolution, one could treat them as matrices for the processing.
Map algebra (or cartographic modeling with raster data) divides raster operations into four subclasses [@tomlin_geographic_1990], with each working on one or several grids simultaneously:
- Local or per-cell operations (@sec-raster-local-operations)
- Focal or neighborhood operations. Most often the output cell value is the result of a $3 \times 3$ input cell block (@sec-focal-operations)
- Zonal operations are similar to focal operations, but the surrounding pixel grid on which new values are computed can have irregular sizes and shapes (@sec-zonal-operations)
- Global or per-raster operations; that means the output cell derives its value potentially from one or several entire rasters (@sec-global-operations-and-distances)
This typology classifies map algebra operations by the number of cells used for each pixel processing step and the type of the output.
For the sake of completeness, we should mention that raster operations can also be classified by discipline such as terrain, hydrological analysis, or image classification.
The following sections explain how each type of map algebra operations can be used, with reference to worked examples.
### Local operations {#sec-raster-local-operations}
Local operations comprise all cell-by-cell operations in one or several layers.
Raster algebra is a classical use case of local operations---this includes adding or subtracting values from a raster, squaring and multiplying rasters.
Raster algebra also allows logical operations such as finding all raster cells that are greater than a specific value (e.g., `5` in our example below).
Local operations are applied using the **numpy** array operations syntax, as demonstrated below.
First, let's take the array of `elev.tif` raster values, which we already read earlier (@sec-spatial-subsetting-raster).
```{python}
elev
```
Now, any element-wise array operation can be applied using **numpy** arithmetic or conditional operators and functions, comprising local raster operations in spatial analysis terminology.
For example `elev + elev` adds the values of `elev` to itself, resulting in a raster with double values.
```{python}
elev + elev
```
Note that some functions and operators automatically change the data type to accommodate the resulting values, while other operators do not, potentially resulting in overflow (i.e., incorrect values for results beyond the data type range, such as trying to accomodate values above `255` in an `int8` array).
For example, `elev**2` (`elev` squared) results in overflow.
Since the `**` operator does not automatically change the data type, leaving it as `int8`, the resulting array has incorrect values for `16**2`, `17**2`, etc., which are above `255` and therefore cannot be accomodated.
```{python}
elev**2
```
To avoid this situation, we can, for instance, transform `elev` to the standard `int64` data type, using `.astype` before applying the `**` operator.
That way all, results up to `36**2` (`1296`) can be easily accomodated, since the `int64` data type supports values up to `9223372036854775807` (@tbl-numpy-data-types).
```{python}
elev.astype(int)**2
```
Now we get correct results.
::: callout-note
**numpy** has the special data types `np.int_` and `np.float_`, which refer to "default" `int` and `float` data types. These are platform dependent, but typically resolve to `np.int64` and `np.float64`. Furthermore, the standard Python types `int` and `float` refer to those two **numpy** types, respectively. Therefore, for example, either of the three objects `np.int64`, `np.int_` and `int` can be passed to `.astype` in the above example, with identical result. Whereas we've used the shortest one, `int`.
:::
@fig-raster-local-operations demonstrates the result of the last two examples (`elev+elev` and `elev.astype(int)**2`), and two other ones (`np.log(elev)` and `elev>5`).
```{python}
#| label: fig-raster-local-operations
#| fig-cap: 'Examples of different local operations of the elev raster object: adding two rasters, squaring, applying logarithmic transformation, and performing a logical operation.'
#| layout-ncol: 4
#| fig-subcap:
#| - '`elev+elev`'
#| - '`elev.astype(int)**2`'
#| - '`np.log(elev)`'
#| - '`elev>5`'
rasterio.plot.show(elev + elev, cmap='Oranges');
rasterio.plot.show(elev.astype(int)**2, cmap='Oranges');
rasterio.plot.show(np.log(elev), cmap='Oranges');
rasterio.plot.show(elev > 5, cmap='Oranges');
```
Another good example of local operations is the classification of intervals of numeric values into groups such as grouping a digital elevation model into low (class `1`), middle (class `2`) and high elevations (class `3`).
Here, we assign the raster values in the ranges `0`--`12`, `12`--`24` and `24`--`36` are reclassified to take values `1`, `2` and `3`, respectively.
```{python}
recl = elev.copy()
recl[(elev > 0) & (elev <= 12)] = 1
recl[(elev > 12) & (elev <= 24)] = 2
recl[(elev > 24) & (elev <= 36)] = 3
```
@fig-raster-reclassify compares the original `elev` raster with the reclassified `recl` one.
```{python}
#| label: fig-raster-reclassify
#| fig-cap: Reclassifying a continuous raster into three categories.
#| layout-ncol: 2
#| fig-subcap:
#| - Original
#| - Reclassified
rasterio.plot.show(elev, cmap='Oranges');
rasterio.plot.show(recl, cmap='Oranges');
```
The calculation of the [Normalized Difference Vegetation Index (NDVI)](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index) is a well-known local (pixel-by-pixel) raster operation.
It returns a raster with values between `-1` and `1`; positive values indicate the presence of living plants (mostly \> `0.2`).
NDVI is calculated from red and near-infrared (NIR) bands of remotely sensed imagery, typically from satellite systems such as Landsat or Sentinel 2.
Vegetation absorbs light heavily in the visible light spectrum, and especially in the red channel, while reflecting NIR light, which is emulated in the NVDI formula (@eq-ndvi).
$$
NDVI=\frac{NIR-Red} {NIR+Red}
$$ {#eq-ndvi}