-
-
Notifications
You must be signed in to change notification settings - Fork 16
/
01-geodata.qmd
616 lines (459 loc) · 18.3 KB
/
01-geodata.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
# What is geospatial data?
```{julia}
#| echo: false
#| output: false
import Pkg
Pkg.activate(".")
```
Welcome to **Part I** of the book. Before we can start our journey in
geospatial data science, we need to introduce important concepts, which
will be the foundations of **Part II**, **Part III** and **Part IV**.
In this chapter, we define **geospatial data** and introduce a
**universal representation** for it which is ideal for geostatistical
analysis. Unlike other representations in the literature (e.g., "raster",
"vector" data), the proposed representation is suitable for encoding
geospatial data over 3D unstructured meshes, 2D images embedded in 3D space,
and other types of complex **geospatial domains**.
## Definition
::: {.callout-tip}
## Definition
**(Discrete) geospatial data** is the combination of a
**table** of attributes (or features) with a discretization of a
geospatial **domain**. Each row (or measurement) in the **table**
corresponds to an element (or geometry) in the discretization of
the geospatial **domain**.
:::
The definition depends on two other definitions that we clarify next.
### Table
In data science the most natural data structure for working with data is the
[table](https://en.wikipedia.org/wiki/Table_(database)).
Generally speaking, a table is any object that can be structured into
rows containing measurements and columns representing variables.
For example, @tbl-example has 5 measurements of 4 variables:
| NAME | AGE | HEIGHT | GENDER |
|:----:|:----:|:------:|:------:|
| John | 34 | 1.78m | male |
| Mary | 12 | 1.56m | female |
| Paul | 23 | 1.70m | male |
| Anne | 39 | 1.80m | female |
| Kate | 28 | 1.72m | female |
: Example of table {#tbl-example .striped .hover}
In Julia, the concept of table is formalized in
[Tables.jl](https://github.com/JuliaData/Tables.jl) by
@Quinn2023. The definition is independent of the machine
representation, and various representations can co-exist
in the language.
### Domain
The second definition that we need is that of a geospatial **domain**.
In geosciences, questions are often formulated within a physical
region of interest. This physical region can cover a small area
of the surface of the Earth, the entire Earth surface, or any
region of finite measure that can be discretized into smaller
geometries (a.k.a. elements):
::: {#fig-domains layout-ncol=2}
![Coast line in Islay Province, Peru. View on [Google Maps](https://www.google.com/maps/@-17.279111,-71.49577,2041m/data=!3m1!1e3?entry=ttu)](http://www.gstatic.com/prettyearth/assets/full/1096.jpg){#fig-islay-province}
![Synthetic carbonate reservoir model by @Correia2015. See [UNISIM-II](https://www.unisim.cepetro.unicamp.br/benchmarks/en/unisim-ii/overview) for more details](images/unisimII.png){#fig-unisimII}
Example of geospatial domains
:::
@fig-domains illustrates two very different examples of geospatial
domains. The domain in @fig-islay-province is widely studied
in GIS books. It is a 2D domain that contemplates a small area near
Islay Province, Peru. The domain in @fig-unisimII on the other hand
is **not** considered in traditional GIS literature. It is a 3D domain
that has been discretized into hexahedron geometries.
The concept of geospatial domain is formalized in
[Meshes.jl](https://github.com/JuliaGeometry/Meshes.jl) by
@Hoffimann2021.
### Remarks
- Images like the one depicted in @fig-islay-province are often
implemented in terms of the
[array](https://en.wikipedia.org/wiki/Array_(data_structure))
data structure. GIS books call it "raster data", but we will
avoid this term in our framework in order to obtain a more
general set of tools.
- According to our definition of geospatial data, "raster data"
is simply a table with colors as variables (e.g., RGB values)
combined with a grid of quadrangle geometries. We illustrate
this concept in @fig-raster by zooming in a satellite image
of the Lena delta:
::: {#fig-raster layout-ncol=2}
!["Raster data" of [Lena delta](https://en.wikipedia.org/wiki/Lena_Delta_Wildlife_Reserve)](images/lena.png){#fig-lena}
![Zoom reveals quadrangle geometries](images/lena-zoom.png){#fig-lena-zoom}
Quadrangle geometries in "raster data"
:::
- There are no constraints on the geometries used in the
discretization of the geospatial domain. In @fig-brazil,
Brazil is discretized into complex polygonal geometries
that represent country states:
![Brazil's states represented with complex polygonal geometries. View on [Google Maps](https://www.google.com/maps/place/Brasil/@-14.2009197,-62.5722989,4z).](images/brazil.png){#fig-brazil width=50%}
- GIS books call it "vector data" because the geometries are
stored as vectors of coordinates in memory. We will also avoid
this term in our framework given that it only highlights an
implementation detail.
Before we start discussing machine representation with actual Julia
code, let's make a final (pedantic) distinction between the words
*geospatial* and *spatial*. These words mean different things
in different communities:
- In geosciences, an object is *geospatial* if it lives in
*physical space*.
- In statistics, a model is *spatial* if it exploits the
vicinity of samples in the *sample space*.
Given that geospatial data science deals with both concepts,
we must use these words carefully.
::: {.callout-note}
In [Geostatistical Learning](https://www.frontiersin.org/articles/10.3389/fams.2021.689393/full),
models can exploit both spaces to improve prediction performance,
but that is out of the scope for this book.
:::
## Representation
Based on the definition of geospatial data given in the previous
section, we are now ready to proceed and discuss an efficient
machine representation for it with actual Julia code.
### Table
The Julia language comes with two built-in table representations:
1. Named tuple of vectors
2. Vector of named tuples
The first representation focuses on the columns of the table:
```{julia}
coltable = (
NAME=["John", "Mary", "Paul", "Anne", "Kate"],
AGE=[34, 12, 23, 39, 28],
HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
GENDER=["male", "female", "male", "female", "female"]
)
```
Given that data science is often performed with entire columns,
this column-major representation of a table is very convenient.
The second representation focuses on the rows of the table:
```{julia}
rowtable = [
(NAME="John", AGE=34, HEIGHT=1.78, GENDER="male"),
(NAME="Mary", AGE=12, HEIGHT=1.56, GENDER="female"),
(NAME="Paul", AGE=23, HEIGHT=1.70, GENDER="male"),
(NAME="Anne", AGE=39, HEIGHT=1.80, GENDER="female"),
(NAME="Kate", AGE=28, HEIGHT=1.72, GENDER="female")
]
```
The row-major representation can be useful to process data that
is potentially larger than the available computer memory, or
infinite streams of data.
Although these two representations come built-in with Julia, they
lack basic functionality for data science. The most widely used
table representation for data science in Julia is available in
[DataFrames.jl](https://github.com/JuliaData/DataFrames.jl)
by @Bogumil2023.
```{julia}
#| output: false
using DataFrames
df = DataFrame(
NAME=["John", "Mary", "Paul", "Anne", "Kate"],
AGE=[34, 12, 23, 39, 28],
HEIGHT=[1.78, 1.56, 1.70, 1.80, 1.72],
GENDER=["male", "female", "male", "female", "female"]
)
```
```{julia}
#| echo: false
show(stdout, MIME("text/html"), df; header_alignment = :s)
```
This representation provides additional syntax for accessing
rows and columns of the table:
```{julia}
#| output: false
df[1,:]
```
```{julia}
#| echo: false
show(stdout, MIME("text/html"), df[1,:]; header_alignment = :s)
```
```{julia}
df[:,"NAME"]
```
```{julia}
#| output: false
df[1:3,["NAME","AGE"]]
```
```{julia}
#| echo: false
show(stdout, MIME("text/html"), df[1:3,["NAME","AGE"]]; header_alignment = :s)
```
```{julia}
df.HEIGHT
```
```{julia}
df."HEIGHT"
```
::: {.callout-note}
Unlike other languages, Julia makes a distinction between
the the symbol `:HEIGHT` and the string `"HEIGHT"`. The
`DataFrame` representation supports both types for
column names, but that is not always the case with
other table representations.
:::
Other popular table representations in Julia are associated
with specific file formats:
- `CSV.File` from [CSV.jl](https://github.com/JuliaData/CSV.jl) [@Quinn2023]
- `XLSX.Worksheet` from [XLSX.jl](https://github.com/felipenoris/XLSX.jl)
- Databases from [JuliaDatabases](https://github.com/JuliaDatabases)
The choice of table representation is a function of the application.
### Domain
All available domain representations come from the
[Meshes.jl](https://github.com/JuliaGeometry/Meshes.jl)
module.
Let's start with a simple list of disconnected geometries:
```{julia}
using GeoStats
p = Point(1, 2)
s = Segment((0, 2), (1, 3))
t = Triangle((0, 0), (1, 0), (1, 1))
b = Ball((2, 2), 1)
geoms = [p, s, t, b]
```
Because these geometries are unaware of each other, we place
them into a `GeometrySet`, informally known in computational
geometry as the "soup of geometries" data structure:
```{julia}
gset = GeometrySet(geoms)
```
No advanced knowledge is required to start working
with these geometries. For example, we can compute the
`length` of the `Segment`, the `area` of the `Triangle`
and the `area` of the `Ball` with:
```{julia}
length(s), area(t), area(b)
```
More generally, we can compute the `measure` of the
geometries in the domain:
```{julia}
[measure(g) for g in gset]
```
In the example above, we iterated over the domain to apply the
function of interest, but we could have used Julia's dot syntax
for broadcasting the function over the geometries:
```{julia}
measure.(gset)
```
The list of supported geometries is very comprehensive. It
encompasses all geometries from the
[simple features](https://www.iso.org/standard/40114.html)
standard and more. We will see more examples in the following
chapters.
One of the main limitations of GIS software today is the
lack of explicit representation of [topology](https://en.wikipedia.org/wiki/Topology).
A `GeometrySet` does not provide efficient topological relations
[@Floriani2007], yet advanced geospatial data science requires
the definition of geospatial domains where geometries are aware
of their neighbors. Let's illustrate this concept with the
`CartesianGrid` domain:
```{julia}
grid = CartesianGrid(10, 10)
```
We can access the individual geometries of the domain as before:
```{julia}
grid[1]
```
And even though we can manipulate this domain as if it was a
"soup of geometries", the major advantage in this abstraction
is the underlying `topology`:
```{julia}
topo = topology(grid)
```
This data structure can be used by advanced users who wish
to design algorithms with neighborhood information. We will
cover this topic in a separate chapter. For now, keep
in mind that working with the entire domain as opposed to
with a vector or "soup of geometries" has major benefits.
::: {.callout-note}
The `CartesianGrid` domain is *lazy*, meaning it only stores the
start and end points of the grid together with the spacing
between the elements. Therefore, we can easily create large
3D grids of `Hexahedron` geometries without consuming all
available memory:
```{julia}
grid = CartesianGrid(10000, 10000, 10000)
```
```{julia}
grid[1]
```
:::
In computational geometry, a `CartesianGrid` is a specific
type of [mesh](https://en.wikipedia.org/wiki/Polygon_mesh).
It can only represent "flat" domains sampled regularly along
each dimension (e.g., images). To represent domains with
curvature such as terrain elevation models or complex 3D
domains like the one in @fig-unisimII, we can use the
`SimpleMesh` domain:
```{julia}
# global vector of 2D points
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.25, 0.5), (0.75, 0.5)]
# connect the points into N-gons
connec = connect.([(1, 2, 6, 5), (2, 4, 6), (4, 3, 5, 6), (3, 1, 5)])
# 2D mesh made of N-gon elements
mesh = SimpleMesh(points, connec)
```
The `connect` function takes a tuple of indices and a geometry type,
and produces a connectivity object. The geometry type can be omitted,
in which case it is assumed to be a `Ngon`, i.e., a polygon with `N`
sides:
```{julia}
c = connect((1, 2, 3))
```
This connectivity object can be materialized into an actual geometry
with a vector of points:
```{julia}
materialize(c, [Point(0, 0), Point(1, 0), Point(1, 1)])
```
The `SimpleMesh` uses the `materialize` function above to construct
geometries on the fly, similar to what we have seen with the `CartesianGrid`:
```{julia}
mesh[1]
```
Don't worry if you feel overwhelmed by these concepts. We are only sharing
them here to give you an idea of how complex 3D domains are represented in
the framework. You can do geospatial data science without ever having to
operate with these concepts explicitly.
Let's make a few important remarks:
- Flexibility comes with a price. To construct a `SimpleMesh` of connected
geometries we need to explicitly create a vector of vertices, and connect
these vertices into geometries using their indices in the vector.
- Geometries in a `SimpleMesh` can be of different type. In the example,
we have both `Triangle` and `Quadrangle` geometries in the domain. This is
similar to what we had with `GeometrySet`, but now the geometries are
connected.
- `SimpleMesh` are rarely constructed by hand. They are often the result of
a sophisticated geometric processing pipeline that is already stored in a
file on disk.
The last missing piece of the puzzle is the combination of tables with domains
into geospatial data, which we discuss next.
### Data
Wouldn't it be nice if we had a representation of geospatial data that behaved like
a table as discussed in the [Tables](#tables) section, but preserved
topological information as discussed in the [Domains](#domains) section?
In the [GeoStats.jl](https://github.com/JuliaEarth/GeoStats.jl) framework, this is
precisely what we get with the `georef` function:
```{julia}
using GeoStats
df = DataFrame(
NAME=["John", "Mary", "Paul", "Anne"],
AGE=[34.0, 12.0, 23.0, 39.0]u"yr",
HEIGHT=[1.78, 1.56, 1.70, 1.80]u"m",
GENDER=["male", "female", "male", "female"]
)
grid = CartesianGrid(2, 2)
geotable = georef(df, grid)
```
::: {.callout-note}
## Tip for all users
The framework is integrated with the [Unitful.jl](https://github.com/PainterQubits/Unitful.jl)
module. To add the unit "meter" to the numeric value `1.0`, we can write `1.0u"m"`.
Similarly, we can add units to vectors of values:
```{julia}
[1.0, 2.0, 3.0]u"m"
```
It is also possible to load units explicitly to avoid the "u" prefix:
```julia
using Unitful: m, ft
1.0m + 2.0ft
```
:::
The function combines any table with any domain into a geospatial data
representation that adheres to the Tables.jl interface. We call this
representation a `GeoTable` to distinguish it from a standard table.
Besides the original columns, the `GeoTable` has a special `geometry`
column with the underlying domain:
```{julia}
names(geotable)
```
Unlike a standard table, the `GeoTable` creates geometries on the fly
depending on the data access pattern. For example, we can request the
first measurement of the `GeoTable` and it will automatically construct
the corresponding `Quadrangle`:
```{julia}
geotable[1,:]
```
If we request a subset of measurements, the `GeoTable` will avoid
unnecessary creation of geometries, and will instead return a view
into the original data:
```{julia}
geotable[1:3,["NAME","AGE"]]
```
Finally, if we request the entire `geometry` column, we get back the original domain:
```{julia}
geotable[:,"geometry"]
```
Besides the data access patterns of the `DataFrame`, the `GeoTable` also provides
an advanced method for retrieving all rows that intersect with a given geometry:
```{julia}
geotable[Segment((0, 0), (2, 0)), :]
```
This method is very useful to narrow the region of interest and quickly discard
all measurements that are outside of it. For instance, it is common to discard
all "pixels" outside of a polygon before exporting the geotable to a file on disk.
Notice that the `GeoTable` representation is general enough to accommodate both
"raster data" and "vector data" in traditional GIS. We can create very large rasters
because the `CartesianGrid` is lazy:
```{julia}
georef(
(
R=rand(1000000),
G=rand(1000000),
B=rand(1000000)
),
CartesianGrid(1000, 1000)
)
```
And can load vector geometries from files that store simple features
using the [GeoIO.jl](https://github.com/JuliaEarth/GeoIO.jl) module:
```{julia}
using GeoIO
GeoIO.load("data/countries.geojson")
```
::: {.callout-note}
The "data" folder is stored on GitHub.
Check the [Preface](preface.qmd) for download instructions.
:::
We will see more examples of "vector data" in the chapter
[Interfacing with GIS](03-geoio.qmd), and will explain why
file formats like
[Shapefile.jl](https://github.com/JuliaGeo/Shapefile.jl) and
[GeoJSON.jl](https://github.com/JuliaGeo/GeoJSON.jl) are not enough
for advanced geospatial data science.
::: {.callout-note}
## Tip for all users
The GeoStats.jl module reexports the full stack of modules for
geospatial data science in Julia. There is no need to import
modules like Meshes.jl explicitly. You are all set if you
start your script with
```julia
using GeoStats
```
:::
::: {.callout-note}
## Tip for advanced users
In Julia, a function is type-stable if the return type is known at compile
time. Since the `GeoTable` has columns from the original table (e.g., numbers)
and an additional special `geometry` column, the access to the data with the
`DataFrame` syntax is **not** type stable. If you need to write type-stable
code, use the functions `values` and `domain` instead:
```{julia}
values(geotable)
```
```{julia}
domain(geotable)
```
:::
## Remarks
What did we learn in this chapter?
- Geospatial data can be efficiently represented as a `GeoTable`.
This representation is universal, meaning it combines the
"raster" and "vector" representations found in traditional GIS.
- A `GeoTable` provides all the data access patterns of a `DataFrame`.
In particular, it supports the syntax `geotable[rows,cols]` without
expensive copies of geometries.
- Working with geospatial domains as opposed to working with vectors
of disconnected geometries has major benefits. In particular, it
preserves topological information.
It is very convenient to manipulate a `GeoTable` as if it was a `DataFrame`.
Nevertheless, we will learn that advanced geospatial data science requires
higher-level constructs to preserve geospatial information. We will cover
these constructs in **Part II** and **Part III** of the book.