added info about geoparquet
carmengg committed Dec 2, 2023
1 parent 51d9c40 commit 2b4137b
Showing 1 changed file with 139 additions and 0 deletions.
139 changes: 139 additions & 0 deletions lectures/lesson-21-contextily-parquet.qmd
@@ -0,0 +1,139 @@
---
jupyter: mpc-env-kernel
---
# Misc

In this lesson we will retrieve data from the [2020 Census from the Microsoft Planetary Computer's STAC catalog](https://planetarycomputer.microsoft.com/dataset/us-census) using **GeoParquet**.
We will also introduce the **contextily** library for adding basemaps.


## Parquet and GeoParquet

[**Apache Parquet**](https://parquet.apache.org) (or just Parquet) is an open-source, column-oriented file format for tabular data. Because values are stored column by column, it is typically faster to retrieve data from and takes less storage space than row-oriented text formats. It is a popular alternative to, for example, CSV files for storing large amounts of data.
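
To make "column-oriented" concrete, here is a minimal sketch in plain Python. It is not how Parquet is implemented (real files also use compression, encodings and row groups); the toy records and variable names are made up for illustration only.

```python
# Toy records, made up for illustration only.
rows = [
    {"id": 1, "name": "alpha", "area": 10.5},
    {"id": 2, "name": "beta", "area": 20.0},
    {"id": 3, "name": "gamma", "area": 7.25},
]

# Row-oriented layout (CSV-like): each record's values are stored together,
# so reading one field still means walking through every record.
areas_from_rows = [row["area"] for row in rows]

# Column-oriented layout (Parquet-like): each column's values are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading one column touches only that column's values.
areas_from_columns = columns["area"]

print(areas_from_columns)
```

Storing similar values next to each other is also part of why columnar files tend to compress well.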


<!-- https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705
https://www.upsolver.com/blog/apache-parquet-why-use
-->

The geospatial version of parquet for storing vector data is the [**GeoParquet**](https://geoparquet.org) data format.
This format arose from the need for an efficient, standardized way to store and query large geospatial datasets.
A beta of version 1.0.0 of the GeoParquet specification was released in December 2022.
Like STAC, it is a new and ongoing effort to create standards in the geospatial analysis community in response to the rapid increase in available geospatial data.
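
As a rough sketch of what the standard adds on top of Parquet: to our understanding of the GeoParquet 1.0.0 specification, a file carries a JSON string under the metadata key `geo` that names the geometry columns and how they are encoded. The field names below follow that specification as we read it; the values are made up, and the variable names are only for this example.

```python
import json

# Illustrative "geo" metadata for a file with one geometry column.
# Field names follow the GeoParquet 1.0.0 specification; the values are made up.
geo_metadata = {
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",  # geometries stored as well-known binary
            "geometry_types": ["Polygon", "MultiPolygon"],
        }
    },
}

# In a file, this metadata is stored as a JSON string under the key "geo".
encoded = json.dumps(geo_metadata)
decoded = json.loads(encoded)

# A reader can use it to find the main geometry column and its encoding.
primary = decoded["primary_column"]
print(primary, decoded["columns"][primary]["encoding"])
```

Libraries such as geopandas read and write this metadata for us, so we rarely need to handle it by hand.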

<!-- https://getindata.com/blog/introducing-geoparquet-data-format/
https://cholmes.medium.com/geoparquet-1-0-0-beta-1-released-6390ecb4c6d0
https://geoparquet.org
-->

For this lesson, the

## Accessing a GeoParquet file

```{python}
import geopandas
import planetary_computer
import pystac_client
import matplotlib.pyplot as plt
import contextily as ctx
```

References:

Tile gallery:
https://xyzservices.readthedocs.io/en/stable/gallery.html

Intro to contextily:
https://contextily.readthedocs.io/en/latest/intro_guide.html#

Geopandas:
https://geopandas.org/en/stable/gallery/plotting_basemap_background.html#add-background-tiles-to-plot

Troubleshooting:
https://github.com/geopandas/contextily/issues/118
https://github.com/geopandas/contextily/issues/78

```{python}
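# Open the Planetary Computer STAC catalog; the modifier signs asset URLs
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
# Search the US Census collection and index the items by ID
search = catalog.search(collections=["us-census"])
items = {item.id: item for item in search.items()}
list(items)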
```

```{python}
item = items['2020-cb_2020_us_county_500k']
item
```

```{python}
asset = item.assets["data"]
asset
```

```{python}
# Read the GeoParquet file into a GeoDataFrame, using the storage options listed on the asset
df = geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
)
df.head()
```

```{python}
# Default: OpenStreetMap HOT style
ax = (
    df[df.NAME == "Santa Barbara"]
    .to_crs(epsg=3857)  # reproject to Web Mercator to match the basemap tiles
    .plot(figsize=(7, 7), alpha=0.5, edgecolor="k")
)
ax.set_title(
    "Santa Barbara County",
    fontdict={"fontsize": 20}
)
ctx.add_basemap(ax)
ax.set_axis_off()
```
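
The `to_crs(epsg=3857)` step above reprojects the county to Web Mercator, the coordinate reference system of the web tiles that contextily adds as a basemap. As a minimal sketch of what that projection does to a single point, the function below applies the standard spherical Mercator formulas. The function name and the sample coordinates (an approximate point near Santa Barbara) are assumptions for illustration; in practice `to_crs` does this for us.

```python
import math

# Radius of the sphere used by EPSG:3857 (the WGS 84 semi-major axis), in meters.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate, illustrative coordinates near Santa Barbara (not taken from the data).
x, y = lonlat_to_web_mercator(-119.7, 34.4)
print(x, y)
```

The result is in meters rather than degrees, which is why the axis values of the plot change after reprojecting.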

```{python}
#| tags: []
# Same plot, with a different basemap provider
ax = (
    df[df.NAME == "Santa Barbara"]
    .to_crs(epsg=3857)  # reproject to Web Mercator to match the basemap tiles
    .plot(figsize=(7, 7), alpha=0.5, edgecolor="k")
)
ax.set_title(
    "Santa Barbara County",
    fontdict={"fontsize": 20}
)
ctx.add_basemap(ax, source=ctx.providers.Esri.NatGeoWorldMap)
ax.set_axis_off()
```

```{python}
# changing basemaps
# https://contextily.readthedocs.io/en/latest/providers_deepdive.html
```

```{python}
ctx.providers
```

```{python}
# # there's no Phoenix subdivision in 2020 census data
# cousub = items['2020-cb_2020_us_cousub_500k']
# # read the cousub item's own asset, not the county `asset` from above
# cousub_asset = cousub.assets["data"]
# cousub_df = geopandas.read_parquet(
#     cousub_asset.href,
#     storage_options=cousub_asset.extra_fields["table:storage_options"],
# )
# cousub_df[cousub_df['NAME']=='Phoenix']
```

