This README covers the code in this repository.
The WRF-Hydro/NWM native outputs are generally organized by output time with no spatial chunking. This format is poorly suited to timeseries analysis of the model outputs, particularly when the domain size is large (e.g. the NWM CONUS domain).
Aggregating the model output into more usable and accessible datasets has been a long-standing challenge. Particularly with large datasets, scaling the computation and memory needed to reprocess the data in a reasonable amount of time was the crux of the challenge. After multiple attempts using xarray, we discovered we were not alone in this problem. This pangeo discourse thread gave birth to the rechunker python package, which provides a general solution to this problem and uses dask and disk to manage computation and memory. Currently the only output option for the rechunker package is to write Zarr stores. However, there is little wrong with Zarr other than that it lacks interfaces to several popular languages besides python.
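As a rough sketch (the paths, variable name, chunk sizes, and memory limit below are illustrative assumptions, not the values used in this repository), a rechunker call on an xarray dataset looks something like:

```python
import xarray as xr
from rechunker import rechunk

# Open the native per-output-time files as one lazy dataset (illustrative path).
ds = xr.open_mfdataset("CHRTOUT_DOMAIN1/*.nc", combine="by_coords")

# Rechunk from "all space at one time" toward chunks that favor timeseries access.
plan = rechunk(
    ds,
    target_chunks={
        "streamflow": {"time": 672, "feature_id": 30000},  # illustrative chunks
        "time": None,        # None: copy a variable without rechunking it
        "feature_id": None,
    },
    max_mem="4GB",                          # per-worker memory budget
    target_store="chrtout_rechunked.zarr",
    temp_store="chrtout_tmp.zarr",          # rechunker spills intermediates to disk
)
plan.execute()  # runs on the active dask scheduler/cluster
```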
Rich Signell (USGS) first applied rechunker to the NWM CHRTOUT (streamflow) output, as discussed here, which points to this rechunking notebook and this analysis notebook.
Following on this success, both NOAA's Office of Water Prediction and NCAR were interested in doing something similar for the next version of the NWM retrospective run. For NWM version 2.1, the task of rechunking the full retrospective on NCAR computing resources (where the model runs were performed) is expressed in this repository. This included four other output file types, among them gridded land surface model data and terrain/subsurface routing data. Also rechunked was the input/forcing precipitation data (just one variable from the full forcing data).
This was a relatively quick project (about a month). While we made a first pass at streamlining and unifying the main code, we maintained separate codes for each of the 6 file types. This was because each had some level of idiosyncrasy that would have required unallotted time to accommodate. This unification is left to another iteration.
Overall the code is fairly robust to failure. The main issue is if a job dies during the append step to the full Zarr store. (This happens sometimes when the requested dask workers either take time to become available or are supplied in significantly smaller numbers than requested. Of course this can also happen with other kinds of failures, like power outages.) We wrote lock files to indicate if/when an append to the full Zarr store was in progress.
To recover from these failures, we used xarray's capability to write regions to a Zarr store (e.g. `Dataset.to_zarr(file, region={"time": region_update})`). This highlighted that appending to the Zarr store might not be the preferred strategy and that perhaps all the data could be written with the region approach after initializing the Zarr store (e.g. `Dataset.to_zarr(file, compute=False)`). Following this approach might also provide the advantage that, for each file type, individual variables could be processed in parallel (using separate jobs writing to different variables in the same store), which could further speed up the rechunking.
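A minimal, self-contained sketch of that region-based pattern (the store path, variable, sizes, and chunking below are illustrative, not this repository's actual values):

```python
import numpy as np
import pandas as pd
import xarray as xr

store = "example_region.zarr"  # illustrative path

# A small synthetic, dask-backed template spanning the full time axis.
full_times = pd.date_range("2000-01-01", periods=1000, freq="h")
feature_id = np.arange(100)
template = xr.Dataset(
    {
        "streamflow": (
            ("time", "feature_id"),
            np.zeros((full_times.size, feature_id.size), dtype="float32"),
        )
    },
    coords={"time": full_times, "feature_id": feature_id},
).chunk({"time": 100})

# 1) Initialize the store with metadata only; no chunk data is computed or written.
template.to_zarr(store, compute=False, consolidated=True)

# 2) Later (possibly from a separate job), write one rechunked piece into its slot
#    in the existing store rather than appending to the end. Variables without the
#    "time" dimension are dropped because the region write only touches "time".
piece = template.isel(time=slice(200, 300)).load() + 1.0
piece = piece.drop_vars([v for v in piece.variables if "time" not in piece[v].dims])
piece.to_zarr(store, region={"time": slice(200, 300)})
```

Writing disjoint, chunk-aligned regions (or entirely different variables) from separate jobs touches separate objects in the store, which is what would allow the per-variable parallelism mentioned above.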
Finally, we maintained in the Zarr stores the integer storage types and the scale_factor and add_offset information that originated in the model. We noted that several of the integers could likely be shortened from int32 to int16 without loss of information while providing improved compression (smaller Zarr stores).
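For illustration (the variable name and packing values here are assumptions, not those used in these stores), this kind of packed integer storage is expressed through the encoding passed to to_zarr:

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: a float field to be stored as packed int16 on disk.
ds = xr.Dataset({"streamflow": ("feature_id", np.random.rand(10).astype("float32"))})

# xarray writes round((value - add_offset) / scale_factor) to disk and applies the
# inverse on read, so users still see floats in memory while the store holds
# small integers that compress well.
encoding = {
    "streamflow": {
        "dtype": "int16",
        "scale_factor": 0.01,
        "add_offset": 0.0,
        "_FillValue": -9999,
    }
}
ds.to_zarr("example_packed.zarr", encoding=encoding, mode="w")
```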
There is a directory for each file type:

- lakeout/
- gwout/
- chrtout/
- precip/
- ldasout/
- rtout/

Each of these directories contains the following files for its type:

- type_report.log
- type_verification.log
- type_script_pbs.sh
- type_submit_pbs.py
- type_to_zarr.py
These files are explained below in detail. The log files contain all (most?) of the relevant information for understanding the data from both zarr and xarray perspectives.
type_report.log: This file is a report of metadata and other attributes (file sizes, chunk shapes and sizes, storage types, memory types, compression ratios, etc.). Reports are generated by the verify_output.py script using a command like:
(379zr) jamesmcc@crhtc55[1029]:~/WRF_Hydro/rechunk_retro_nwm_v21> ./verify_output.py /glade/scratch/jamesmcc/retro_collect/gwout/gwout.zarr | tee gwout/gwout_verification.log
The report starts with the full path to the file being reported.
Next is the total size of the Zarr store (from du -shc). Then the zarr info is printed for the full store, followed by the xarray info for the store. These provide very different information but only a high-level overview of the data set.
Next, information is printed for every variable in the store, again from both the xarray and zarr perspectives. This is important for two reasons:
- The data are stored as integers and converted to other types by xarray using scale_factor and add_offset metadata attributes
- There is compression being applied by Zarr.
The individual variable information shows how xarray represents the data in memory (e.g. float64), how zarr stores the data on disk (e.g. int32), the chunk shape (e.g. along (time, feature_id)), the total number of bytes (without compression), and the number of bytes actually stored (the compression ratio being computed from these last two byte counts). Then the full chunk size (without compression) is reported in MB. These numbers are helpful when iterating to obtain the desired chunk sizes.
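A sketch of how this kind of per-variable information can be pulled from a store with the zarr v2 python API (the store path is illustrative):

```python
import numpy as np
import xarray as xr
import zarr

store = "gwout.zarr"  # illustrative path to an existing store
zg = zarr.open_group(store, mode="r")
ds = xr.open_zarr(store)

for name, zvar in zg.arrays():
    # Compression ratio: uncompressed bytes over bytes actually stored.
    ratio = zvar.nbytes / max(zvar.nbytes_stored, 1)
    # Uncompressed size of one full chunk, in MB.
    chunk_mb = zvar.dtype.itemsize * np.prod(zvar.chunks) / 1024 ** 2
    print(
        f"{name}: memory dtype={ds[name].dtype}, disk dtype={zvar.dtype}, "
        f"chunks={zvar.chunks}, compression ratio={ratio:.1f}, "
        f"chunk size={chunk_mb:.1f} MB"
    )
```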
type_verification.log: This file is the log of a verification run which compares the output data against the input data. A hardwired number of datetimes (currently 30) is verified (over the full spatial domain) against the input files. The first and last times are always checked first; the remaining times are drawn at random (without replacement) from the time index of the zarr output file. The static variables in the file are checked once (at the first time) and the time-varying variables are checked at every sampled time. Values are checked to have an absolute difference less than 1e-8, and if nans are present the diffs are required to have the same number of nans as both the original and zarr datasets. The verification is logged in a similar manner to the report, as shown above.
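A sketch of the per-variable check and the time sampling described above (array names and sizes are illustrative):

```python
import numpy as np

def check_variable(input_values, zarr_values, atol=1e-8):
    """Compare one variable (one time, full space) between input files and the store."""
    diffs = input_values - zarr_values
    n_nan = int(np.isnan(diffs).sum())
    if n_nan:
        # The diffs must have the same number of nans as both source datasets,
        # i.e. no nans appeared or disappeared during rechunking.
        assert n_nan == int(np.isnan(input_values).sum())
        assert n_nan == int(np.isnan(zarr_values).sum())
    finite = np.abs(diffs[~np.isnan(diffs)])
    if finite.size:
        # All valid values must agree to within the absolute tolerance.
        assert finite.max() < atol

# Sampled times: first and last, then random draws without replacement.
rng = np.random.default_rng()
n_times, n_check = 1000, 30   # illustrative sizes
check_indices = np.concatenate(
    ([0, n_times - 1], rng.choice(np.arange(1, n_times - 1), n_check - 2, replace=False))
)
```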
type_script_pbs.sh: This is simply a static script designed to submit type_to_zarr.py to the PBS scheduler on the casper system at NCAR. It takes no arguments.
type_submit_pbs.py: This python script determines the number of times type_script_pbs.sh needs to be executed and submits that many dependent jobs to the PBS queue. The dependency does not require success, so the jobs will continue in the event one of them fails.
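A minimal sketch of that kind of chained submission (the script name and job count are illustrative); PBS's afterany dependency starts the next job when the previous one finishes, whether or not it succeeded:

```python
import subprocess

n_jobs = 12                      # illustrative: number of passes/appends needed
script = "gwout_script_pbs.sh"   # illustrative job script

previous_job_id = None
for _ in range(n_jobs):
    cmd = ["qsub"]
    if previous_job_id is not None:
        # afterany: start after the previous job ends, regardless of its exit status.
        cmd += ["-W", f"depend=afterany:{previous_job_id}"]
    cmd.append(script)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    previous_job_id = result.stdout.strip()   # qsub prints the new job id
```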
type_to_zarr.py: This is the python script which converts a given file type to a single zarr store. It checks whether the output store already exists; if it does, it picks up at the end of the existing output and continues the rechunking.
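A sketch of that restart logic (the paths, filename parsing, batch size, and chunk sizes are illustrative assumptions):

```python
import glob
import os
import pandas as pd
import xarray as xr

store = "gwout.zarr"                                         # illustrative output store
input_files = sorted(glob.glob("gwout_in/*.GWOUT_DOMAIN1"))  # illustrative inputs

if os.path.exists(store):
    # Resume: find the last time already written and skip files at or before it.
    last_written = pd.Timestamp(xr.open_zarr(store).time.values[-1])
    # Assumes the valid time leads the basename, e.g. 201912312300.GWOUT_DOMAIN1.
    input_files = [
        f for f in input_files
        if pd.to_datetime(os.path.basename(f)[:12], format="%Y%m%d%H%M") > last_written
    ]
    write_kwargs = {"append_dim": "time"}
else:
    write_kwargs = {"mode": "w"}

# Open the next batch of native files, rechunk, and write/append to the single store.
batch = xr.open_mfdataset(input_files[:672], combine="by_coords")
batch = batch.chunk({"time": 672, "feature_id": 10000})   # illustrative chunk sizes
batch.to_zarr(store, **write_kwargs)
```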
This file is not present for every type. It is the script that fixes a failed write to the current/latest chunk that was being appended when a job died. It is largely redundant with type_to_zarr.py and the two codes have not been consolidated.