This repo contains source code for the thesis Novel methods for developing large-scale, data-driven, biologically informed bioregionalisations
by Philip Dyer, published in 2024.
The source code requires some datasets to be available locally. Other datasets are downloaded on demand, and cached.
The output consists of R objects, stored in an R targets cache, and plots, stored in an outputs folder.
The branch f_varied_res was used to generate the results for the thesis.
Further development on the source code will take place at https://github.com/MathMarEcol/pdyer_aus_bio
This code is published as part of academic research; I do not intend to keep the source “closed”. I will release appropriate licensing information after consulting with my institution.
Once the license is released, you should be able to modify the code to fit your environment and extend the research.
- Get access to a Slurm workload manager on a Linux system, or modify the code to use another scheduler. The code currently assumes you have access to a Slurm workload manager; many HPC systems use Slurm. You can set up Slurm on a local computer, but how to do that is beyond the scope of this document.
- Make sure the nix package manager is on the path for all compute workers.
- Set up folders
  - Set up the location for long-term storage. This will be ROOT_STORE_DIR. Outputs, datasets, logs, and caches will be zipped and stored at this location.
  - Prepare the computing scratch location. Often this will be dynamically generated by the workload manager.
  - Make sure $TMPDIR on the workers has a lot of space available.
- Access the datasets and put them in the appropriate folders.
  - All datasets are stored in subfolders of $ROOT_STORE_DIR/data
- Some modifications to the code will be needed
  - Create a new HName entry in ./shell/aus_bio_submit.sh: follow the existing examples, make sure all env vars are defined, and end with a call to sbatch aus_bio_batch.sh
  - Create a new Host entry in ./R/functions/configure_parallel.R: follow the existing examples and make sure every worker type is defined (a hedged sketch is given below). These are also the places to add other workload managers.
- Run ./shell/aus_bio_submit.sh f_varied_res slurmacctstring ROOT_STORE_DIR_subfoldername
The datasets will be pulled to the working directory, the analysis will be performed in the working directory, and then the logs, some datasets, plots, and the R targets cache will be packed up and copied back to ROOT_STORE_DIR/subfoldername.
If the analysis does not complete, the partial results will be copied back. Subsequent runs will reuse the R targets cache to avoid re-running code that successfully completed and has not changed.
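As a rough illustration of the kind of Host entry ./R/functions/configure_parallel.R expects, here is a hypothetical sketch; the actual structure, option names, and worker types in the repository may differ, so treat it only as a guide to what "every worker type is defined" means.
configure_parallel <- function() {
  # Hypothetical sketch only: names and structure are illustrative,
  # not copied from the repository.
  host <- Sys.info()[["nodename"]]
  if (grepl("^my-hpc-login", host)) {
    # New Host entry: define settings for every worker type
    list(
      single    = list(workers = 20, mem_gb = 4),
      multicore = list(workers = 2,  mem_gb = 120),
      gpu       = list(workers = 1,  mem_gb = 120)
    )
  } else {
    stop("Unknown host: add an entry for ", host)
  }
}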
The blog post https://zameermanji.com/blog/2023/3/26/using-nix-without-root/ provides info about setting up Nix even if you do not have administrator rights on the machine.
In summary:
curl -L https://hydra.nixos.org/job/nix/maintenance-2.20/buildStatic.x86_64-linux/latest/download-by-type/file/binary-dist > nix
- Put the downloaded nix binary on the path.
  - Some HPC systems have a bin folder in each user’s home directory that can be used to add binaries to the path.
- Add and edit ~/.config/nix/nix.conf. The important settings are:
store = ~/mynixroot
extra-experimental-features = flakes nix-command
ssl-cert-file = /etc/pki/tls/cert.pem
where store is the location of the nix store, where all software will go. The store path can easily end up at 50GB or more, and uses a large number of inodes. Running nix store gc will remove excess files.
Data downloaded from https://data.bioplatforms.com/bpa/otu on 2019-07-03.
License depends on sample project, but samples I looked at were CC-BY-4.0-AU.
Login is required.
Amplicon was set to XXXXXX_bacteria and, under the contextual filter, Environment was set to marine.
Then download OTU and contextual data as CSV.
BioORACLE data are downloaded at runtime and cached. However, make sure that an empty folder is present at $ROOT_STORE_DIR/data/bioORACLE.
The R package sdmpredictors or biooracler is used to load the dataset.
License is GPL (version not specified, see https://bio-oracle.org/downloads-to-email.php).
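For example, the empty cache folder can be created from R before the first run. This is a minimal sketch, assuming ROOT_STORE_DIR is exported as an environment variable:
# Create the (initially empty) BioORACLE cache folder expected by the pipeline
root_store_dir <- Sys.getenv("ROOT_STORE_DIR")
dir.create(file.path(root_store_dir, "data", "bioORACLE"),
           recursive = TRUE, showWarnings = FALSE)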
Data is available through IMOS and the R package planktonr. Some data is fetched on demand from planktonr, so no further action is needed. Other data has been preprocessed for this project; please clone https://github.com/MathMarEcol/aus_cpr_for_bioregions into $ROOT_STORE_DIR/data/AusCPR/.
AODN prefers CC-BY-4.0
AusCPR is CC-BY-4.0
Sourced from https://marineregions.org/downloads.php.
License is CC-BY-NC-SA
Place extracted shapefiles into $ROOT_STORE_DIR/data/ShapeFiles/World_EEZ_v8/
The source code assumes the shapefiles are named World_EEZ_v8_2014_HR.
Sourced from the World Database of Protected Areas (WDPA https://www.protectedplanet.net/country/AUS).
Non-commercial use with attribution required.
Download the .SHP variant.
Note that WDPA splits the dataset up into three separate datasets. The source code assumes each dataset will be extracted and placed into:
$ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_0
$ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_1
$ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_2
Either follow this convention or modify ./R/functions/get_mpa_polys.R.
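A minimal sketch of creating the expected directory layout for the EEZ and WDPA files above, assuming ROOT_STORE_DIR is exported as an environment variable:
data_dir <- file.path(Sys.getenv("ROOT_STORE_DIR"), "data")
# EEZ shapefiles (files named World_EEZ_v8_2014_HR.*) go here
dir.create(file.path(data_dir, "ShapeFiles", "World_EEZ_v8"),
           recursive = TRUE, showWarnings = FALSE)
# The three extracted WDPA downloads go into aus_mpa_0, aus_mpa_1 and aus_mpa_2
for (i in 0:2) {
  dir.create(file.path(data_dir, "mpa_poly_june_2023", paste0("aus_mpa_", i)),
             recursive = TRUE, showWarnings = FALSE)
}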
Published by Watson and Tidd: https://doi.org/10.25959/5c522cadbea37
CC-BY-4.0 for data
Version 4 is available publicly. V5 is behind a login, and the source code expects some preprocessing.
As I do not currently have permission to share V5 and the preprocessing scripts, functionality related to this dataset has been commented out.
The whole project is assumed to be inside the MathMarEcol QRIScloud collection Q1216/pdyer.
The code/ folder contains drake_plan.R and the other scripts and code for the project.
The data are all stored in a different QRIScloud collection, Q1215.
Different HPC systems mount the QRIScloud data in different folders, but Q1215 and Q1216 are always sibling folders, so relative paths will work and are more reliable than hard-coded paths.
Given that HPC code should not be run over the network, I copy the relevant parts of Q1215 and Q1216 into 30days or something similar on Awoonga before running Rscript drake_plan.R.
Crew provides a unified frontend for workers.
There is no longer any need to differentiate between local and cluster execution, or to call a different top-level function depending on whether future, clustermq, or sequential execution is needed.
Always call tar_make() and ensure the controller tar_option is set appropriately.
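A minimal sketch of what that looks like, assuming the crew and crew.cluster packages are available; the controller names, worker counts, and Slurm settings here are illustrative, not the ones used in the repository:
library(targets)
# Two example controllers: one for local testing, one for a Slurm cluster
controller_local <- crew::crew_controller_local(
  name = "local",
  workers = 4
)
controller_slurm <- crew.cluster::crew_controller_slurm(
  name = "slurm_small",
  workers = 20,
  seconds_idle = 60
)
# Register both; each target then picks a controller by name via tar_resources_crew()
tar_option_set(
  controller = crew::crew_controller_group(controller_local, controller_slurm)
)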
Each target has a distinct resource requirement.
Some are small and fast, some require lots of memory, and some internally use parallelisation and benefit from having lots of cores available.
Experience tells me that it is better to compute targets sequentially rather than in parallel if the total runtime is the same. Parallel computation should only be used if there are spare resources.
In practice, this means that branches that internally run in parallel should be given the whole node.
- Branch types
  - single
    - single CPU, can run in parallel with other branches
  - GPU
    - needs the GPU, or a whole node for BLAS/LAPACK
    - BLAS may need the env var XXX_NUM_THREADS set according to the number of CPUs
  - multicore
    - the branch internally uses parallel, so can use a whole node
    - need to make sure future is configured

RAM requirements are set per job; 4GB is enough for many small jobs. Bigger jobs will need tuning according to the dataset and can use hundreds of GB.
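As a sketch of how a branch type is pinned to a particular controller, assuming the controller group above; the controller name, target name, function, and data are hypothetical, and memory and CPU counts are set where the controllers themselves are defined:
library(targets)
# Example target that uses internal parallelism, so it is sent to a
# whole-node controller registered under the (illustrative) name "multicore"
tar_target(
  gf_models,
  fit_gradient_forests(env_data, site_data),  # hypothetical project function
  resources = tar_resources(
    crew = tar_resources_crew(controller = "multicore")
  )
)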
One goal is to make the code run in different environments with minimal changes.
Crew helps, but different controllers are needed for different environments, e.g. local vs Slurm.
I may end up needing to use the configure_parallel function to simply list the controllers, and use a flag to choose between them.
Targets will use crew to assign branches to workers.
Some functions can run in parallel, but all use the future framework to decide if it is possible.
crew might be able to set up future plans for workers that expect multicore operations, but it does not seem to. Each target could instead set the plan just before calling the function; given that the resources are specified in the same place, the relevant information would be kept together.
future.callr is probably the most flexible and reliable for running within a single node. future.mirai is under development, but locally it behaves largely like future.callr.
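A sketch of that pattern, assuming a hypothetical worker function cluster_sites() that parallelises internally via future; the target sets the plan immediately before the parallel call, alongside the crew resources:
library(targets)
tar_target(
  site_clusters,
  {
    # Configure future for this multicore branch just before the parallel call;
    # future.callr runs each future in a fresh callr R process on the same node
    future::plan(future.callr::callr, workers = 12)
    cluster_sites(site_data)  # hypothetical function that uses future internally
  },
  resources = tar_resources(
    crew = tar_resources_crew(controller = "multicore")
  )
)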
If you really don’t have access to slurm or a workload manager:
git clone -b f_varied_res --single-branch https://github.com/MathMarEcol/pdyer_aus_bio.git ./code
- Copy all datasets into subfolders of ./code/R/data; see ./shell/aus_bio_control.sh for the appropriate folder names.
- From ./code/R, call R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"
  - To avoid issues with R package mismatches, put nix on your path and call
    NIX_GL_PREFIX="nixglhost -- "; nix develop github:PhDyellow/nix_r_dev_shell/${R_SHELL_REV}#devShells."x86_64-linux".r-shell -c $NIX_GL_PREFIX R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"
  - Leave out NIX_GL_PREFIX if you are not using a GPU or are on NixOS. If not using a GPU, make sure any calls to TENSOR_DEVICE are not set to CUDA in ./code/R/functions/configure_parallel.R.
This work © 2024 by Philip Dyer is licensed under CC BY 4.0