MathMarEcol/pdyer_aus_bio


Project for finding a Bioregionalisation around Australia

Overview

This repo contains source code for the thesis Novel methods for developing large-scale, data-driven, biologically informed bioregionalisations by Philip Dyer, published in 2024.

The source code requires some datasets to be available locally. Other datasets are downloaded on demand, and cached.

The output consists of R objects, stored in an R targets cache, and plots, stored in an outputs folder.
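
For orientation, outputs in the targets cache can be inspected from R roughly as below. This is a minimal sketch: the target name is hypothetical, and the cache is assumed to be the default _targets store in the working directory.

    library(targets)

    tar_progress()                      # list targets and their completion status
    plots <- tar_read(bioregion_plots)  # hypothetical target name; reads the object from the cache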

The branch f_varied_res was used to generate results for the thesis.

Further development on the source code will take place at https://github.com/MathMarEcol/pdyer_aus_bio

Modifying the Code

This code is published as part of academic research; I do not intend to keep the source “closed”. I will release appropriate licensing information after consulting with my institution.

Once the license is released, you should be able to modify the code to fit your environment and extend the research.

Running the Code

  1. Get access to a Slurm workload manager on a Linux system, or modify the code to use another scheduler. The code currently assumes Slurm, which many HPC systems use. You can set up Slurm on a local computer, but doing so is beyond the scope of this document.
  2. Make sure the nix package manager is on the path for all compute workers.
  3. Set up folders
    1. Set up the long-term storage location; this is referred to as ROOT_STORE_DIR. Outputs, datasets, logs, and caches will be zipped and stored at this location.
    2. Prepare the computing scratch location. Often this will be dynamically generated by the workload manager.
    3. Make sure $TMPDIR on the workers has a lot of space available.
  4. Access datasets and put them in the appropriate folder.
    1. All datasets are stored in subfolders of $ROOT_STORE_DIR/data
  5. Some modifications to the code will be needed:
    1. Create a new HName entry in ./shell/aus_bio_submit.sh; follow the existing examples, make sure all env vars are defined, and make sure the entry ends with a call to sbatch aus_bio_batch.sh.
    2. Create a new Host entry in ./R/functions/configure_parallel.R; follow the existing examples and make sure every worker type is defined (see the sketch at the end of this section).
    3. These are also the places to add support for other workload managers.
  6. Run ./shell/aus_bio_submit.sh f_varied_res slurmacctstring ROOT_STORE_DIR_subfoldername

The datasets will be pulled to the working directory and the analysis will be performed there; then logs, some datasets, plots, and the R targets cache will be packed up and copied back to ROOT_STORE_DIR/subfoldername.

If the analysis does not complete, the partial results will be copied back. Subsequent runs will reuse the R targets cache to avoid re-running code that successfully completed and has not changed.
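
As a rough sketch of the kind of Host entry step 5.2 asks for, the block below branches on the node name and defines one crew controller per worker type. The hostname pattern, controller names, and worker counts are illustrative assumptions, not the repository's actual configuration.

    # Illustrative Host entry: everything here is an assumption.
    hostname <- Sys.info()[["nodename"]]
    if (grepl("^myhpc", hostname)) {
      controller_single <- crew.cluster::crew_controller_slurm(
        name = "single",     # many small single-CPU branches
        workers = 20
      )
      controller_multicore <- crew.cluster::crew_controller_slurm(
        name = "multicore",  # whole-node branches that parallelise internally
        workers = 2
      )
    }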

Extra info

Getting nix

The blog post https://zameermanji.com/blog/2023/3/26/using-nix-without-root/ provides info about setting up Nix even if you do not have administrator rights on the machine.

In summary:

  1. curl -L https://hydra.nixos.org/job/nix/maintenance-2.20/buildStatic.x86_64-linux/latest/download-by-type/file/binary-dist > nix
  2. Put the downloaded nix binary on the path
    1. Some HPC systems have a bin folder in each user’s home directory that can be used to add binaries to the path.
  3. Add and edit ~/.config/nix/nix.conf. The important settings are:
	store = ~/mynixroot
	extra-experimental-features = flakes nix-command
	ssl-cert-file = /etc/pki/tls/cert.pem

where store is the location of the nix store, where all software will go.

Notes on Nix

The store path can easily grow to 50 GB or more and uses a large number of inodes. nix store gc will remove excess files.


Datasets and license notes

Australian Microbiome Initiative

Data downloaded from https://data.bioplatforms.com/bpa/otu on 2019-07-03.

License depends on the sample project, but the samples I looked at were CC-BY-4.0-AU.

Login is required.

Amplicon was set to XXXXXX_bacteria and, under the contextual filter, Environment was set to marine.

Then download OTU and contextual data as CSV.

BioORACLE

BioORACLE data are downloaded at runtime and cached. However, make sure that an empty folder is present at $ROOT_STORE_DIR/data/bioORACLE.

The R package sdmpredictors or biooracler is used to load the dataset.

License is GPL (version not specified, see https://bio-oracle.org/downloads-to-email.php).
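
A minimal sketch of downloading a layer into that folder with sdmpredictors; the layer code and exact call are illustrative, and the project's targets define which layers are actually used.

    library(sdmpredictors)

    # Cache Bio-ORACLE layers under $ROOT_STORE_DIR/data/bioORACLE.
    bio_oracle_dir <- file.path(Sys.getenv("ROOT_STORE_DIR"), "data", "bioORACLE")
    sst_mean <- load_layers("BO_sstmean", datadir = bio_oracle_dir)  # example layer code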

AusCPR

Data is available through IMOS and the R package planktonr.

Some data is fetched on demand via planktonr; no further action is needed.

Other data has been preprocessed for this project; please clone https://github.com/MathMarEcol/aus_cpr_for_bioregions into $ROOT_STORE_DIR/data/AusCPR/.

AODN prefers CC-BY-4.0

AusCPR is CC-BY-4.0

World EEZ v8

Sourced from https://marineregions.org/downloads.php.

License is CC-BY-NC-SA

Place extracted shapefiles into $ROOT_STORE_DIR/data/ShapeFiles/World_EEZ_v8/

The source code assumes the shapefiles are named World_EEZ_v8_2014_HR.
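
A minimal sketch of reading the shapefile with sf under that naming convention (the project's own loading code may differ):

    library(sf)

    # Read the World EEZ v8 polygons from the expected location.
    eez_path <- file.path(Sys.getenv("ROOT_STORE_DIR"),
                          "data", "ShapeFiles", "World_EEZ_v8",
                          "World_EEZ_v8_2014_HR.shp")
    eez <- st_read(eez_path)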

MPA polygons

Sourced from the World Database of Protected Areas (WDPA https://www.protectedplanet.net/country/AUS).

Non-commercial use with attribution required.

Download the .SHP variant.

Note that WDPA splits the data into three separate datasets. The source code assumes each dataset will be extracted and placed into:

  • $ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_0
  • $ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_1
  • $ROOT_STORE_DIR/data/mpa_poly_june_2023/aus_mpa_2

Either follow this convention or modify ./R/functions/get_mpa_polys.R.

Watson Fisheries Data

Published by Watson and Tidd: https://doi.org/10.25959/5c522cadbea37

CC-BY-4.0 for data

Version 4 is available publicly. V5 is behind a login, and the source code expects some preprocessing.

As I do not currently have permission to share V5 and the preprocessing scripts, functionality related to this dataset has been commented out.

Directory structure

The whole project is assumed to be inside the MathMarEcol QRIScloud collection Q1216/pdyer.

The code/ folder contains the drake_plan.R and other scripts and code for the project.

The data are all stored in a different QRIScloud collection, Q1215. Different HPC systems use different folders for the QRIScloud data, but Q1215 and Q1216 are always sibling folders, so relative paths will work and are more reliable than hard-coded paths.

Given that HPC code should not be run over the network, I copy the relevant parts of Q1215 and Q1216 into 30days or something similar on Awoonga before running Rscript drake_plan.R.

Update for targets and crew

Crew provides a unified frontend for workers.

There is no longer any need to differentiate between local and cluster execution, or to call a different top-level function depending on whether future, clustermq, or sequential execution is needed. Always call tar_make() and ensure the controller tar_option is set appropriately (see the sketch below).
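
In _targets.R this pattern looks roughly like the sketch below; the local controller and worker count are placeholders, since the real controllers come from configure_parallel.R.

    library(targets)

    # Register a crew controller once; tar_make() then dispatches branches to it,
    # whether the controller is local or a cluster one.
    tar_option_set(
      controller = crew::crew_controller_local(workers = 4)  # placeholder controller
    )

    # Run the pipeline the same way in every environment:
    # targets::tar_make()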

Balancing workloads

Each target has a distinct resource requirement.

Some are small and fast, some require lots of memory, and some use parallelisation internally and benefit from having lots of cores available.

Experience tells me that it is better to compute targets sequentially rather than in parallel if the total runtime is the same. Parallel computation should only be used if there are spare resources.

In practice, this means that branches that internally run in parallel should be given the whole node.

  • Branch types:
    • single: single CPU, can run in parallel with other branches.
    • GPU: needs the GPU, or a whole node for BLAS/LAPACK.
      • BLAS may need the env var XXX_NUM_THREADS set according to the number of CPUs.
    • multicore: the branch internally uses parallelism, so it can use a whole node.
      • Need to make sure future is configured.

RAM requirements are set per job; 4 GB is enough for many small jobs. Bigger jobs need tuning according to the dataset and can use hundreds of GB.
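
As a sketch, a whole-node branch can be pinned to a dedicated controller through crew resources; the target name, function, and controller name "multicore" below are hypothetical.

    library(targets)

    tar_target(
      gf_fit,                                      # hypothetical target
      {
        # BLAS threading (e.g. OPENBLAS_NUM_THREADS) could be set here to match the node.
        fit_gradient_forest(env_data, bio_data)    # hypothetical internally parallel function
      },
      resources = tar_resources(
        crew = tar_resources_crew(controller = "multicore")
      )
    )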

Making sure the right controllers are used

One goal is to make the code run in different environments with minimal changes.

Crew helps, but different controllers are needed for different environments, e.g. local vs Slurm.

I may end up using the configure_parallel function simply to list the controllers, with a flag to choose between them.
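
One hedged way to do that with crew is a controller group selected by an environment flag; the flag name, controller names, and worker counts below are assumptions for illustration only.

    configure_parallel <- function() {
      # Hypothetical flag: pick Slurm controllers on the cluster, local ones elsewhere.
      if (Sys.getenv("AUS_BIO_SCHEDULER", "local") == "slurm") {
        crew::crew_controller_group(
          crew.cluster::crew_controller_slurm(name = "single", workers = 20),
          crew.cluster::crew_controller_slurm(name = "multicore", workers = 2)
        )
      } else {
        crew::crew_controller_group(
          crew::crew_controller_local(name = "single", workers = 4),
          crew::crew_controller_local(name = "multicore", workers = 1)
        )
      }
    }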

Future framework

Targets will use crew to assign branches to workers.

Some functions can run in parallel, but all of them use the future framework to decide whether parallelism is possible.

crew might be able to set up future plans for workers that expect multicore operations, but it does not seem to. Instead, each target could set the plan just before calling the function. Given that the resources are specified in the same place, the relevant information would be kept together.

future.callr is probably the most flexible and reliable for running within a single node. future.mirai is under development, but locally it behaves largely like future.callr.
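
A sketch of that pattern, with a hypothetical target setting its future plan immediately before the internally parallel call:

    library(targets)

    tar_target(
      site_similarity,                     # hypothetical target
      {
        future::plan(future.callr::callr, workers = 12)  # plan set next to the resources
        compute_similarity(site_pairs)     # hypothetical function that uses future internally
      },
      resources = tar_resources(
        crew = tar_resources_crew(controller = "multicore")  # hypothetical controller name
      )
    )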

Run Locally

If you really don’t have access to Slurm or another workload manager:

  1. git clone -b f_varied_res --single-branch https://github.com/MathMarEcol/pdyer_aus_bio.git ./code
  2. Copy all datasets into subfolders of ./code/R/data; see ./shell/aus_bio_control.sh for the appropriate folder names
  3. From ./code/R, call R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"
    1. To avoid issues with R package mismatches, put nix on your path and call NIX_GL_PREFIX="nixglhost -- "; nix develop github:PhDyellow/nix_r_dev_shell/${R_SHELL_REV}#devShells."x86_64-linux".r-shell -c $NIX_GL_PREFIX R --vanilla -e "targets::tar_make(reporter = 'verbose_positives')"
    2. Leave out NIX_GL_PREFIX if you are not using a GPU or are on NixOS. If not using a GPU, make sure TENSOR_DEVICE is not set to CUDA in ./code/R/functions/configure_parallel.R

Licence

This work © 2024 by Philip Dyer is licensed under CC BY 4.0
