Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using git subrepo.
Some tools may only live here temporarily before finding a permanent home in
augur curate or Nextstrain CLI. Others may happily live out their days here.
Nextstrain-maintained pathogen repos will use git subrepo to vendor ingest scripts.
(See discussion on this decision in nextstrain/ingest#3.)
If you don't already have git subrepo installed, follow the git subrepo installation instructions.
Then add the latest ingest scripts to the pathogen repo by running:
git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
Any future updates of ingest scripts can be pulled in with:
git subrepo pull ingest/vendored
If you run into merge conflicts and would like to pull in a fresh copy of the
latest ingest scripts, pull with the --force flag:
git subrepo pull ingest/vendored --force
Warning: Beware of rebasing/dropping the parent commit of a git subrepo update.
git subrepo relies on metadata in the ingest/vendored/.gitrepo file,
which includes the hash for the parent commit in the pathogen repos.
If this hash no longer exists in the commit history, there will be errors when
running future git subrepo pull commands.
If you run into an error similar to the following:
$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''
Check the parent commit hash in the ingest/vendored/.gitrepo file and make
sure the commit exists in the commit history. Update to the appropriate parent
commit hash if needed.
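The check described above can be scripted. The following is a minimal sketch, not part of this repo: the function name check_subrepo_parent is hypothetical, and it assumes the .gitrepo file uses git-config syntax with a parent key under a [subrepo] section, so that git config can read it.

```shell
# Sketch: verify that the parent commit recorded in a .gitrepo file is still
# part of the history of the current branch.
# Assumption: .gitrepo is in git-config format with a [subrepo] section.
check_subrepo_parent() {
    gitrepo_file="${1:-ingest/vendored/.gitrepo}"

    # Read the recorded parent hash; git config exits non-zero if it is unset.
    parent="$(git config --file "$gitrepo_file" subrepo.parent)" || {
        echo "no parent commit recorded in $gitrepo_file" >&2
        return 2
    }

    # --is-ancestor exits 0 only when the commit is reachable from HEAD.
    if git merge-base --is-ancestor "$parent" HEAD 2>/dev/null; then
        echo "parent $parent is in the history of HEAD"
    else
        echo "parent $parent is missing from the history of HEAD" >&2
        return 1
    fi
}
```

Run from the root of the pathogen repo, check_subrepo_parent with no arguments inspects ingest/vendored/.gitrepo. A non-zero exit suggests the parent hash needs updating before the next git subrepo pull.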
Much of this tooling originated in ncov-ingest and was passaged thru monkeypox's ingest/. It subsequently proliferated from monkeypox to other pathogen repos (rsv, zika, dengue, hepatitisB, forecasts-ncov) primarily thru copying. To counter that proliferation, this repo was made.
The creation of this repo, in both the abstract and concrete, and the general approach to "ingest" has been discussed in various internal places, including:
- https://github.com/nextstrain/private/issues/59
- @joverlee521's workflows document
- 5 July 2023 Slack thread
- 6 July 2023 team meeting
- …many others
Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.
- notify-on-diff - Send Slack message with diff of a local file and an S3 object
- notify-on-job-fail - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- notify-on-job-start - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- notify-on-record-change - Send Slack message with details about line count changes for a file compared to an S3 object's metadata recordcount. If the S3 object's metadata does not have recordcount, then will attempt to download S3 object to count lines locally, which only supports xz compressed S3 objects.
- notify-slack - Send message or file to Slack
- s3-object-exists - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- trigger - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
- trigger-on-new-data - Triggers downstream GitHub Actions if the provided upload-to-s3 outputs do not contain the identical_file_message. A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.
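The decision made by trigger-on-new-data can be illustrated with a short sketch. It is not the script itself: the function name has_new_data, the output file names, and the message text in the usage note are all hypothetical. It only shows the idea of treating an upload as new data when its captured output lacks the identical-file message.

```shell
# Sketch: report new data (exit 0) if at least one upload output file does
# NOT contain the identical-file message; exit 1 if every file contains it.
has_new_data() {
    identical_file_message="$1"
    shift

    for output_file in "$@"; do
        # -F matches the message as a fixed string, -q suppresses output.
        if ! grep -qF -- "$identical_file_message" "$output_file"; then
            return 0
        fi
    done
    return 1
}
```

A workflow might call it as has_new_data "files are identical" metadata-upload.txt sequences-upload.txt and only trigger the downstream build when it exits 0.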
NCBI interaction scripts that are useful for fetching public metadata and sequences.
- fetch-from-ncbi-entrez - Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file. Useful for pathogens with metadata and annotations in custom fields that are not part of the standard NCBI Datasets outputs.
Historically, some pathogen repos used the undocumented NCBI Virus API through fetch-from-ncbi-virus to fetch data. However, we've opted to drop the NCBI Virus scripts due to nextstrain/ingest#18.
Potential Nextstrain CLI scripts
- sha256sum - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- cloudfront-invalidate - CloudFront invalidation is already supported in the nextstrain remote command for S3 files. This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
- upload-to-s3 - Upload file to AWS S3 bucket with compression based on file extension in S3 URL. Skips upload if the local file's hash is identical to the S3 object's metadata sha256sum. Adds the following user defined metadata to uploaded S3 object:
  - sha256sum - hash of the file generated by sha256sum
  - recordcount - the line count of the file
- download-from-s3 - Download file from AWS S3 bucket with decompression based on file extension in S3 URL. Skips download if the local file already exists and has a hash identical to the S3 object's metadata sha256sum.
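The skip-if-identical behaviour of upload-to-s3 and download-from-s3 comes down to comparing a local hash with the sha256sum stored in the S3 object's metadata. The sketch below shows that comparison in isolation. The function names are hypothetical, and it assumes the coreutils sha256sum command rather than the sha256sum script in this repo. No S3 access is involved; the remote hash is passed in as a plain string.

```shell
# Sketch: compute the two metadata values and decide whether a transfer is
# needed. Assumes coreutils `sha256sum` is on PATH.
file_sha256() {
    # sha256sum prints "<hash>  <filename>"; keep only the hash.
    sha256sum "$1" | cut -d ' ' -f 1
}

record_count() {
    # Line count of the file, with any padding from wc removed.
    wc -l < "$1" | tr -d ' '
}

needs_transfer() {
    local_file="$1"
    remote_hash="$2"   # sha256sum from the S3 object's metadata, or empty

    # Exit 0 (transfer needed) when the hashes differ.
    [ "$(file_sha256 "$local_file")" != "$remote_hash" ]
}
```

One way to obtain the remote hash, assuming the AWS CLI is configured, is aws s3api head-object --bucket BUCKET --key KEY --query Metadata.sha256sum --output text, where BUCKET and KEY are placeholders.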
Potential augur curate scripts
- apply-geolocation-rules - Applies user curated geolocation rules to NDJSON records
- merge-user-metadata - Merges user annotations with NDJSON records
- transform-authors - Abbreviates full author lists to '<first author> et al.'
- transform-field-names - Rename fields of NDJSON records
- transform-genbank-location - Parses location field with the expected pattern "<country_value>[:<region>][, <locality>]" based on GenBank's country field
- transform-strain-names - Ordered search for strain names across several fields.
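To make the "<country_value>[:<region>][, <locality>]" pattern used by transform-genbank-location concrete, here is a small sketch that splits such a value into three parts. It is an illustration only, not how the script works: the function names and the tab-separated output are hypothetical, and the real script operates on NDJSON records.

```shell
# Sketch: split "<country_value>[:<region>][, <locality>]" into three
# tab-separated fields (country, region, locality). Missing parts are empty.
trim() {
    # Strip leading and trailing whitespace.
    printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

parse_genbank_location() {
    value="$1"
    country="$value"
    region=""
    locality=""

    case "$value" in
        *:*)
            # Everything before the first colon is the country.
            country="${value%%:*}"
            rest="${value#*:}"
            case "$rest" in
                *,*) region="${rest%%,*}"; locality="${rest#*,}" ;;
                *)   region="$rest" ;;
            esac
            ;;
        *,*)
            # No region given: "<country_value>, <locality>".
            country="${value%%,*}"
            locality="${value#*,}"
            ;;
    esac

    printf '%s\t%s\t%s\n' "$(trim "$country")" "$(trim "$region")" "$(trim "$locality")"
}
```

For example, parse_genbank_location "USA: Washington, Seattle" yields the three fields USA, Washington and Seattle, while a bare country such as "Brazil" yields a country with empty region and locality.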
Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (/bin/bash) does not meet this requirement. You can install Homebrew's Bash which is more up to date.
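A quick way to check the requirement is to look at the major version in a Bash version string. The helper below is a sketch with a hypothetical name, bash_meets_minimum; it takes the version string as an argument so it can be pointed at any Bash.

```shell
# Sketch: exit 0 if a Bash version string such as "5.2.15(1)-release" has a
# major version of at least the given minimum (default 4).
bash_meets_minimum() {
    version="$1"
    minimum="${2:-4}"

    # The major version is everything before the first dot.
    major="${version%%.*}"

    # Reject empty or non-numeric values.
    case "$major" in
        ''|*[!0-9]*) return 1 ;;
    esac

    [ "$major" -ge "$minimum" ]
}
```

Inside a Bash script it could be called as bash_meets_minimum "$BASH_VERSION" || echo "Bash 4 or newer is required" >&2.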
Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.
For more locally testable scripts, Cram-style functional tests live in tests and are run as part of CI. To run these locally:
- Download Cram: pip install cram
- Run the tests: cram tests/