Workflows to run ADAPT on AWS Batch.
For more information on ADAPT and on how to run it, please see the ADAPT repository on GitHub.
1. Go to AWS CloudFormation and click "Create Stack". If prompted, click "With new resources (standard)".
2. Choose "Template is ready" and "Upload a template file". Upload `/cromwell-setup/vpcstack.json`, then hit "Next".
3. Name your stack (ex. `Cromwell-VPC`).
4. Select the Availability Zones for your VPC. You must select between 2 and 4. If you are unsure, select `us-east-1a`, `us-east-1b`, `us-east-1c`, and `us-east-1d`.
5. Set the number of Availability Zones to match the number you chose in Step 4.
6. Keep the defaults for the rest of the options on this page, and hit "Next".
7. Add any tags you would like, then hit "Next". Tags are added to all AWS resources built by the stack and serve as additional metadata.
8. Click "Create Stack".
9. After the stack has finished running, click on the stack name, click on "Outputs", and record each Private and Public Subnet ID (in the form `subnet-#################`) and the VPC ID (in the form `vpc-#################`). You will need them to set up the Genomics Workflow Core and Cromwell Resources.
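If you want to double-check the IDs you copied before pasting them into later stacks, they follow a fixed prefix-plus-hex format. A minimal shell sanity check, using made-up IDs (substitute the ones from your stack's Outputs tab):

```shell
# Hypothetical IDs -- substitute the values from your stack's Outputs tab
SUBNET_ID="subnet-0123456789abcdef0"
VPC_ID="vpc-0123456789abcdef0"

# Subnet IDs start with "subnet-" and VPC IDs with "vpc-",
# each followed by a run of hex characters
echo "$SUBNET_ID" | grep -Eq '^subnet-[0-9a-f]+$' && echo "subnet ID looks valid"
echo "$VPC_ID" | grep -Eq '^vpc-[0-9a-f]+$' && echo "VPC ID looks valid"
```

A mistyped or truncated ID will fail these checks and save you a failed stack run later.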
1. Open `Installing the Genomics Workflow Core and Cromwell.pdf` in `/cromwell-setup/` and follow the instructions. Whenever it asks for VPC subnets, use as many as you can from "Setting up a VPC".
2. If there are issues with running the stacks, try replacing "latest" with "v3.0.2" in any S3 file paths.
3. If it still is not working, upload the contents of `/cromwell-setup/cromwell-setup.zip` to an S3 bucket, and run the stacks using paths to your personal S3 bucket. These templates are slightly modified from the ones in `Installing the Genomics Workflow Core and Cromwell.pdf` so that AWS Batch uses the optimal instance size rather than selecting from a predefined list of instance types.
4. After the stacks have finished running, click on the core stack name, click on "Outputs", and record the `DefaultJobQueueArn`, the `PriorityJobQueueArn`, and the `S3BucketName`. You will need these to set up your input files. Then click on the resources stack name, click on "Outputs", and record the `HostName`. The `HostName` is how you will connect to your Cromwell server.
To do this, you will need the following values, which you recorded while building your server. If you did not record them, they can be found by going to AWS CloudFormation and following the instructions in Step 4 of "Setting up Genomics Workflow Core and Cromwell Resources".

- `DefaultJobQueueArn` or `PriorityJobQueueArn`: the Batch queue to run your jobs on. The `DefaultJobQueueArn` uses Spot instances if capacity is available, then On Demand instances; the `PriorityJobQueueArn` uses On Demand instances until a limit is reached, at which point it uses Spot instances. The `DefaultJobQueueArn` costs less, but the `PriorityJobQueueArn` runs faster. If you do not have access to the CloudFormation stack and need to find the `DefaultJobQueueArn` or `PriorityJobQueueArn`, go to the AWS Batch Management Console, click on "Job queues", and look for the queues with "Default" or "Priority" (and likely "Cromwell") in their names. Click on the one you want and record its ARN (Amazon Resource Name).
- `S3BucketName`: the S3 bucket where your Cromwell files are stored. If you do not have access to the CloudFormation stack and need to find the `S3BucketName`, go to the AWS S3 Management Console and click through the buckets until you find one with a folder called `_gwfcore`. Record this bucket's name.
- `HostName`: the URL for your server. If you do not have access to the CloudFormation stack and need to find the `HostName`, go to the AWS EC2 Management Console, go to your list of instances, and find the one named "cromwell-server" (or something similar). The "Public IPv4 DNS" of this instance is your `HostName`.
You may either use our Docker images or create your own. If you would like to use our Docker images, use `quay.io/broadinstitute/adaptcloud` to use cloud memoization features. Otherwise, or if you're unsure, use `quay.io/broadinstitute/adapt`.
If you would like to build your own Docker images, do the following:

1. Clone the ADAPT repository to your computer:
   ```
   $ git clone https://github.com/broadinstitute/adapt.git
   ```
2. Go into the repository and build the ADAPT Docker image:
   ```
   $ cd adapt
   $ docker build . -t adapt
   ```
3. If you would like to use cloud memoization features, also run:
   ```
   $ docker build . -t adaptcloud -f ./cloud.Dockerfile
   ```
If you are building your own Docker image, you will also need to publish it. You can do this either via Docker Hub or via AWS itself. The following instructions explain how to publish your image using AWS.

1. Install the AWS Command Line Interface.
2. Go to the Amazon ECR Management Console and click "Create Repository".
3. Name your repository, keep the other options at their defaults, and click "Create Repository".
4. Click on your repository's name, click "View push commands", and then follow the instructions listed there to push your Docker image to AWS.
5. Click back to the ECR home screen and record the URI of your image.
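ECR image URIs follow the pattern `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, so you can also reconstruct the URI yourself. A sketch with made-up values (substitute your own account ID, region, repository name, and tag):

```shell
# All values here are hypothetical; use your own account ID, region, and repo name
ACCOUNT_ID="123456789012"
REGION="us-east-1"
REPO="adapt"

# ECR URIs have the form <account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:latest"
echo "$IMAGE_URI"
# -> 123456789012.dkr.ecr.us-east-1.amazonaws.com/adapt:latest
```

This URI is what you will later paste into the `image` field of your JSON inputs.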
To send the job to your Cromwell server, you will need two or three files locally:

- A WDL workflow for ADAPT. To design for a single taxon, use `single_adapt.wdl`. To design for multiple taxa in parallel, use `parallel_adapt.wdl`.
- A JSON file of inputs to your WDL. To design for a single taxon, modify `single_adapt_input_template.json`. Details on each of the inputs are below:
    - `single_adapt.adapt.queueArn`: ARN (Amazon Resource Name) of the queue you want the jobs to run on. This should be either the `DefaultJobQueueArn` or the `PriorityJobQueueArn`.
    - `single_adapt.adapt.taxid`: taxonomic ID of the taxon to design for.
    - `single_adapt.adapt.ref_accs`: accession numbers of the reference sequences used by ADAPT for curation; separate multiple accessions with commas.
    - `single_adapt.adapt.segment`: segment number of the genome to design for; set to 'None' for unsegmented genomes.
    - `single_adapt.adapt.obj`: objective (either 'minimize-guides' or 'maximize-activity').
    - `single_adapt.adapt.specific`: true to be specific against the taxa listed in `specificity_taxa`; false to not be specific.
    - `single_adapt.adapt.image`: URI of the ADAPT Docker image to use.
    - `single_adapt.adapt.specificity_taxa`: optional; only needed if `specific` is true. AWS S3 path to a file that contains a list of taxa to be specific against. It should have no headings: taxonomic IDs in the first column and segment numbers in the second column.
    - `single_adapt.adapt.rand_sample`: optional; take a sample of this many sequences from the taxa to design for.
    - `single_adapt.adapt.rand_seed`: optional; set ADAPT's random seed to get consistent results across runs.
    - `single_adapt.adapt.bucket`: optional; S3 bucket for cloud memoization. May include a path to put the memo in a subfolder; do not include '/' at the end.
    - `single_adapt.adapt.memory`: optional; sets the memory each job uses. Defaults to 2 GB. If jobs fail unexpectedly, increase this.

  To design for multiple taxa in parallel, modify `parallel_adapt_input_template.json`. Details on each of the inputs are below:
    - `parallel_adapt.queueArn`: ARN (Amazon Resource Name) of the queue you want the jobs to run on. This should be either the `DefaultJobQueueArn` or the `PriorityJobQueueArn`.
    - `parallel_adapt.objs`: array of objective functions to design for; can include any of {"maximize-activity", "minimize-guides"}.
    - `parallel_adapt.sps`: array; include "true" to have designs made specific against any other order in the same family listed in `all_taxa_file`; include "false" to design nonspecifically.
    - `parallel_adapt.taxa_file`: AWS S3 path to a TSV file that contains a list of taxa to design for. Headings should be 'family', 'genus', 'species', 'taxid', 'segment', 'refseqs', 'neighbor-count'.
    - `parallel_adapt.format_taxa.all_taxa_file`: AWS S3 path to a TSV file that contains a list of all taxa to be specific against (note: specificity is only checked within a family). Can be the same file as `taxa_file`, with the same headings.
    - `parallel_adapt.adapt.image`: URI of the ADAPT Docker image to use.
    - `parallel_adapt.adapt.bucket`: optional; S3 bucket for cloud memoization. May include a path to put the memo in a subfolder; do not include '/' at the end.
    - `parallel_adapt.adapt.memory`: optional; sets the memory each job uses. Defaults to 2 GB. If jobs fail unexpectedly, increase this.
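As an illustration, a filled-in single-taxon input file might look like the following. All values here are placeholders (a made-up queue ARN, an example taxonomic ID and accession, and a public image tag); substitute your own, and match the value types used in `single_adapt_input_template.json`:

```json
{
  "single_adapt.adapt.queueArn": "arn:aws:batch:us-east-1:123456789012:job-queue/default-gwfcore",
  "single_adapt.adapt.taxid": 64320,
  "single_adapt.adapt.ref_accs": "NC_035889",
  "single_adapt.adapt.segment": "None",
  "single_adapt.adapt.obj": "maximize-activity",
  "single_adapt.adapt.specific": false,
  "single_adapt.adapt.image": "quay.io/broadinstitute/adapt:latest"
}
```

The optional inputs (`specificity_taxa`, `rand_sample`, `rand_seed`, `bucket`, `memory`) can simply be omitted if you do not need them.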
- A configuration file for AWS (optional; only necessary for running workflows through a Cromwell call). Modify anything that says `REGION`, `S3BUCKET`, or `QUEUEARN` in `aws-template.conf`. `REGION` should be the region in which your S3 bucket is stored and your job queues are; you should see something like `us-east-1` in the `DefaultJobQueueArn`/`PriorityJobQueueArn`, and this is the region. `S3BUCKET` should be the `S3BucketName`. `QUEUEARN` should be either the `DefaultJobQueueArn` or the `PriorityJobQueueArn`.
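The region is the fourth colon-separated field of any ARN, so you can read it straight off the queue ARN. A small shell sketch with a made-up ARN (substitute your own `DefaultJobQueueArn` or `PriorityJobQueueArn`):

```shell
# Hypothetical queue ARN; substitute your DefaultJobQueueArn or PriorityJobQueueArn
QUEUE_ARN="arn:aws:batch:us-east-1:123456789012:job-queue/default-gwfcore"

# ARNs have the form arn:partition:service:region:account-id:resource,
# so the region is field 4 when splitting on ":"
REGION=$(echo "$QUEUE_ARN" | cut -d: -f4)
echo "$REGION"
# -> us-east-1
```

Whatever this prints is the value to substitute for `REGION` in `aws-template.conf`.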
There are three ways to run a workflow on your Cromwell server: through the Swagger UI, through an HTTP POST command, or through a Cromwell call.

To access the Swagger UI, go to your `HostName` URL in a web browser. Note that it does not work in Chrome; use Firefox, Safari, Edge, or Internet Explorer instead. You may need to bypass a security warning about a self-signed certificate to access the page; to do so, click "Advanced" or "More Information" and then continue to the webpage.
To run your workflow, click `POST /api/workflows/{version}`, click "Try it Out", set `version` to "v1", upload your WDL workflow to `workflowSource`, upload your JSON input file to `workflowInputs`, set `workflowType` to "WDL", set `workflowTypeVersion` to "1.0", and click "Execute". Record the workflow ID that is returned.

To check the status of your workflow, click `GET /api/workflows/{version}/{id}/status`, click "Try it Out", set `version` to "v1", set `id` to the workflow ID returned earlier, and click "Execute".

To get the outputs of your workflow once it has finished running, click `GET /api/workflows/{version}/{id}/outputs`, click "Try it Out", set `version` to "v1", set `id` to the workflow ID returned earlier, and click "Execute". You will get S3 paths to the files containing your outputs, which you can access via the S3 dashboard.

You can keep track of the status of each job produced by the workflow in the AWS Batch Dashboard.
To run your workflow, open a terminal and run the following command:

```
$ curl -k -X POST "https://{HostName}/api/workflows/v1" \
    -H "accept: application/json" \
    -F "workflowSource=@{WDL Workflow}" \
    -F "workflowInputs=@{JSON Inputs}"
```

To check the status of your workflow, run the following command:

```
$ curl -k -X GET "https://{HostName}/api/workflows/v1/{id}/status"
```

To get the outputs of your workflow once it has finished running, run the following command:

```
$ curl -k -X GET "https://{HostName}/api/workflows/v1/{id}/outputs"
```

You will get S3 paths to the files containing your outputs, which you can access via the S3 dashboard. You can keep track of the status of each job produced by the workflow in the AWS Batch Dashboard.
First, you will need to download Cromwell; you only need `cromwell-54.jar`. You will also need to install the AWS Command Line Interface and add the `AWSBatchFullAccess` permissions policy to your account via IAM (click on "Users", your account name, "Add permissions", "Attach existing policies directly", "AWSBatchFullAccess", "Next: Review", and finally "Add permissions").

To run your workflow, open a terminal and run the following command:

```
$ java -Dconfig.file={AWS Configuration file} -jar {path to Cromwell jar file} run {WDL Workflow} -i {JSON Inputs}
```

You will get updates on the status of your workflow in the terminal, as well as the S3 paths to the files of your outputs. You can access these via the S3 dashboard.