Merge branch 'main' into gtr_substitution
priyappillai committed May 14, 2021
2 parents b0f1fd8 + 7c8b01b commit a5b226b
Showing 19 changed files with 166 additions and 74 deletions.
29 changes: 21 additions & 8 deletions README.md
@@ -101,15 +101,24 @@ You will need to activate the environment each time you use ADAPT.
## Downloading and installing

ADAPT is available via [Bioconda](https://anaconda.org/bioconda/adapt) for GNU/Linux and Windows operating systems and via [PyPI](https://pypi.org/project/adapt-diagnostics/) for all operating systems.

Before installing ADAPT via Bioconda, we suggest you follow the instructions in [Setting up a conda environment](#setting-up-a-conda-environment) to install Miniconda and activate the environment. To install via Bioconda, run the following command:
```bash
conda install -c bioconda adapt
```
If you want to be able to use AWS cloud features through ADAPT, run the following instead:
```bash
conda install -c bioconda "adapt[AWS]"
```

Before installing ADAPT via PyPI, we suggest you follow the instructions in the [Python documentation](https://docs.python.org/3/tutorial/venv.html) to set up and activate a virtual environment for ADAPT. To install via PyPI, run the following command:
Before installing ADAPT via PyPI, we suggest you follow the instructions in either the [Python documentation](https://docs.python.org/3/tutorial/venv.html) or [Setting up a conda environment](#setting-up-a-conda-environment) to set up and activate a virtual environment for ADAPT. To install via PyPI, run the following command:
```bash
pip install adapt-diagnostics
```
If you want to be able to use AWS cloud features through ADAPT, run the following instead:
```bash
pip install "adapt-diagnostics[AWS]"
```

If you wish to modify ADAPT's code, ADAPT can be installed by cloning the repository and installing the package with `pip`:
```bash
```
@@ -265,7 +274,7 @@ The value depends on the output values of the activity model and reflects a tole
'random-greedy' uses a randomized greedy algorithm (Buchbinder 2014) for constrained non-monotone submodular maximization, which has good worst-case guarantees.
(Default: 'random-greedy'.)

Note that, when the objective is to maximize activity, this objective requires a predictive model of activity and thus `--predict-activity-model-path` should be specified (details in [Miscellaneous key arguments](#miscellaneous-key-arguments)).
Note that, when the objective is to maximize activity, this objective requires a predictive model of activity and thus `--predict-activity-model-path` or `--predict-cas13a-activity-model` should be specified (details in [Miscellaneous key arguments](#miscellaneous-key-arguments)).
If you wish to use this objective but cannot use our pre-trained Cas13a model nor another model, see the help message for the argument `--use-simple-binary-activity-prediction`.

### Objective: minimizing complexity
@@ -275,15 +284,15 @@ With this objective, the following arguments to [`design.py`](./bin/design.py) a

* `-gm MISMATCHES`: Tolerate up to MISMATCHES mismatches when determining whether a guide detects a sequence.
This argument is mainly meant to be helpful in the absence of a predictive model of activity.
When using a predictive model of activity (via `--predict-activity-model-path` and `--predict-activity-thres`), this argument serves as an additional requirement for evaluating detection on top of the model; it can be effectively ignored by setting MISMATCHES to be sufficiently high.
When using a predictive model of activity (via `--predict-activity-model-path` or `--predict-cas13a-activity-model`), this argument serves as an additional requirement for evaluating detection on top of the model; it can be effectively ignored by setting MISMATCHES to be sufficiently high.
(Default: 0.)
* `--predict-activity-thres THRES_C THRES_R`: Thresholds for determining whether a guide-target pair is active and highly active.
THRES_C is a decision threshold on the output of the classifier (in \[0,1\]); predictions above this threshold are decided to be active.
Higher values have higher precision and less recall.
THRES_R is a decision threshold on the output of the regression model (at least 0); predictions above this threshold are decided to be highly active.
Higher values limit the number of pairs determined to be highly active.
To count as detecting a target sequence, a guide must be: (i) within MISMATCHES mismatches of the target sequence; (ii) classified as active; and (iii) predicted to be highly active.
Using this argument requires also setting `--predict-activity-model-path` (see [Miscellaneous key arguments](#miscellaneous-key-arguments)).
Using this argument requires also setting `--predict-activity-model-path` or `--predict-cas13a-activity-model` (see [Miscellaneous key arguments](#miscellaneous-key-arguments)).
As noted above, MISMATCHES can be set to be sufficiently high to effectively ignore `-gm`.
(Default: use the default thresholds included with the model.)
* `-gp COVER_FRAC`: Design guides such that at least a fraction COVER_FRAC of the genomes are detected by the guides.
@@ -375,10 +384,13 @@ If AWS CLI has been installed and configured and these arguments are passed, the

In addition to the arguments above, there are others that are often important when running [`design.py`](./bin/design.py):

* `--predict-activity-model-path MODEL_C MODEL_R`: Modles that predict activity of guide-target pairs.
* `--predict-cas13a-activity-model`: If set, use ADAPT's pre-trained Cas13 model to predict activity of guide-target pairs.
Classification and regression model files can be viewed in [`models/`](./models).
(Default: not set, which does not use predicted activity during design.)
* `--predict-activity-model-path MODEL_C MODEL_R`: Models that predict activity of guide-target pairs.
MODEL_C gives a classification model that predicts whether a guide-target pair is active, and MODEL_R gives a regression model that predicts a measure of activity on active pairs.
This does not need to be set if `--predict-cas13a-activity-model` is specified, but it is useful for custom models.
Each argument is a path to a serialized model in TensorFlow's SavedModel format.
Pre-trained classification and regression models are in [`models/`](./models).
With `--obj maximize-activity`, the models are essential because they inform ADAPT of the measurements it aims to maximize.
With `--obj minimize-guides`, the models constrain the design such that a guide must be highly active to detect a sequence (specified by `--predict-activity-thres`).
(Default: not set, which does not use predicted activity during design.)
@@ -459,9 +471,10 @@ This is the most simple example.
**It does not download genomes, search for genomic regions to target, or use a predictive model of activity; for these features, see the next example.**

The repository includes an alignment of Lassa virus sequences (S segment) from Sierra Leone in `examples/SLE_S.aligned.fasta`.
If you have installed ADAPT via Bioconda or PyPI, you'll need to download the alignment from [`here`](https://raw.githubusercontent.com/broadinstitute/adapt/main/examples/SLE_S.aligned.fasta).
Run:
```bash
design.py sliding-window fasta examples/SLE_S.aligned.fasta -o probes.tsv -w 200 -gl 28 -gm 1 -gp 0.95
design.py sliding-window fasta FASTA_PATH -o probes.tsv -w 200 -gl 28 -gm 1 -gp 0.95
```

From this alignment, ADAPT scans each 200 nt window (`-w 200`) to find the smallest collection of probes that:
@@ -479,7 +492,7 @@ It identifies Cas13a guides using a pre-trained predictive model of activity.

Run:
```bash
design.py complete-targets auto-from-args 64320 None guides.tsv -gl 28 --obj maximize-activity -pl 30 -pm 1 -pp 0.95 --predict-activity-model-path models/classify/model-51373185 models/regress/model-f8b6fd5d --best-n-targets 5 --mafft-path MAFFT_PATH --sample-seqs 50 --verbose
design.py complete-targets auto-from-args 64320 None guides.tsv -gl 28 --obj maximize-activity -pl 30 -pm 1 -pp 0.95 --predict-cas13a-activity-model --best-n-targets 5 --mafft-path MAFFT_PATH --sample-seqs 50 --verbose
```
This downloads and designs assays to detect genomes of Zika virus (NCBI taxonomy ID [64320](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=64320)).
You must fill in `MAFFT_PATH` with an executable of MAFFT.
12 changes: 6 additions & 6 deletions adapt/utils/predict_activity.py
@@ -79,9 +79,9 @@ def __init__(self, classification_model_path, regression_model_path,
# Load context_nt; this should be the same for the classification
# and regression models
classification_context_nt_path = os.path.join(
classification_model_path, 'assets.extra/context_nt.arg')
classification_model_path, 'assets.extra', 'context_nt.arg')
regression_context_nt_path = os.path.join(
regression_model_path, 'assets.extra/context_nt.arg')
regression_model_path, 'assets.extra', 'context_nt.arg')
if not os.path.isfile(classification_context_nt_path):
raise Exception(("Unknown context_nt for classification model; "
"the model should have a assets.extra/context_nt.arg file"))
@@ -100,9 +100,9 @@ def __init__(self, classification_model_path, regression_model_path,
# Load guide_length; this should be the same for the classification
# and regression models
classification_guide_length_path = os.path.join(
classification_model_path, 'assets.extra/guide_length.arg')
classification_model_path, 'assets.extra', 'guide_length.arg')
regression_guide_length_path = os.path.join(
regression_model_path, 'assets.extra/guide_length.arg')
regression_model_path, 'assets.extra', 'guide_length.arg')
if not os.path.isfile(classification_guide_length_path):
raise Exception(("Unknown guide_length for classification model; "
"the model should have a assets.extra/guide_length.arg file"))
@@ -126,7 +126,7 @@ def __init__(self, classification_model_path, regression_model_path,
# Read default threshold
classification_default_threshold_path = os.path.join(
classification_model_path,
'assets.extra/default_threshold.arg')
'assets.extra', 'default_threshold.arg')
if not os.path.isfile(classification_default_threshold_path):
raise Exception(("Unknown default threshold for classification "
"model; the model should have a "
@@ -143,7 +143,7 @@ def __init__(self, classification_model_path, regression_model_path,
# Read default threshold
regression_default_threshold_path = os.path.join(
regression_model_path,
'assets.extra/default_threshold.arg')
'assets.extra', 'default_threshold.arg')
if not os.path.isfile(regression_default_threshold_path):
raise Exception(("Unknown default threshold for regression "
"model; the model should have a "
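
The hunks above replace a single `'assets.extra/<file>'` component with two separate arguments to `os.path.join`. The commit does not state its motivation; one plausible reading is that it avoids mixing separators on Windows. The following is a minimal sketch of that difference, using the standard-library `ntpath` and `posixpath` modules so that it runs on any platform. The model directory names are hypothetical.

```python
import ntpath      # Windows flavour of os.path, importable on any OS
import posixpath   # POSIX flavour of os.path, importable on any OS

# Hypothetical model directory, for illustration only
model_dir_win = r'C:\models\classify'

# Old style: the '/' is embedded in one component and is kept verbatim
embedded = ntpath.join(model_dir_win, 'assets.extra/context_nt.arg')
# New style: each component is joined with the platform separator
split = ntpath.join(model_dir_win, 'assets.extra', 'context_nt.arg')

print(embedded)  # C:\models\classify\assets.extra/context_nt.arg
print(split)     # C:\models\classify\assets.extra\context_nt.arg

# On POSIX the two spellings produce the same path
posix_embedded = posixpath.join('models/classify', 'assets.extra/context_nt.arg')
posix_split = posixpath.join('models/classify', 'assets.extra', 'context_nt.arg')
print(posix_embedded == posix_split)  # True
```

Windows generally accepts either separator when opening a file, so the old form was not necessarily failing; the separate-component form simply yields a path with consistent separators.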
20 changes: 13 additions & 7 deletions adapt/utils/tests/test_predict_activity.py
@@ -4,9 +4,11 @@
import unittest

import numpy as np
import os

from adapt import alignment
from adapt.utils import predict_activity
from adapt.utils.version import get_project_path, get_latest_model_version

__author__ = 'Hayden Metsky <hayden@mit.edu>'

@@ -17,11 +19,15 @@ class TestPredictor(unittest.TestCase):

def setUp(self):
# Use the provided models with default thresholds
classification_model_path = 'models/classify/model-51373185'
regression_model_path = 'models/regress/model-f8b6fd5d'
self.predictor = predict_activity.Predictor(
classification_model_path,
regression_model_path)
dir_path = get_project_path()
cla_path_all = os.path.join(dir_path, 'models', 'classify', 'cas13a')
reg_path_all = os.path.join(dir_path, 'models', 'regress', 'cas13a')
cla_version = get_latest_model_version(cla_path_all)
reg_version = get_latest_model_version(reg_path_all)
cla_path = os.path.join(cla_path_all, cla_version)
reg_path = os.path.join(reg_path_all, reg_version)

self.predictor = predict_activity.Predictor(cla_path, reg_path)

def test_model_input_from_nt(self):
# Make context (both ends) be all 'A'
@@ -69,7 +75,7 @@ def test_classify_and_decide(self):
target_with_context_2 = ('A'*self.predictor.context_nt +
'G'*28 + 'A'*self.predictor.context_nt)
guide_2 = 'G'*28

pairs = [(target_with_context_1, guide_1), (target_with_context_2,
guide_2)]
pairs_onehot = self.predictor._model_input_from_nt(pairs)
@@ -91,7 +97,7 @@ def test_regress(self):
target_with_context_2 = ('A'*self.predictor.context_nt +
'A'*28 + 'A'*self.predictor.context_nt)
guide_2 = 'A'*28

pairs = [(target_with_context_1, guide_1), (target_with_context_2,
guide_2)]
pairs_onehot = self.predictor._model_input_from_nt(pairs)
38 changes: 35 additions & 3 deletions adapt/utils/version.py
@@ -23,17 +23,17 @@


def get_project_path():
"""Determine absolute path to the top-level of the catch project.
"""Determine absolute path to the top-level of the project.
This is assumed to be the parent of the directory containing this script.
Returns:
path (string) to top-level of the catch project
path (string) to top-level of the project
"""
# abspath converts relative to absolute path; expanduser interprets ~
path = __file__ # path to this script
path = os.path.expanduser(path) # interpret ~
path = os.path.abspath(path) # convert to absolute path
path = os.path.dirname(path) # containing directory: utils
path = os.path.dirname(path) # containing directory: catch project dir
path = os.path.dirname(path) # containing directory: project dir
return path


@@ -118,6 +118,38 @@ def get_version():
return __version__


def get_latest_model_version(model_path):
"""Get latest model version, given the model path
"""
# List all model versions in path
model_versions = os.listdir(model_path)
# Get a list of the versions
# Each version is represented as a list of numbers
model_versions_numeric = []
for model_version in model_versions:
if model_version.startswith('v'):
model_version_numeric = []
skip = False
for i in model_version[1:].split('_'):
if not i.isdecimal():
skip = True
break
else:
model_version_numeric.append(int(i))
if not skip and len(model_version_numeric) > 0:
model_versions_numeric.append(model_version_numeric)

# If there were no models found on the path, raise an error
if len(model_versions_numeric) == 0:
raise ValueError("There are no appropriately formatted models in the "
"model path. Please make sure the models are in a folder with the "
"format 'v_#_#'")

# Remake the version string
latest_version = [str(i) for i in sorted(model_versions_numeric)[-1]]
return 'v' + '_'.join(latest_version)


if __name__ == "__main__":
# Determine and print the package version
print(get_version())
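
The new `get_latest_model_version` picks the newest `v#_#` directory by comparing versions as lists of integers rather than as strings. The following is a condensed sketch of the same idea, not the committed code: the function name `latest_version` and the directory names are made up for illustration, and it runs against a temporary directory.

```python
import os
import tempfile


def latest_version(model_path):
    """Return the newest 'v<int>_<int>...' entry name under model_path.

    Illustrative re-implementation; the name is hypothetical.
    """
    versions = []
    for name in os.listdir(model_path):
        if not name.startswith('v'):
            continue
        parts = name[1:].split('_')
        # Keep only names whose components are all decimal integers
        if all(p.isdecimal() for p in parts):
            versions.append([int(p) for p in parts])
    if not versions:
        raise ValueError("no entries named like 'v1_0' in %s" % model_path)
    # Lists of ints compare element by element, so [1, 10] > [1, 2]
    return 'v' + '_'.join(str(p) for p in max(versions))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: three valid versions and two names to be ignored
    for name in ('v1_0', 'v1_2', 'v1_10', 'notes', 'v2_beta'):
        os.mkdir(os.path.join(tmp, name))
    result = latest_version(tmp)
    print(result)  # v1_10
```

A plain string sort would rank `v1_2` above `v1_10`, which is why the components are converted to integers first. One consequence, shared with the committed function, is that the returned name is rebuilt from those integers, so a zero-padded directory such as `v01_0` would come back as `v1_0` and no longer match the name on disk.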