Merge pull request #52 from broadinstitute/include_cas13

Make the README examples work (Cas13a argument); include models in package; edit README
broadinstitute · May 14, 2021 · 7c8b01b
2 parents cb31621 + cb6758d
commit 7c8b01b

Showing 19 changed files with 167 additions and 75 deletions.
diff --git a/README.md b/README.md
@@ -101,15 +101,24 @@ You will need to activate the environment each time you use ADAPT.
 ## Downloading and installing
 
 ADAPT is available via [Bioconda](https://anaconda.org/bioconda/adapt) for GNU/Linux and Windows operating systems and via [PyPI](https://pypi.org/project/adapt-diagnostics/) for all operating systems.
+
 Before installing ADAPT via Bioconda, we suggest you follow the instructions in [Setting up a conda environment](#setting-up-a-conda-environment) to install Miniconda and activate the environment. To install via Bioconda, run the following command:
 ```bash
 conda install -c bioconda adapt
 ```
+If you want to be able to use AWS cloud features through ADAPT, run the following instead:
+```bash
+conda install -c bioconda "adapt[AWS]"
+```
 
-Before installing ADAPT via PyPI, we suggest you follow the instructions in the [Python documentation](https://docs.python.org/3/tutorial/venv.html) to set up and activate a virtual environment for ADAPT. To install via PyPI, run the following command:
+Before installing ADAPT via PyPI, we suggest you follow the instructions in either the [Python documentation](https://docs.python.org/3/tutorial/venv.html) or [Setting up a conda environment](#setting-up-a-conda-environment) to set up and activate a virtual environment for ADAPT. To install via PyPI, run the following command:
 ```bash
 pip install adapt-diagnostics
 ```
+If you want to be able to use AWS cloud features through ADAPT, run the following instead:
+```bash
+pip install "adapt-diagnostics[AWS]"
+```
 
 If you wish to modify ADAPT's code, ADAPT can be installed by cloning the repository and installing the package with `pip`:
 ```bash
@@ -265,7 +274,7 @@ The value depends on the output values of the activity model and reflects a tole
 'random-greedy' uses a randomized greedy algorithm (Buchbinder 2014) for constrained non-monotone submodular maximization, which has good worst-case guarantees.
 (Default: 'random-greedy'.)
 
-Note that, when the objective is to maximize activity, this objective requires a predictive model of activity and thus `--predict-activity-model-path` should be specified (details in [Miscellaneous key arguments](#miscellaneous-key-arguments)).
+Note that, when the objective is to maximize activity, this objective requires a predictive model of activity and thus `--predict-activity-model-path` or `--predict-cas13a-activity-model` should be specified (details in [Miscellaneous key arguments](#miscellaneous-key-arguments)).
 If you wish to use this objective but cannot use our pre-trained Cas13a model nor another model, see the help message for the argument `--use-simple-binary-activity-prediction`.
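
For readers unfamiliar with the 'random-greedy' option named above, the following is a minimal sketch of one common formulation of randomized greedy selection under a cardinality constraint. It is not ADAPT's implementation: the function name `random_greedy`, the toy guides, and the toy objective are all hypothetical and exist only for illustration.

```python
import random


def random_greedy(ground_set, f, k, seed=None):
    """Pick at most k elements, aiming for a high value of set function f.

    In each of k rounds: rank the unselected elements by marginal gain,
    keep the k best (dropping any with negative gain and padding with
    no-op "dummy" slots), then add one slot chosen uniformly at random.
    """
    rng = random.Random(seed)
    selected = set()
    for _ in range(k):
        base = f(selected)
        gains = [(f(selected | {e}) - base, e)
                 for e in ground_set if e not in selected]
        gains.sort(key=lambda pair: pair[0], reverse=True)
        # Top-k candidates with non-negative gain; None marks a dummy slot
        candidates = [e for gain, e in gains[:k] if gain >= 0]
        candidates += [None] * (k - len(candidates))
        choice = rng.choice(candidates)
        if choice is not None:
            selected.add(choice)
    return selected


# Hypothetical guides and the sequences each one covers
coverage = {
    'g1': {1, 2, 3},
    'g2': {3, 4},
    'g3': {5},
    'g4': {1, 2},
}


def objective(guides):
    """Toy non-monotone objective: sequences covered minus a per-guide cost."""
    covered = set().union(*(coverage[g] for g in guides))
    return len(covered) - 0.5 * len(guides)


chosen = random_greedy(list(coverage), objective, k=2, seed=0)
print(chosen, objective(chosen))
```

Because each round picks at random among the best candidates rather than always taking the single best, different seeds can return different sets.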
 
 ### Objective: minimizing complexity
@@ -275,15 +284,15 @@ With this objective, the following arguments to [`design.py`](./bin/design.py) a
 
 * `-gm MISMATCHES`: Tolerate up to MISMATCHES mismatches when determining whether a guide detects a sequence.
 This argument is mainly meant to be helpful in the absence of a predictive model of activity.
-When using a predictive model of activity (via `--predict-activity-model-path` and `--predict-activity-thres`), this argument serves as an additional requirement for evaluating detection on top of the model; it can be effectively ignored by setting MISMATCHES to be sufficiently high.
+When using a predictive model of activity (via `--predict-activity-model-path` or `--predict-cas13a-activity-model`), this argument serves as an additional requirement for evaluating detection on top of the model; it can be effectively ignored by setting MISMATCHES to be sufficiently high.
 (Default: 0.)
 * `--predict-activity-thres THRES_C THRES_R`: Thresholds for determining whether a guide-target pair is active and highly active.
 THRES_C is a decision threshold on the output of the classifier (in \[0,1\]); predictions above this threshold are decided to be active.
 Higher values have higher precision and less recall.
 THRES_R is a decision threshold on the output of the regression model (at least 0); predictions above this threshold are decided to be highly active.
 Higher values limit the number of pairs determined to be highly active.
 To count as detecting a target sequence, a guide must be: (i) within MISMATCHES mismatches of the target sequence; (ii) classified as active; and (iii) predicted to be highly active.
-Using this argument requires also setting `--predict-activity-model-path` (see [Miscellaneous key arguments](#miscellaneous-key-arguments)).
+Using this argument requires also setting `--predict-activity-model-path` or `--predict-cas13a-activity-model` (see [Miscellaneous key arguments](#miscellaneous-key-arguments)).
 As noted above, MISMATCHES can be set to be sufficiently high to effectively ignore `-gm`.
 (Default: use the default thresholds included with the model.)
 * `-gp COVER_FRAC`: Design guides such that at least a fraction COVER_FRAC of the genomes are detected by the guides.
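
The three conditions listed under `--predict-activity-thres` above combine as a logical AND. A minimal sketch follows, assuming the classifier and regression outputs for a guide-target pair are already available as numbers. The function name, parameter names, and threshold values are hypothetical and are not ADAPT's API or defaults.

```python
def guide_detects_target(num_mismatches, classifier_score, regression_score,
                         max_mismatches, thres_c, thres_r):
    """Return True only if all three detection conditions hold."""
    # (i) within MISMATCHES mismatches of the target sequence
    within_mismatches = num_mismatches <= max_mismatches
    # (ii) classified as active: classifier output above THRES_C
    is_active = classifier_score > thres_c
    # (iii) predicted highly active: regression output above THRES_R
    is_highly_active = regression_score > thres_r
    return within_mismatches and is_active and is_highly_active


# Illustrative thresholds only; real defaults ship with the model
print(guide_detects_target(1, 0.9, 3.0,
                           max_mismatches=1, thres_c=0.5, thres_r=2.0))
print(guide_detects_target(2, 0.9, 3.0,
                           max_mismatches=1, thres_c=0.5, thres_r=2.0))
```

Setting `max_mismatches` to a large value makes condition (i) always true, which is how `-gm` can be effectively ignored when a model is in use.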
@@ -375,10 +384,13 @@ If AWS CLI has been installed and configured and these arguments are passed, the
 
 In addition to the arguments above, there are others that are often important when running [`design.py`](./bin/design.py):
 
-* `--predict-activity-model-path MODEL_C MODEL_R`: Modles that predict activity of guide-target pairs.
+* `--predict-cas13a-activity-model`: If set, use ADAPT's pre-trained Cas13a model to predict activity of guide-target pairs.
+Classification and regression model files can be viewed in [`models/`](./models).
+(Default: not set, which does not use predicted activity during design.)
+* `--predict-activity-model-path MODEL_C MODEL_R`: Models that predict activity of guide-target pairs.
 MODEL_C gives a classification model that predicts whether a guide-target pair is active, and MODEL_R gives a regression model that predicts a measure of activity on active pairs.
+This does not need to be set if `--predict-cas13a-activity-model` is specified, but it is useful for custom models.
 Each argument is a path to a serialized model in TensorFlow's SavedModel format.
-Pre-trained classification and regression models are in [`models/`](./models).
 With `--obj maximize-activity`, the models are essential because they inform ADAPT of the measurements it aims to maximize.
 With `--obj minimize-guides`, the models constrain the design such that a guide must be highly active to detect a sequence (specified by `--predict-activity-thres`).
 (Default: not set, which does not use predicted activity during design.)
@@ -459,9 +471,10 @@ This is the most simple example.
 **It does not download genomes, search for genomic regions to target, or use a predictive model of activity; for these features, see the next example.**
 
 The repository includes an alignment of Lassa virus sequences (S segment) from Sierra Leone in `examples/SLE_S.aligned.fasta`.
+If you have installed ADAPT via Bioconda or PyPI, you'll need to download the alignment from [`here`](https://raw.githubusercontent.com/broadinstitute/adapt/main/examples/SLE_S.aligned.fasta).
 Run:
 ```bash
-design.py sliding-window fasta examples/SLE_S.aligned.fasta -o probes.tsv -w 200 -gl 28 -gm 1 -gp 0.95
+design.py sliding-window fasta FASTA_PATH -o probes.tsv -w 200 -gl 28 -gm 1 -gp 0.95
 ```
 
 From this alignment, ADAPT scans each 200 nt window (`-w 200`) to find the smallest collection of probes that:
@@ -479,7 +492,7 @@ It identifies Cas13a guides using a pre-trained predictive model of activity.
 
 Run:
 ```bash
-design.py complete-targets auto-from-args 64320 None guides.tsv -gl 28 --obj maximize-activity -pl 30 -pm 1 -pp 0.95 --predict-activity-model-path models/classify/model-51373185 models/regress/model-f8b6fd5d --best-n-targets 5 --mafft-path MAFFT_PATH --sample-seqs 50 --verbose
+design.py complete-targets auto-from-args 64320 None guides.tsv -gl 28 --obj maximize-activity -pl 30 -pm 1 -pp 0.95 --predict-cas13a-activity-model --best-n-targets 5 --mafft-path MAFFT_PATH --sample-seqs 50 --verbose
 ```
 This downloads and designs assays to detect genomes of Zika virus (NCBI taxonomy ID [64320](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=64320)).
 You must fill in `MAFFT_PATH` with an executable of MAFFT.

diff --git a/...odel-51373185/assets.extra/context_nt.arg → ...y/cas13a/v1_0/assets.extra/context_nt.arg b/...odel-51373185/assets.extra/context_nt.arg → ...y/cas13a/v1_0/assets.extra/context_nt.arg
diff --git a/...373185/assets.extra/default_threshold.arg → ...a/v1_0/assets.extra/default_threshold.arg b/...373185/assets.extra/default_threshold.arg → ...a/v1_0/assets.extra/default_threshold.arg
diff --git a/...el-51373185/assets.extra/guide_length.arg → ...cas13a/v1_0/assets.extra/guide_length.arg b/...el-51373185/assets.extra/guide_length.arg → ...cas13a/v1_0/assets.extra/guide_length.arg
diff --git a/...ls/classify/model-51373185/saved_model.pb → ...odels/classify/cas13a/v1_0/saved_model.pb b/...ls/classify/model-51373185/saved_model.pb → ...odels/classify/cas13a/v1_0/saved_model.pb
diff --git a/...5/variables/variables.data-00000-of-00001 → ...0/variables/variables.data-00000-of-00001 b/...5/variables/variables.data-00000-of-00001 → ...0/variables/variables.data-00000-of-00001
diff --git a/.../model-51373185/variables/variables.index → ...ify/cas13a/v1_0/variables/variables.index b/.../model-51373185/variables/variables.index → ...ify/cas13a/v1_0/variables/variables.index
diff --git a/...odel-f8b6fd5d/assets.extra/context_nt.arg → ...s/cas13a/v1_0/assets.extra/context_nt.arg b/...odel-f8b6fd5d/assets.extra/context_nt.arg → ...s/cas13a/v1_0/assets.extra/context_nt.arg
diff --git a/...b6fd5d/assets.extra/default_threshold.arg → ...a/v1_0/assets.extra/default_threshold.arg b/...b6fd5d/assets.extra/default_threshold.arg → ...a/v1_0/assets.extra/default_threshold.arg
diff --git a/...el-f8b6fd5d/assets.extra/guide_length.arg → ...cas13a/v1_0/assets.extra/guide_length.arg b/...el-f8b6fd5d/assets.extra/guide_length.arg → ...cas13a/v1_0/assets.extra/guide_length.arg
diff --git a/models/regress/model-f8b6fd5d/saved_model.pb → ...models/regress/cas13a/v1_0/saved_model.pb b/models/regress/model-f8b6fd5d/saved_model.pb → ...models/regress/cas13a/v1_0/saved_model.pb
diff --git a/...d/variables/variables.data-00000-of-00001 → ...0/variables/variables.data-00000-of-00001 b/...d/variables/variables.data-00000-of-00001 → ...0/variables/variables.data-00000-of-00001
diff --git a/.../model-f8b6fd5d/variables/variables.index → ...ess/cas13a/v1_0/variables/variables.index b/.../model-f8b6fd5d/variables/variables.index → ...ess/cas13a/v1_0/variables/variables.index
diff --git a/adapt/utils/predict_activity.py b/adapt/utils/predict_activity.py
@@ -79,9 +79,9 @@ def __init__(self, classification_model_path, regression_model_path,
         # Load context_nt; this should be the same for the classification
         # and regression models
         classification_context_nt_path = os.path.join(
-                classification_model_path, 'assets.extra/context_nt.arg')
+                classification_model_path, 'assets.extra', 'context_nt.arg')
         regression_context_nt_path = os.path.join(
-                regression_model_path, 'assets.extra/context_nt.arg')
+                regression_model_path, 'assets.extra', 'context_nt.arg')
         if not os.path.isfile(classification_context_nt_path):
             raise Exception(("Unknown context_nt for classification model; "
                 "the model should have a assets.extra/context_nt.arg file"))
@@ -100,9 +100,9 @@ def __init__(self, classification_model_path, regression_model_path,
         # Load guide_length; this should be the same for the classification
         # and regression models
         classification_guide_length_path = os.path.join(
-                classification_model_path, 'assets.extra/guide_length.arg')
+                classification_model_path, 'assets.extra', 'guide_length.arg')
         regression_guide_length_path = os.path.join(
-                regression_model_path, 'assets.extra/guide_length.arg')
+                regression_model_path, 'assets.extra', 'guide_length.arg')
         if not os.path.isfile(classification_guide_length_path):
             raise Exception(("Unknown guide_length for classification model; "
                 "the model should have a assets.extra/guide_length.arg file"))
@@ -126,7 +126,7 @@ def __init__(self, classification_model_path, regression_model_path,
             # Read default threshold
             classification_default_threshold_path = os.path.join(
                     classification_model_path,
-                    'assets.extra/default_threshold.arg')
+                    'assets.extra', 'default_threshold.arg')
             if not os.path.isfile(classification_default_threshold_path):
                 raise Exception(("Unknown default threshold for classification "
                     "model; the model should have a "
@@ -143,7 +143,7 @@ def __init__(self, classification_model_path, regression_model_path,
             # Read default threshold
             regression_default_threshold_path = os.path.join(
                     regression_model_path,
-                    'assets.extra/default_threshold.arg')
+                    'assets.extra', 'default_threshold.arg')
             if not os.path.isfile(regression_default_threshold_path):
                 raise Exception(("Unknown default threshold for regression "
                     "model; the model should have a "

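The hunks above pass each path component to `os.path.join` separately instead of embedding a `/` in one string. The commit does not state its motivation. One plausible reading, shown in the sketch below, is that separate components let the join pick the platform's separator consistently. The sketch uses the standard-library `ntpath` and `posixpath` modules to show both behaviors on any machine; the `models` directory name is a placeholder.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules, importable on any platform

# Embedded '/' inside one component: Windows rules leave it untouched
embedded_win = ntpath.join('models', 'assets.extra/context_nt.arg')
# Separate components: every separator is chosen by the join itself
split_win = ntpath.join('models', 'assets.extra', 'context_nt.arg')

embedded_posix = posixpath.join('models', 'assets.extra/context_nt.arg')
split_posix = posixpath.join('models', 'assets.extra', 'context_nt.arg')

print(embedded_win)    # models\assets.extra/context_nt.arg
print(split_win)       # models\assets.extra\context_nt.arg
print(embedded_posix)  # models/assets.extra/context_nt.arg
print(split_posix)     # models/assets.extra/context_nt.arg
```

Under POSIX rules the two forms give the same string. Under Windows rules only the split form gives uniform separators.
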
diff --git a/adapt/utils/tests/test_predict_activity.py b/adapt/utils/tests/test_predict_activity.py
@@ -4,9 +4,11 @@
 import unittest
 
 import numpy as np
+import os
 
 from adapt import alignment
 from adapt.utils import predict_activity
+from adapt.utils.version import get_project_path, get_latest_model_version
 
 __author__ = 'Hayden Metsky <hayden@mit.edu>'
 
@@ -17,11 +19,15 @@ class TestPredictor(unittest.TestCase):
 
     def setUp(self):
         # Use the provided models with default thresholds
-        classification_model_path = 'models/classify/model-51373185'
-        regression_model_path = 'models/regress/model-f8b6fd5d'
-        self.predictor = predict_activity.Predictor(
-                classification_model_path,
-                regression_model_path)
+        dir_path = get_project_path()
+        cla_path_all = os.path.join(dir_path, 'models', 'classify', 'cas13a')
+        reg_path_all = os.path.join(dir_path, 'models', 'regress', 'cas13a')
+        cla_version = get_latest_model_version(cla_path_all)
+        reg_version = get_latest_model_version(reg_path_all)
+        cla_path = os.path.join(cla_path_all, cla_version)
+        reg_path = os.path.join(reg_path_all, reg_version)
+
+        self.predictor = predict_activity.Predictor(cla_path, reg_path)
 
     def test_model_input_from_nt(self):
         # Make context (both ends) be all 'A'
@@ -69,7 +75,7 @@ def test_classify_and_decide(self):
         target_with_context_2 = ('A'*self.predictor.context_nt +
                 'G'*28 + 'A'*self.predictor.context_nt)
         guide_2 = 'G'*28
-        
+
         pairs = [(target_with_context_1, guide_1), (target_with_context_2,
             guide_2)]
         pairs_onehot = self.predictor._model_input_from_nt(pairs)
@@ -91,7 +97,7 @@ def test_regress(self):
         target_with_context_2 = ('A'*self.predictor.context_nt +
                 'A'*28 + 'A'*self.predictor.context_nt)
         guide_2 = 'A'*28
-        
+
         pairs = [(target_with_context_1, guide_1), (target_with_context_2,
             guide_2)]
         pairs_onehot = self.predictor._model_input_from_nt(pairs)

diff --git a/adapt/utils/version.py b/adapt/utils/version.py
@@ -23,17 +23,17 @@
 
 
 def get_project_path():
-    """Determine absolute path to the top-level of the catch project.
+    """Determine absolute path to the top-level of the project.
     This is assumed to be the parent of the directory containing this script.
     Returns:
-        path (string) to top-level of the catch project
+        path (string) to top-level of the project
     """
     # abspath converts relative to absolute path; expanduser interprets ~
     path = __file__  # path to this script
     path = os.path.expanduser(path)  # interpret ~
     path = os.path.abspath(path)  # convert to absolute path
     path = os.path.dirname(path)  # containing directory: utils
-    path = os.path.dirname(path)  # containing directory: catch project dir
+    path = os.path.dirname(path)  # containing directory: project dir
     return path
 
 
@@ -118,6 +118,38 @@ def get_version():
     return __version__
 
 
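+def get_latest_model_version(model_path):
+    """Get the latest model version in a directory of versioned models.
+
+    Version entries are named 'v' followed by underscore-separated
+    integers (e.g., 'v1_0'); all other entries are ignored.
+
+    Args:
+        model_path: path to a directory containing the versioned models
+
+    Returns:
+        version string (e.g., 'v1_0') of the latest model
+
+    Raises:
+        ValueError: if no appropriately named version is found
+    """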
+    # List all model versions in path
+    model_versions = os.listdir(model_path)
+    # Get a list of the versions
+    # Each version is represented as a list of numbers
+    model_versions_numeric = []
+    for model_version in model_versions:
+        if model_version.startswith('v'):
+            model_version_numeric = []
+            skip = False
+            for i in model_version[1:].split('_'):
+                if not i.isdecimal():
+                    skip = True
+                    break
+                else:
+                    model_version_numeric.append(int(i))
+            if not skip and len(model_version_numeric) > 0:
+                model_versions_numeric.append(model_version_numeric)
+
+    # If there were no models found on the path, raise an error
+    if len(model_versions_numeric) == 0:
+        raise ValueError("There are no appropriately formatted models in the "
+            "model path. Please make sure the models are in a folder with the "
+            "format 'v#_#' (for example, 'v1_0')")
+
+    # Remake the version string
+    latest_version = [str(i) for i in sorted(model_versions_numeric)[-1]]
+    return 'v' + '_'.join(latest_version)
+
+
 if __name__ == "__main__":
     # Determine and print the package version
     print(get_version())
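
The version selection in `get_latest_model_version` compares versions as lists of integers, not as strings. The sketch below restates that logic as a pure function over a list of names, so it can be tried without a models directory. The name `latest_version` and the sample directory names are hypothetical.

```python
def latest_version(names):
    """Return the highest 'v#_#'-style name, comparing parts numerically."""
    parsed = []
    for name in names:
        if not name.startswith('v'):
            continue
        parts = name[1:].split('_')
        # Skip names such as 'v_1_0' or 'vfinal' whose parts are not integers
        if all(part.isdecimal() for part in parts):
            parsed.append([int(part) for part in parts])
    if not parsed:
        raise ValueError("no names of the form 'v#_#' were found")
    # Lists of ints compare element by element, so [1, 10] > [1, 9]
    return 'v' + '_'.join(str(number) for number in max(parsed))


names = ['v1_2', 'v1_10', 'v1_9', 'notes.txt', 'v_1_0']
print(latest_version(names))                               # v1_10
print(max(n for n in names if n.startswith('v1')))         # v1_9 (string order)
```

A plain string comparison would rank `v1_9` above `v1_10`, which is why each part is converted to an integer first. The example also shows that a name like `v_1_0` is skipped, because the text after `v` starts with an empty part.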