Skip to content

Commit

Permalink
Merge pull request #14 from broadinstitute/custom-covg-fn
Browse files Browse the repository at this point in the history
Expose option to specify a custom hybridization model
  • Loading branch information
haydenm authored Jul 5, 2018
2 parents fe63b86 + b8ef9f2 commit 544bce7
Show file tree
Hide file tree
Showing 11 changed files with 347 additions and 26 deletions.
6 changes: 5 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ design.py [dataset] [dataset ...]
Each `dataset` can be a path to a FASTA file. If you [downloaded](#downloading-viral-sequence-data) viral sequence data, it can also simply be a label for one of [350+ viral datasets](./catch/datasets/README.md) (e.g., `hiv1` or `zika`) distributed as part of this package.
Each of these datasets includes all available whole genomes (genome neighbors) in [NCBI's viral genome data](https://www.ncbi.nlm.nih.gov/genome/viruses/) that have human as a host, for one or more species, as of Sep. 2017.

Below are some commonly used arguments to `design.py`:
Below is a summary of some useful arguments to `design.py`:

* `-pl PROBE_LENGTH`/`-ps PROBE_STRIDE`: Design probes to be PROBE_LENGTH nt long, and generate candidate probes using a stride of PROBE_STRIDE nt.
(Default: 100 and 50.)
Expand All @@ -116,6 +116,10 @@ Probes are designed such that each `dataset` should be captured by probes that a
* `--add-adapters`: Add PCR adapters to the ends of each probe sequence.
This selects adapters to add to probe sequences so as to minimize overlap among probes that share an adapter, allowing probes with the same adapter to be amplified together.
(See `--adapter-a` and `--adapter-b` too.)
* `--custom-hybridization-fn PATH FN`: Specify a function, for CATCH to dynamically load, that implements a custom model of hybridization between a probe and target sequence.
See `design.py --help` for details on the expected input and output of this function.
If not set, CATCH uses its default model of hybridization based on `-m/--mismatches`, `-l/--lcf-thres`, and `--island-of-exact-match`.
(Relatedly, see `--custom-hybridization-fn-tolerant`.)
* `--filter-with-lsh-hamming FILTER_WITH_LSH_HAMMING`/`--filter-with-lsh-minhash FILTER_WITH_LSH_MINHASH`: Use locality-sensitive hashing to reduce the space of candidate probes.
This can significantly improve runtime and memory requirements when the input is especially large and diverse.
See `design.py --help` for details on using these options and downsides.
Expand Down
62 changes: 61 additions & 1 deletion bin/design.py
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,16 @@ def main(args):
"--adapter-a and --adapter-b, but --add-adapters is required "
"to add adapter sequences onto the ends of probes"))

# Check for whether a custom hybridization function was provided
if args.custom_hybridization_fn:
custom_cover_range_fn = tuple(args.custom_hybridization_fn)
else:
custom_cover_range_fn = None
if args.custom_hybridization_fn_tolerant:
custom_cover_range_tolerant_fn = tuple(args.custom_hybridization_fn_tolerant)
else:
custom_cover_range_tolerant_fn = None

# Setup the filters
# The filters we use are, in order:
filters = []
Expand Down Expand Up @@ -173,6 +183,8 @@ def main(args):
mismatches_tolerant=args.mismatches_tolerant,
lcf_thres_tolerant=args.lcf_thres_tolerant,
island_of_exact_match_tolerant=args.island_of_exact_match_tolerant,
custom_cover_range_fn=custom_cover_range_fn,
custom_cover_range_tolerant_fn=custom_cover_range_tolerant_fn,
identify=args.identify,
blacklisted_genomes=blacklisted_genomes_fasta,
coverage=args.coverage,
Expand Down Expand Up @@ -200,7 +212,9 @@ def main(args):
mismatches=args.mismatches,
lcf_thres=args.lcf_thres,
island_of_exact_match=\
args.island_of_exact_match)
args.island_of_exact_match,
custom_cover_range_fn=\
custom_cover_range_fn)
filters += [af]

# [Optional]
Expand Down Expand Up @@ -241,6 +255,7 @@ def main(args):
genomes_grouped,
genomes_grouped_names,
island_of_exact_match=args.island_of_exact_match,
custom_cover_range_fn=custom_cover_range_fn,
cover_extension=args.cover_extension,
rc_too=args.add_reverse_complements)
analyzer.run()
Expand Down Expand Up @@ -303,6 +318,43 @@ def main(args):
"MATCH nt between a portion of the probe and a portion "
"of the sequence"))

# Custom function (dynamically loaded) to determine probe hybridization
# When set, this makes values of the above arguments (--mismatches,
# --lcf-thres, and --island-of-exact-match) meaningless
parser.add_argument('--custom-hybridization-fn',
nargs=2,
help=("(Optional) Args: <PATH> <FUNC>; PATH is a path to a Python "
"module (.py file) and FUNC is a string giving the name of "
"a function in that module. FUNC provides a custom model of "
"hybridization between a probe and target sequence to use in "
"the probe set design. If this is set, the arguments "
"--mismatches, --lcf-thres, and --island-of-exact-match are "
"not used because these are meant for the default model of "
"hybridization. The function FUNC in PATH is dynamically "
"loaded to use when determining whether a probe hybridizes to "
"a target sequence (and, if so, what portion). FUNC must "
"accept the following arguments in order, though it "
"may choose to ignore some values: (1) array giving sequence "
"of a probe; (2) str giving subsequence of target sequence to "
"which the probe may hybridize, of the same length as the "
"given probe sequence; (3) int giving the position in the "
"probe (equivalently, the target subsequence) of the start "
"of a k-mer around which the probe and target subsequence "
"are anchored (the probe and target subsequence are aligned "
"using this k-mer as an anchor); (4) int giving the end "
"position (exclusive) of the anchor k-mer; (5) int giving the "
"full length of the probe (the probe provided in (1) may be "
"cutoff on an end if it extends further than where the "
"target sequence ends); (6) int giving the full length of the "
"target sequence of which the subsequence in (2) is part. "
"FUNC must return None if it deems that the probe does not "
"hybridize to the target subsequence; otherwise, it must "
"return a tuple (start, end) where start is an int giving "
"the start position in the probe (equivalently, in the "
"target subsequence) at which the probe will hybridize to "
"the target subsequence, and end is an int (exclusive) giving "
"the end position of the hybridization."))

# Desired coverage of target genomes
def check_coverage(val):
fval = float(val)
Expand Down Expand Up @@ -378,6 +430,14 @@ def check_coverage(val):
"possible hybridizations (i.e., more sensitivity) "
"when designing probes for identification or when "
"genomes are blacklisted."))
parser.add_argument('--custom-hybridization-fn-tolerant',
nargs=2,
help=("(Optional) A more tolerant model than the one "
"implemented in custom_hybridization_fn. This should capture "
"more possible hybridizations (i.e., be more sensitive) "
"when designing probes for identification or when genomes "
"are blacklisted. See --custom-hybridization-fn for details "
"of how this function should be implemented and provided."))

# Outputting probe sequences and coverage analyses
parser.add_argument('-o', '--output-probes',
Expand Down
39 changes: 34 additions & 5 deletions catch/coverage_analysis.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@
import numpy as np

from catch import probe
from catch.utils import dynamic_load
from catch.utils import interval
from catch.utils import pretty_print

Expand All @@ -79,6 +80,7 @@ def __init__(self,
target_genomes,
target_genomes_names=None,
island_of_exact_match=0,
custom_cover_range_fn=None,
cover_extension=0,
kmer_probe_map_k=10,
rc_too=True):
Expand All @@ -100,6 +102,18 @@ def __init__(self,
island_of_exact_match: for a probe to hybridize to a sequence,
require that there be an exact match of length at least
'island_of_exact_match'
custom_cover_range_fn: if set, tuple (path, fn) where path gives
a path to a Python module and fn gives the name of a function
in that module. This function is dynamically loaded and used
to determine whether a probe will hybridize to a region of
target sequence (and what portion will hybridize). The
function must accept the same arguments as the function
returned by
probe.probe_covers_sequence_by_longest_common_substring()
and return the same value. When set, the parameters
'mismatches', 'lcf_thres', and 'island_of_exact_match'
are ignored (even if their values are default values)
because they are only used in the default cover_range_fn.
cover_extension: number of bp by which to extend the coverage on
each side of a probe; a probe "covers" the portion of the
sequence that it hybridizes to, as well as 'cover_extension'
Expand All @@ -121,11 +135,26 @@ def __init__(self,
self.target_genomes_names = ["Group %d" % i
for i in range(len(target_genomes))]

self.mismatches = mismatches
self.lcf_thres = lcf_thres
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches, lcf_thres, island_of_exact_match)
if custom_cover_range_fn is not None:
# Use a custom function to determine whether a probe hybridizes
# to a region of target sequence (and what part hybridizes),
# rather than the default model. Ignore the given values for
# mismatches and lcf_thres (which may be default values) because
# these are only relevant for the default model
self.mismatches, self.lcf_thres = None, None

# Dynamically load the function
fn_path, fn_name = custom_cover_range_fn
self.cover_range_fn = dynamic_load.load_function_from_path(
fn_path, fn_name)
else:
self.mismatches = mismatches
self.lcf_thres = lcf_thres
# Construct a function using the default model of hybridization
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches, lcf_thres, island_of_exact_match)

self.cover_extension = cover_extension
self.kmer_probe_map_k = kmer_probe_map_k
self.rc_too = rc_too
Expand Down
41 changes: 35 additions & 6 deletions catch/filter/adapter_filter.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@

from catch.filter.base_filter import BaseFilter
from catch import probe
from catch.utils import dynamic_load
from catch.utils import interval

__author__ = 'Hayden Metsky <hayden@mit.edu>'
Expand All @@ -126,6 +127,7 @@ def __init__(self,
mismatches,
lcf_thres,
island_of_exact_match=0,
custom_cover_range_fn=None,
kmer_probe_map_k=20):
"""
Args:
Expand All @@ -141,6 +143,18 @@ def __init__(self,
island_of_exact_match: for a probe to hybridize to a sequence,
require that there be an exact match of length at least
'island_of_exact_match'
custom_cover_range_fn: if set, tuple (path, fn) where path gives
a path to a Python module and fn gives the name of a function
in that module. This function is dynamically loaded and used
to determine whether a probe will hybridize to a region of
target sequence (and what portion will hybridize). The
function must accept the same arguments as the function
returned by
probe.probe_covers_sequence_by_longest_common_substring()
and return the same value. When set, the parameters
'mismatches', 'lcf_thres', and 'island_of_exact_match'
are ignored (even if their values are default values)
because they are only used in the default cover_range_fn
kmer_probe_map_k: in calls to probe.construct_kmer_probe_map...,
uses this value as min_k and k
"""
Expand All @@ -151,12 +165,27 @@ def __init__(self,

self.adapter_a_5end, self.adapter_a_3end = adapter_a
self.adapter_b_5end, self.adapter_b_3end = adapter_b
self.mismatches = mismatches
self.lcf_thres = lcf_thres
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches=mismatches, lcf_thres=lcf_thres,
island_of_exact_match=island_of_exact_match)

if custom_cover_range_fn is not None:
# Use a custom function to determine whether a probe hybridizes
# to a region of target sequence (and what part hybridizes),
# rather than the default model. Ignore the given values for
# mismatches and lcf_thres (which may be default values) because
# these are only relevant for the default model
self.mismatches, self.lcf_thres = None, None

# Dynamically load the function
fn_path, fn_name = custom_cover_range_fn
self.cover_range_fn = dynamic_load.load_function_from_path(
fn_path, fn_name)
else:
self.mismatches = mismatches
self.lcf_thres = lcf_thres
# Construct a function using the default model of hybridization
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches, lcf_thres, island_of_exact_match)

self.kmer_probe_map_k = kmer_probe_map_k

def _votes_in_sequence(self, probes, sequence):
Expand Down
70 changes: 58 additions & 12 deletions catch/filter/set_cover_filter.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@

from catch.filter.base_filter import BaseFilter
from catch import probe
from catch.utils import dynamic_load
from catch.utils import interval
from catch.utils import seq_io
from catch.utils import set_cover
Expand All @@ -69,6 +70,8 @@ def __init__(self,
mismatches_tolerant=None,
lcf_thres_tolerant=None,
island_of_exact_match_tolerant=None,
custom_cover_range_fn=None,
custom_cover_range_tolerant_fn=None,
identify=False,
blacklisted_genomes=[],
coverage=1.0,
Expand Down Expand Up @@ -100,6 +103,21 @@ def __init__(self,
to 'island_of_exact_match'. It should generally be true
that this value is less than 'island_of_exact_match'. Used
when mismatches_tolerant/lcf_thres_tolerant are used.
custom_cover_range_fn: if set, tuple (path, fn) where path gives
a path to a Python module and fn gives the name of a function
in that module. This function is dynamically loaded and used
to determine whether a probe will hybridize to a region of
target sequence (and what portion will hybridize). The
function must accept the same arguments as the function
returned by
probe.probe_covers_sequence_by_longest_common_substring()
and return the same value. When set, the parameters
'mismatches', 'lcf_thres', and 'island_of_exact_match'
are ignored (even if their values are default values)
because they are only used in the default cover_range_fn.
custom_cover_range_tolerant_fn: same as custom_cover_range_fn,
but with more tolerance for hybridization; likewise, the
_tolerant parameters are ignored when this is set.
identify: when True, indicates that probes should be designed
with the identification option enabled (default is False)
blacklisted_genomes: list of paths to FASTA files of genomes
Expand Down Expand Up @@ -144,25 +162,53 @@ def __init__(self,
depending on the input this can result in considerably
more memory use but may give an improvement in runtime
"""
self.mismatches = mismatches
self.lcf_thres = lcf_thres
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches, lcf_thres, island_of_exact_match)
if custom_cover_range_fn is not None:
# Use a custom function to determine whether a probe hybridizes
# to a region of target sequence (and what part hybridizes),
# rather than the default model. Ignore the given values for
# mismatches and lcf_thres (which may be default values) because
# these are only relevant for the default model
self.mismatches, self.lcf_thres = None, None

# Dynamically load the function
fn_path, fn_name = custom_cover_range_fn
self.cover_range_fn = dynamic_load.load_function_from_path(
fn_path, fn_name)
else:
self.mismatches = mismatches
self.lcf_thres = lcf_thres
# Construct a function using the default model of hybridization
self.cover_range_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches, lcf_thres, island_of_exact_match)

if not mismatches_tolerant:
mismatches_tolerant = mismatches
if not lcf_thres_tolerant:
lcf_thres_tolerant = lcf_thres
if not island_of_exact_match_tolerant:
island_of_exact_match_tolerant = island_of_exact_match
self.mismatches_tolerant = mismatches_tolerant
self.lcf_thres_tolerant = lcf_thres_tolerant
self.island_of_exact_match_tolerant = island_of_exact_match_tolerant
self.cover_range_tolerant_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches_tolerant, lcf_thres_tolerant,
island_of_exact_match_tolerant)
if custom_cover_range_tolerant_fn is not None:
# Use a custom function to determine, with more tolerance,
# whether a probe hybridizes to a region of target sequence (and
# what part hybridizes), rather than the default model. Ignore
# the given values of mismatches_tolerant and lcf_thres_tolerant
# (which may be default values) because these are only relevant for
# the default model
self.mismatches_tolerant, self.lcf_thres_tolerant = None, None

# Dynamically load the function
fn_path, fn_name = custom_cover_range_tolerant_fn
self.cover_range_tolerant_fn = dynamic_load.load_function_from_path(
fn_path, fn_name)
else:
self.mismatches_tolerant = mismatches_tolerant
self.lcf_thres_tolerant = lcf_thres_tolerant
# Construct a function using the default model of hybridization
self.cover_range_tolerant_fn = \
probe.probe_covers_sequence_by_longest_common_substring(
mismatches_tolerant, lcf_thres_tolerant,
island_of_exact_match_tolerant)

# Warn if identification is enabled but the coverage is high
if identify:
Expand Down
Loading

0 comments on commit 544bce7

Please sign in to comment.