-
Notifications
You must be signed in to change notification settings - Fork 6
Picky Schematics and Internals
Picky takes .fastq input and the pipeline is composed of three steps:
- aligning nanopore reads to the reference genome
- picking the best alignments, and
- calling structural variants.
As always, the devils are in the details.
Picky uses the LAST aligner (v755) to produce all High-scoring Segment Pairs (HSPs) of each nanopore read against the human genome (hg19). HSP is defined as A High-scoring Segment Pair (HSP) is a local alignment with no (large) gaps that achieves one of the highest alignment scores in a given search in in BLAST glossary. For high sensitivity, we adopted the scoring scheme (reward = 1, penalty = -1, gap open = 0, gap extension = 2) used in NCBI megaBLAST.
NOTE: For ONT 1D read technology, the recommended scoring scheme is reward = 3, penalty = -1, gap open = 2, and gap extension = 1 supported by NCBI Blastn.
NOTE: If you are obsess with squeezing out the last ounce of base information, you will have to carefully evaluate your sequencing data characteristics. There are numerous variables that subtlely impact the scoring scheme.
Picky (command: selectRep) produces the read alignment by stitching the (HSP) segments from LAST with a greedy seed-and-extend strategy to maximize the coverage of the read by the selected (co-linear read) segments.
- Spurious aligned segments (%Identity<55 or EG2>1.7e-12) are discarded.
- The remaining segments were ranked according to probability of random hit and alignment score.
- Picky selectRep then performs the seed-and-extend process by
- selecting as seed candidate alignment a highest ranked segment among the remaining segments,
- linking this seed candidate alignment with the remaining segments whose read coordinates were in the vicinity of the candidate seed alignment read coordinates.
- The linked segments produced a linked alignment extension, equivalent to a read alignment, when its total coverage spanned ≥70% of the read.
- Step 3.i-3.iii are repeated until all segments have been selected as seed candidate alignment.
- Linked alignment extensions with a combined score within 90% of the best combined score are considered putative read alignments.
The percentages of matched bases ('=') over the total bases involved in the matched ('='), mismatched ('X'), inserted ('I') and deleted ('D') bases of the CIGAR of the HSP. See section 1.4.6 on CIGAR defintion in Sequence Alignment/Map Format Specification.
Early 2D-read technology has %identity as low as 65%. Empirical observation found that HSPs that can be stitched into a meaningful read alignment tends to be stay within +/-8%. We thus use a simpler number "10%", giving the lowest bound of 55% (,i.e. 65%-10%).
We have observed this metric has improved over time consistently, but we have not tighten the lower bound. This does not impact the results accuracy but does mean that we are not saving on potential reduction in computation resources utilized.
Spurioius alignment tends to be insignificant. The likelihood of the alignment with a specific score by chance is commonly represented by the E-value in NCBI Blast. LAST indicates this by both EG2 and E.
-
EG2 is defined by LAST as the expected number of alignments with greater or equal score, between two randomly-shuffled sequences of length 1 billion (GBases) each.
-
LAST defines E as the expected number of alignments with greater or equal score, between: a random sequence with the same length as the query sequence, and a random sequence with the same length as the database.
We use EG2 as it is much easier to compute and a stricter threshold than E. Specifically, the thresold EG2>1.74743e-12 is equivalent to setting the option -D with a value of 1e20. This translates to "one Hundred quintillion query letters per random alignment."
An alternative more concrete view to this is a minimum alignment score of 82 is needed or the HSP segment is considered random.
The general process calls for a combinatorial algorithm which becomes computational impractical if naively executed. Consequently, a number of 'heuristic' have been adopted.
We observed that significant read alignment (linked alignment extensions) always contain segments from top few most significant HSP segments. In addition, the segments in the read alignment share similar %identity.
Translating these observations into implementation, we first identify the best segment (lowest EG2 (denoted EG2best) and highest score) and use its %identity (denoted %identitybest) to retain HSP segments worth exploring. Specifically, HSP segments satisfaying the following are retained:
- absolute(%identity(HSP segment) - %identitybest) <= 10%, and
- EG2(HSP segment) < 1e-49 (i.e. equivalent to minimal alignment score of 189)
Furthermore, rather than assuming reference segment colinearity, we place this assumption on the read segments. This relaxes the constraints that other aligners have which work well for a "normal" genome, but is likely too strong for a mess up cancer genome. With the emphasis on read coordinate, we group segments into representative segment groups based on their read coordinates to speed up processing. Specifically, a representative segment group may contain three subgroups, namely ‘exact’, ‘subset’, ‘similar’.
- Segments in subgroup ‘exact’ share the exact read start and end coordinates although their genomic coordinates differ.
- Subgroup ‘subset’ contains segment where its read coordinates are just a sub-span of the representative segment for the segment group.
- Members of subgroup ‘similar’ have 95% reciprocal overlap and %identity difference <= 5 with respect to the group’s representative segment.
Even with compressed filtered segment groups, performing the combinatorial seed-and-extend search for read alignment can be slow. Thus, instead of exploring linking with every segment groups as seeds, we designate the segment groups whose EG2=EG2best as the seeds because linked alignment extension generated with them are most likely to be significant.
When there are multiple seeds, i.e. multiple segment groups with the same EG2 values, the seeds are most often part of the high scoring linked alignment extension(s). It is logical that segments that constitute a significant linked alignment extension should all be likely significant. To reduce computation resource, Picky performs the seed-and-extend with the top 200 of the seeds whenever there 5 or more seeds.
Conversely, with limited number of seeds, Picky performs the seed-and-extend with the seed(s) and some bins of the non-seed segment groups. The segment groups are binned as follow:
- Bin #1: segment groups with EG2 = 0 (alignment score>=925)
- Bin #2: segment groups with 0 < EG2 <= 1e-300 (alignment score>=905)
- Bin #3: segment groups with 1e-300 < EG2 <= 1e-200 (alignment score>=620)
- Bin #4: segment groups with 1e-200 < EG2 <= 1e-100 (alignment score>=334)
- Bin #5: segment groups with 1e-100 < EG2 <= 1e-50 (alignment score>=191)
Bin #1 is always considered in the linking, with the successive inclusion of the 4 remaining segment group bins (i.e. Bin #2, #3, #4 and finally #5), as long as the total number of non-seed segment groups do not exceed 100. The limit of 100 is abitrary to put a bound on the compute time but has worked for us.
For a segment group to be linked to the seed or linked alignment extension, it must extend the current read coverage by at least 200 bases, and the overlap between the extender and the current combined segments must account for < 50% of each length. Both criteria aim to reduce the number of linking iterations needed to span the maximal read coverage.
A linked alignment extension has to cover at least 70% of the read, i.e. 70% of the read has to be aligned back the to reference genome, for the read alignment to be considered valid.
At the end of the seed-and-extend process, there will be various possible read alignments scenarios based on the implementation choices above. To better understand some of the output and possibly the code, you likely have to get acquitance with the finer granularity of read classification based on its linked alignment extension(s).
- There is no linked alignment extension reported, and the read is classified as "without candidate".
- Multiple linked alignment extensions are reported, and the read is classified as "multi-candidates" (MC) read.
- Only a single linked alignment extension is reported, and the read is classified as "single-candidate" (SC) read. SC reads can be further sub-classified as follow:
- When the SC read has only a single HSP segment, it is also known as "single-candidate-single-fragment" (SCSF) reads. SCSF reads may record small variants but we do not explore them currently.
- In the scenarios that there are linked segments, it is also know as "single-candidate-multiple-fragments" (SCMF) reads. SCMF read (aka split-read) could record putative SVs.
- When there is only a single member in every segment group of SCMF, we call this read a "single-candidate-multiple-fragments-single-locus" (SCMFSL) read.
- Conversely, if any segment group of the SCMF has multiple members, we call this read a "single-candidate-multiple-fragments-multiple-loci" (SCMFML) read.
Finally, SV calling was performed with Picky’s callSV command. Read with a single putative read alignment with linked segment(s), known as split-read, contains putative SV.
Recall that a read alignment is a linked alignment extension containing (HSP) segment(s) that are colinear on the read coordinate. Let us denote High-scoring Segment Pair (HSP) i between the read segment and the reference segment as Qi and Si respectively. Thus, a linked alignment extensions with 3 segments will have 3 HSPs indicated by Q1:S1, Q2:S2 and Q3:S3.
Each segment span is denoted by (start,end] as per UCSC 0-start, half-open coordinate system and we define the followings:
- sDiffi,i+1 as the distance between reference segment Si and Si+1 = Si+1(start)-Si(end)
- qDiffi,i+1 as the distance between read segment Qi and Qi+1 = Qi+1(start)-Qi(end)
NOTE: sDiffi,i+1 and qDiffi,i+1 can take on negative values besides the obvious zero and positives. Specifically, a positive value indicates a 'gap'/unmapped region (read)/deleted region (reference), a zero implies that the two segments i and i+1 are bookended, and a negative value indicates that the two segments i and i+1 overlap (e.g. tandem duplication).
Picky computed the distances between adjacent pair of segments in a split-read in both the reference genome-coordinate (sDiff) and the read-coordinates (qDiff). The distances sDiff and qDiff along with chromosome and alignment strand was used to detect the SV present as below.