-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite complex SV functional annotation in SVAnnotate #8516
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @epiercehoffman, great work here. As we discussed in person, there's a few implementation details I'd suggest mainly for readability and maintainability.
} | ||
if (isComplex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
if (isComplex) { | |
} else if (isComplex) { |
Just a little clearer
* @param transcript - protein-coding GTF transcript | ||
* @param variantConsequenceDict - running map of consequence -> feature name for variant to update | ||
*/ | ||
@VisibleForTesting | ||
protected void annotateTranscript(final SimpleInterval variantInterval, | ||
final GATKSVVCFConstants.StructuralVariantAnnotationType svType, | ||
final boolean isComplex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am realizing that there is a slight semantic issue with StructuralVariantAnnotationType
since it now contains the CPX
type. It is then somewhat confusing to have the isComplex
parameter.
I think the issue stems from the fact that this method is intended for use on an SVSegment
rather than a full variant. One option is to create, say, a SVSegmentAnnotationType
that is the same as StructuralVariantAnnotationType
but omits CPX
, but that's probably unnecessary.
I think at least some of the variable names should change here, variantInterval
-> segmentInterval
, svType
-> segmentType
, variantConsequenceDict
-> segmentConsequenceDict
, and perhaps even annotateTranscript
-> annotateSegmentTranscript
. Also updating the documentation to make this clear would help.
It's not hugely a concern since this is intended to be a private method, but this would improve the readability I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this and I'm not sure of the optimal names here.
- I agree that the
StructuralVariantAnnotationType
is being overused here, although I agree that it may not be worthwhile to create two different types with and without CPX. Maybe if it was justStructuralVariantType
for a more neutral category - especially since it is getting used in other tools now? - I think
variantInterval
is not inaccurate, since it is an interval of an SV / an interval that varies compared to the reference, and it is more descriptive thansegmentInterval
when distinguishing between a feature interval and a variant interval. But it is true that only one segment is considered at a time - which is clear because thevariantInterval
is a single interval rather than a list. Maybe best would bevariantSegmentInterval
? Or some documentation changes? variantConsequenceDict
is the correct name because it accumulates consequences across SV segments- True,
annotateGeneOverlaps
andannotateTranscript
are only applied to one segment at a time. But the method names focus on the features from the GTF rather than saying they are applied to a variant vs. a segment. Perhaps just a docs change here?
* @return - SimpleInterval representing the portion of primaryInterval not overlapped by secondaryInterval | ||
*/ | ||
@VisibleForTesting | ||
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add this to the SimpleInterval
class? Make is similar to the spanWith()
method. Let's call it extendWithSpan()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started doing this originally but then realized that it's not super generalizable because I made a lot of assumptions about the intervals. I could still add it but would need to enforce the secondary interval overlapping but not containing the primary interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider also the name subtractInterval
(a la bedtools).
@VisibleForTesting | ||
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval, | ||
final SimpleInterval secondaryInterval) { | ||
if (primaryInterval.getStart() < secondaryInterval.getStart()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check that the contigs are the same.
final List<String> cpxIntervalsString = variant.getAttributeAsStringList(GATKSVVCFConstants.CPX_INTERVALS, null); | ||
if (cpxIntervalsString == null) { | ||
final List<String> cpxIntervals = variant.getAttributeAsStringList(GATKSVVCFConstants.CPX_INTERVALS, null); | ||
if (cpxIntervals == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recently I learned that the defaultValue
parameter in getAttributeAsStringList()
is used as the default for list items that are null (i.e. .
), but the method always returns a List
. I'm not aware of a case where we expect an empty list or contains a null entry, so I think what you want to check is if cpxIntervals.isEmpty()
or it contains null
.
} | ||
// save INV interval to adjust later for dupINV / INVdup / dupINVdup / dupINVdel / delINVdup | ||
else if (complexType.contains("INV") && complexType.contains("dup")) { | ||
inversionIntervalToAdjust = new SimpleInterval(interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to call new
here?
@@ -620,7 +678,46 @@ protected static List<SVSegment> getSVSegments(final VariantContext variant, | |||
return intervals; | |||
} | |||
|
|||
/** | |||
* Update list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Update list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments | |
* Returns a new subsetted list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments |
IMO, "Updates" implies in-place modification of the input list
final List<SVSegment> updatedSegments = new ArrayList<>(svSegments.size()); | ||
for (final SVSegment svSegment : svSegments) { | ||
if (svSegment.getIntervalSVType() != GATKSVVCFConstants.StructuralVariantAnnotationType.DUP) { | ||
updatedSegments.add(svSegment); | ||
} | ||
} | ||
return updatedSegments; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be done in one line with a stream:
return svSegments.stream().filter(seg -> seg.getIntervalSVType() != GATKSVVCFConstants.StructuralVariantAnnotationType.DUP).collect(Collectors.toList());
|
||
/** | ||
* Update list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Update list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed. | |
* Returns a new subsetted list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed. |
protected static List<SVSegment> getComplexAnnotationIntervals(final List<String> cpxIntervals, | ||
final String complexType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more maintainable to have separate parsing and processing methods for this. Define inner classes ComplexInterval
and ComplexType
that are created in the parsing method(s) and then pass them in here. Ideally you would make use of the GATKSVVCFConstants.ComplexVariantSubtype
type for this.
Doing so is a bit more "brittle," but it is safer to be explicitly checking and representing the input this way. Also it avoids the string matching, which can be prone to bugs.
* use constants for CPX types (add INS_iDEL, CTX_PP_QQ, CTX_PQ_QP) * add and improve comments * parse and process CPX_INTERVALS in separate functions * ignore INS interval in CPX_INTERVALS for INS_iDEL * add unit tests for new cases * ignore PARTIAL_DISPERSED_DUP consequence when checking if SV is intergenic * syntax and style fixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. It took me a while to fully understand what was happening with complex events but I think it all makes sense after thinking about it for a bit. The annotateStructuralVariant
method seemed to rely a bit on mutable state in a way that's a bit confusing, I think. Apart from that I mostly have style/documentation requests.
@@ -123,6 +123,10 @@ | |||
* duplicated. The partial duplication occurs when a duplication has one breakpoint within the transcript and one | |||
* breakpoint after the end of the transcript. When the duplication is in tandem, the result is that there is one | |||
* intact copy of the full endogenous gene.</p></li> | |||
* <li><p><i>PREDICTED_PARTIAL_DISPERSED_DUP</i><br /> | |||
* Gene(s) which are partially overlapped by an SV's dispersed duplication. This annotation is applied to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify this a little more ("an SV's dispersed duplication" could potentially be the insert interval in the mind of some readers I think), I might suggest the wording, "Gene(s) which are partially overlapped by the duplicated segment involved in an SV's dispersed duplication."
@@ -354,6 +358,7 @@ private void addAnnotationInfoKeysToHeader(final VCFHeader header) { | |||
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NONCODING_SPAN, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Class(es) of noncoding elements spanned by SV.")); | |||
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NONCODING_BREAKPOINT, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Class(es) of noncoding elements disrupted by SV breakpoint.")); | |||
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NEAREST_TSS, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Nearest transcription start site to an intergenic variant.")); | |||
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.PARTIAL_DISPERSED_DUP, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Gene(s) overlapped partially by a dispersed duplication in a complex SV.")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I might suggest, "Gene(s) overlapped partially by a the duplicated interval involved in a dispersed duplication event in a complex SV"
final SimpleInterval transcriptInterval = new SimpleInterval(gtfTranscript); | ||
if (variantSpansFeature(variantInterval, transcriptInterval)) { | ||
return GATKSVVCFConstants.COPY_GAIN; | ||
return GATKSVVCFConstants.COPY_GAIN; // return CG immediately because same regardless of isDispersed | ||
} else if (isComplex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there no defined complex event types that include a tandem duplication? It might be good to document that assumption somewhere if you haven't already. You could also consider renaming isComplex
to isDispersedDuplication
also to future proof it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, in our current set of CPX events all DUP segments are dispersed. I added a comment to clarify this. I like the idea of renaming the variable, although I currently only check the SVTYPE - would it be better to check the CPX_TYPE against our current list of DUP-containing CPX events to set an isDispersedDuplication
variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth making this assumption explicit in the code. You could make a method boolean includesDispersedDuplication(ComplexVariantSubtype cpxType)
that returns this. There might be a couple of other places in the code that could use this method (in this class or in SVCallRecordUtils). If we could start making some methods like that eventually we could have ComplexVariant be a full class that knows what the corresponding segments should be and could validate them and answer questions about them.
* @return - SimpleInterval representing the portion of primaryInterval not overlapped by secondaryInterval | ||
*/ | ||
@VisibleForTesting | ||
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider also the name subtractInterval
(a la bedtools).
return new SVSegment(svTypeForInterval, interval); | ||
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval, | ||
final SimpleInterval secondaryInterval) { | ||
if (!primaryInterval.getContig().equals(secondaryInterval.getContig())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it would be worth checking the assumption that the intervals do overlap -- otherwise it'll return results that don't make much sense.
@@ -620,7 +717,49 @@ protected static List<SVSegment> getSVSegments(final VariantContext variant, | |||
return intervals; | |||
} | |||
|
|||
/** | |||
* Returns a new subsetted list of SVSegments to use for promoter & noncoding annotations for complex SVs. | |||
* Removes DUP segments which are never tandem in CPX events |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true only for our current set of defined CPX events, right? Might want to modify this comment to reflect that assumption.
* @return - Subsetted list of SVSegments to use for promoter & noncoding annotations for CPX SVs | ||
*/ | ||
@VisibleForTesting | ||
protected static List<SVSegment> getSegmentsForNonCodingAnnotations(final List<SVSegment> svSegments) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd modify the name of this to reinforce that it's meant for CPX events only. Or alternatively, you might consider making a little helper class (CPXEventAnnotationSegmenter
?) that groups some of these methods that are specific to CPX events: getSegmentsForNonCodingAnnotations
, getSegmentForNearestTSS
, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated these functions to run on all SV types instead of just complex, to go with the other change you suggested below to create new lists of segments for each type of annotation
// annotate nearest TSS for intergenic variants with no promoter overlaps | ||
if (gtfIntervalTrees != null && gtfIntervalTrees.getTranscriptionStartSiteTree() != null && | ||
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && noCodingAnnotations) { | ||
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && isIntergenic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be less confusing to just call the isIntergenic()
method here. It already checks the keys of variantConsquenceDict
so you wouldn't need to check for PROMOTER
, and we don't refer to the isIntergenic
variable prior to this point. You could just get rid of the variable altogether IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that I could move TSS annotations above non-coding annotations then call isIntergenic() here for the same result; however, a variant can be intergenic and still have promoter annotations, in which case it would not get a nearest TSS annotation, and in that case I think it's confusing to reuse the isIntergenic() function to identify a different set of criteria. And I think I need the isIntergenic boolean because I need to store the information about whether there are any coding annotations to use later to apply the INTERGENIC flag. Alternatively I could modify the function to ignore the noncoding annotations, and maybe create two different functions - one that ignores all noncoding annotations and one that ignores noncoding annotations except for promoters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you make the isIntergenic
method smart enough to ignore promoter consequences? It just feels like you are adding state here to work around the fact that the method is only looking at protein coding consequences. Sorry if I am missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated isIntergenic() to check for an absence of protein-coding consequences and ignore other consequences, so that the result is valid as long as all protein-coding consequences have been annotated, even if non-coding annotations have been added. Let me know if that addresses the issue
|
||
// for CPX events, update SV segments to annotate promoter & noncoding consequences | ||
if (overallSVType == GATKSVVCFConstants.StructuralVariantAnnotationType.CPX) { | ||
svSegments = getSegmentsForNonCodingAnnotations(svSegments); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than changing the value of svSegments
I think it would be better to be explicit and make (final
) svSegmentsForGeneOverlaps
, svSegmentsForNonCodingAnnotations
, and svSegmentsForNearestTSS
variables that you set appropriately before each call. Overriding svSegments
before each call seems like too much dependence on mutable state to me.
@@ -581,6 +744,54 @@ public Object[][] getAnnotateStructuralVariantTestData() { | |||
createAttributesMap( | |||
Arrays.asList(GATKSVVCFConstants.PROMOTER, GATKSVVCFConstants.INTERGENIC), | |||
Arrays.asList("EMMA1", true)) }, | |||
// CPX with dDUP CG and promoter annotations | |||
{ createVariantContext("chr1", 10, 11, null, null, null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to have all the CPX types tested here: dupINVdup
, delINVdup
. Also I was wondering about a gene that fully lies within the DUP interval of a CPX event like dupINVdup
-- I think it will work but it'd be nice to see an example tested here where that gets a copy gain annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added dupINVdup with copy gain. There are still 5 complex types not represented in this particular unit test (although all types are represented in testGetSegmentsFromComplexIntervals, a few more are in testAnnotateComplexEvents, and I tried to cover all the interval modifications that are done for noncoding and TSS annotations)
final SimpleInterval transcriptInterval = new SimpleInterval(gtfTranscript); | ||
if (variantSpansFeature(variantInterval, transcriptInterval)) { | ||
return GATKSVVCFConstants.COPY_GAIN; | ||
return GATKSVVCFConstants.COPY_GAIN; // return CG immediately because same regardless of isDispersed | ||
} else if (isComplex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth making this assumption explicit in the code. You could make a method boolean includesDispersedDuplication(ComplexVariantSubtype cpxType)
that returns this. There might be a couple of other places in the code that could use this method (in this class or in SVCallRecordUtils). If we could start making some methods like that eventually we could have ComplexVariant be a full class that knows what the corresponding segments should be and could validate them and answer questions about them.
@VisibleForTesting | ||
protected static boolean isIntergenic(final Map<String,Set<String>> variantConsequenceDict) { | ||
for (final String consequence : variantConsequenceDict.keySet()) { | ||
if (consequence != GATKSVVCFConstants.PARTIAL_DISPERSED_DUP) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be compared with !GATKSVVCFConstants.PARTIAL_DISPERSED_DUP.equals(consequence)
// annotate nearest TSS for intergenic variants with no promoter overlaps | ||
if (gtfIntervalTrees != null && gtfIntervalTrees.getTranscriptionStartSiteTree() != null && | ||
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && noCodingAnnotations) { | ||
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && isIntergenic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you make the isIntergenic
method smart enough to ignore promoter consequences? It just feels like you are adding state here to work around the fact that the method is only looking at protein coding consequences. Sorry if I am missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, looks good now!
Updates SVAnnotate's functional consequence annotation of complex SVs.
Testing