Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite complex SV functional annotation in SVAnnotate #8516

Merged
merged 4 commits into from
Jan 23, 2024

Conversation

epiercehoffman
Copy link
Contributor

Updates SVAnnotate's functional consequence annotation of complex SVs.

  • Introduce PREDICTED_PARTIAL_DISPERSED_DUP annotation to describe dispersed duplications of coding sequence that are not expected to behave the same way as tandem duplications
  • Ignore INV intervals in dDUP events
  • Modify INV intervals in dupINV and similar events to more accurately capture the overall impact of the complex SV
  • Ignore complex DUP segments for promoter, noncoding, and nearest TSS annotations because these DUPs are never in tandem
  • Merge relevant intervals before annotating nearest TSS for complex events containing DELs
  • Update documentation

Testing

  • Add unit test for CPX SV segment determination
  • Add CPX SV unit test cases
  • Update unit/integration test expected outputs
  • All unit & integration tests for SVAnnotate ran successfully

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @epiercehoffman, great work here. As we discussed in person, there's a few implementation details I'd suggest mainly for readability and maintainability.

Comment on lines 224 to 225
}
if (isComplex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
if (isComplex) {
} else if (isComplex) {

Just a little clearer

* @param transcript - protein-coding GTF transcript
* @param variantConsequenceDict - running map of consequence -> feature name for variant to update
*/
@VisibleForTesting
protected void annotateTranscript(final SimpleInterval variantInterval,
final GATKSVVCFConstants.StructuralVariantAnnotationType svType,
final boolean isComplex,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am realizing that there is a slight semantic issue with StructuralVariantAnnotationType since it now contains the CPX type. It is then somewhat confusing to have the isComplex parameter.

I think the issue stems from the fact that this method is intended for use on an SVSegment rather than a full variant. One option is to create, say, a SVSegmentAnnotationType that is the same as StructuralVariantAnnotationType but omits CPX, but that's probably unnecessary.

I think at least some of the variable names should change here, variantInterval -> segmentInterval, svType -> segmentType, variantConsequenceDict -> segmentConsequenceDict, and perhaps even annotateTranscript -> annotateSegmentTranscript. Also updating the documentation to make this clear would help.

It's not hugely a concern since this is intended to be a private method, but this would improve the readability I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this and I'm not sure of the optimal names here.

  • I agree that the StructuralVariantAnnotationType is being overused here, although I agree that it may not be worthwhile to create two different types with and without CPX. Maybe if it was just StructuralVariantType for a more neutral category - especially since it is getting used in other tools now?
  • I think variantInterval is not inaccurate, since it is an interval of an SV / an interval that varies compared to the reference, and it is more descriptive than segmentInterval when distinguishing between a feature interval and a variant interval. But it is true that only one segment is considered at a time - which is clear because the variantInterval is a single interval rather than a list. Maybe best would be variantSegmentInterval? Or some documentation changes?
  • variantConsequenceDict is the correct name because it accumulates consequences across SV segments
  • True, annotateGeneOverlaps and annotateTranscript are only applied to one segment at a time. But the method names focus on the features from the GTF rather than saying they are applied to a variant vs. a segment. Perhaps just a docs change here?

* @return - SimpleInterval representing the portion of primaryInterval not overlapped by secondaryInterval
*/
@VisibleForTesting
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add this to the SimpleInterval class? Make is similar to the spanWith() method. Let's call it extendWithSpan().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started doing this originally but then realized that it's not super generalizable because I made a lot of assumptions about the intervals. I could still add it but would need to enforce the secondary interval overlapping but not containing the primary interval

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider also the name subtractInterval (a la bedtools).

@VisibleForTesting
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval,
final SimpleInterval secondaryInterval) {
if (primaryInterval.getStart() < secondaryInterval.getStart()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check that the contigs are the same.

final List<String> cpxIntervalsString = variant.getAttributeAsStringList(GATKSVVCFConstants.CPX_INTERVALS, null);
if (cpxIntervalsString == null) {
final List<String> cpxIntervals = variant.getAttributeAsStringList(GATKSVVCFConstants.CPX_INTERVALS, null);
if (cpxIntervals == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently I learned that the defaultValue parameter in getAttributeAsStringList() is used as the default for list items that are null (i.e. .), but the method always returns a List. I'm not aware of a case where we expect an empty list or contains a null entry, so I think what you want to check is if cpxIntervals.isEmpty() or it contains null.

}
// save INV interval to adjust later for dupINV / INVdup / dupINVdup / dupINVdel / delINVdup
else if (complexType.contains("INV") && complexType.contains("dup")) {
inversionIntervalToAdjust = new SimpleInterval(interval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to call new here?

@@ -620,7 +678,46 @@ protected static List<SVSegment> getSVSegments(final VariantContext variant,
return intervals;
}

/**
* Update list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Update list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments
* Returns a new subsetted list of SVSegments to use for promoter & noncoding annotations for complex SVs. Removes DUP segments

IMO, "Updates" implies in-place modification of the input list

Comment on lines 689 to 695
final List<SVSegment> updatedSegments = new ArrayList<>(svSegments.size());
for (final SVSegment svSegment : svSegments) {
if (svSegment.getIntervalSVType() != GATKSVVCFConstants.StructuralVariantAnnotationType.DUP) {
updatedSegments.add(svSegment);
}
}
return updatedSegments;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done in one line with a stream:

return svSegments.stream().filter(seg -> seg.getIntervalSVType() != GATKSVVCFConstants.StructuralVariantAnnotationType.DUP).collect(Collectors.toList());


/**
* Update list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Update list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed.
* Returns a new subsetted list of SVSegments to use for nearest TSS annotations for complex SVs. DUP segments are already removed.

Comment on lines 532 to 533
protected static List<SVSegment> getComplexAnnotationIntervals(final List<String> cpxIntervals,
final String complexType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more maintainable to have separate parsing and processing methods for this. Define inner classes ComplexInterval and ComplexType that are created in the parsing method(s) and then pass them in here. Ideally you would make use of the GATKSVVCFConstants.ComplexVariantSubtype type for this.

Doing so is a bit more "brittle," but it is safer to be explicitly checking and representing the input this way. Also it avoids the string matching, which can be prone to bugs.

* use constants for CPX types (add INS_iDEL, CTX_PP_QQ, CTX_PQ_QP)
* add and improve comments
* parse and process CPX_INTERVALS in separate functions
* ignore INS interval in CPX_INTERVALS for INS_iDEL
* add unit tests for new cases
* ignore PARTIAL_DISPERSED_DUP consequence when checking if SV is intergenic
* syntax and style fixes
Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. It took me a while to fully understand what was happening with complex events but I think it all makes sense after thinking about it for a bit. The annotateStructuralVariant method seemed to rely a bit on mutable state in a way that's a bit confusing, I think. Apart from that I mostly have style/documentation requests.

@@ -123,6 +123,10 @@
* duplicated. The partial duplication occurs when a duplication has one breakpoint within the transcript and one
* breakpoint after the end of the transcript. When the duplication is in tandem, the result is that there is one
* intact copy of the full endogenous gene.</p></li>
* <li><p><i>PREDICTED_PARTIAL_DISPERSED_DUP</i><br />
* Gene(s) which are partially overlapped by an SV's dispersed duplication. This annotation is applied to a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify this a little more ("an SV's dispersed duplication" could potentially be the insert interval in the mind of some readers I think), I might suggest the wording, "Gene(s) which are partially overlapped by the duplicated segment involved in an SV's dispersed duplication."

@@ -354,6 +358,7 @@ private void addAnnotationInfoKeysToHeader(final VCFHeader header) {
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NONCODING_SPAN, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Class(es) of noncoding elements spanned by SV."));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NONCODING_BREAKPOINT, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Class(es) of noncoding elements disrupted by SV breakpoint."));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.NEAREST_TSS, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Nearest transcription start site to an intergenic variant."));
header.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.PARTIAL_DISPERSED_DUP, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "Gene(s) overlapped partially by a dispersed duplication in a complex SV."));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I might suggest, "Gene(s) overlapped partially by a the duplicated interval involved in a dispersed duplication event in a complex SV"

final SimpleInterval transcriptInterval = new SimpleInterval(gtfTranscript);
if (variantSpansFeature(variantInterval, transcriptInterval)) {
return GATKSVVCFConstants.COPY_GAIN;
return GATKSVVCFConstants.COPY_GAIN; // return CG immediately because same regardless of isDispersed
} else if (isComplex) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there no defined complex event types that include a tandem duplication? It might be good to document that assumption somewhere if you haven't already. You could also consider renaming isComplex to isDispersedDuplication also to future proof it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in our current set of CPX events all DUP segments are dispersed. I added a comment to clarify this. I like the idea of renaming the variable, although I currently only check the SVTYPE - would it be better to check the CPX_TYPE against our current list of DUP-containing CPX events to set an isDispersedDuplication variable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth making this assumption explicit in the code. You could make a method boolean includesDispersedDuplication(ComplexVariantSubtype cpxType) that returns this. There might be a couple of other places in the code that could use this method (in this class or in SVCallRecordUtils). If we could start making some methods like that eventually we could have ComplexVariant be a full class that knows what the corresponding segments should be and could validate them and answer questions about them.

* @return - SimpleInterval representing the portion of primaryInterval not overlapped by secondaryInterval
*/
@VisibleForTesting
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider also the name subtractInterval (a la bedtools).

return new SVSegment(svTypeForInterval, interval);
protected static SimpleInterval getNonOverlappingInterval(final SimpleInterval primaryInterval,
final SimpleInterval secondaryInterval) {
if (!primaryInterval.getContig().equals(secondaryInterval.getContig())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it would be worth checking the assumption that the intervals do overlap -- otherwise it'll return results that don't make much sense.

@@ -620,7 +717,49 @@ protected static List<SVSegment> getSVSegments(final VariantContext variant,
return intervals;
}

/**
* Returns a new subsetted list of SVSegments to use for promoter & noncoding annotations for complex SVs.
* Removes DUP segments which are never tandem in CPX events
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true only for our current set of defined CPX events, right? Might want to modify this comment to reflect that assumption.

* @return - Subsetted list of SVSegments to use for promoter & noncoding annotations for CPX SVs
*/
@VisibleForTesting
protected static List<SVSegment> getSegmentsForNonCodingAnnotations(final List<SVSegment> svSegments) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd modify the name of this to reinforce that it's meant for CPX events only. Or alternatively, you might consider making a little helper class (CPXEventAnnotationSegmenter?) that groups some of these methods that are specific to CPX events: getSegmentsForNonCodingAnnotations, getSegmentForNearestTSS, etc.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated these functions to run on all SV types instead of just complex, to go with the other change you suggested below to create new lists of segments for each type of annotation

// annotate nearest TSS for intergenic variants with no promoter overlaps
if (gtfIntervalTrees != null && gtfIntervalTrees.getTranscriptionStartSiteTree() != null &&
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && noCodingAnnotations) {
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && isIntergenic) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be less confusing to just call the isIntergenic() method here. It already checks the keys of variantConsquenceDict so you wouldn't need to check for PROMOTER, and we don't refer to the isIntergenic variable prior to this point. You could just get rid of the variable altogether IMO.

Copy link
Contributor Author

@epiercehoffman epiercehoffman Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that I could move TSS annotations above non-coding annotations then call isIntergenic() here for the same result; however, a variant can be intergenic and still have promoter annotations, in which case it would not get a nearest TSS annotation, and in that case I think it's confusing to reuse the isIntergenic() function to identify a different set of criteria. And I think I need the isIntergenic boolean because I need to store the information about whether there are any coding annotations to use later to apply the INTERGENIC flag. Alternatively I could modify the function to ignore the noncoding annotations, and maybe create two different functions - one that ignores all noncoding annotations and one that ignores noncoding annotations except for promoters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you make the isIntergenic method smart enough to ignore promoter consequences? It just feels like you are adding state here to work around the fact that the method is only looking at protein coding consequences. Sorry if I am missing something though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated isIntergenic() to check for an absence of protein-coding consequences and ignore other consequences, so that the result is valid as long as all protein-coding consequences have been annotated, even if non-coding annotations have been added. Let me know if that addresses the issue


// for CPX events, update SV segments to annotate promoter & noncoding consequences
if (overallSVType == GATKSVVCFConstants.StructuralVariantAnnotationType.CPX) {
svSegments = getSegmentsForNonCodingAnnotations(svSegments);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than changing the value of svSegments I think it would be better to be explicit and make (final) svSegmentsForGeneOverlaps, svSegmentsForNonCodingAnnotations, and svSegmentsForNearestTSS variables that you set appropriately before each call. Overriding svSegments before each call seems like too much dependence on mutable state to me.

@@ -581,6 +744,54 @@ public Object[][] getAnnotateStructuralVariantTestData() {
createAttributesMap(
Arrays.asList(GATKSVVCFConstants.PROMOTER, GATKSVVCFConstants.INTERGENIC),
Arrays.asList("EMMA1", true)) },
// CPX with dDUP CG and promoter annotations
{ createVariantContext("chr1", 10, 11, null, null, null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have all the CPX types tested here: dupINVdup, delINVdup. Also I was wondering about a gene that fully lies within the DUP interval of a CPX event like dupINVdup -- I think it will work but it'd be nice to see an example tested here where that gets a copy gain annotation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added dupINVdup with copy gain. There are still 5 complex types not represented in this particular unit test (although all types are represented in testGetSegmentsFromComplexIntervals, a few more are in testAnnotateComplexEvents, and I tried to cover all the interval modifications that are done for noncoding and TSS annotations)

final SimpleInterval transcriptInterval = new SimpleInterval(gtfTranscript);
if (variantSpansFeature(variantInterval, transcriptInterval)) {
return GATKSVVCFConstants.COPY_GAIN;
return GATKSVVCFConstants.COPY_GAIN; // return CG immediately because same regardless of isDispersed
} else if (isComplex) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth making this assumption explicit in the code. You could make a method boolean includesDispersedDuplication(ComplexVariantSubtype cpxType) that returns this. There might be a couple of other places in the code that could use this method (in this class or in SVCallRecordUtils). If we could start making some methods like that eventually we could have ComplexVariant be a full class that knows what the corresponding segments should be and could validate them and answer questions about them.

@VisibleForTesting
protected static boolean isIntergenic(final Map<String,Set<String>> variantConsequenceDict) {
for (final String consequence : variantConsequenceDict.keySet()) {
if (consequence != GATKSVVCFConstants.PARTIAL_DISPERSED_DUP) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be compared with !GATKSVVCFConstants.PARTIAL_DISPERSED_DUP.equals(consequence)

// annotate nearest TSS for intergenic variants with no promoter overlaps
if (gtfIntervalTrees != null && gtfIntervalTrees.getTranscriptionStartSiteTree() != null &&
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && noCodingAnnotations) {
!variantConsequenceDict.containsKey(GATKSVVCFConstants.PROMOTER) && isIntergenic) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you make the isIntergenic method smart enough to ignore promoter consequences? It just feels like you are adding state here to work around the fact that the method is only looking at protein coding consequences. Sorry if I am missing something though.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, looks good now!

@epiercehoffman epiercehoffman merged commit e796d20 into master Jan 23, 2024
20 checks passed
@epiercehoffman epiercehoffman deleted the eph_cpx_svannotate branch January 23, 2024 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants