-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vs 1063 bit pack ref ranges data into a more efficient representation #8543
Vs 1063 bit pack ref ranges data into a more efficient representation #8543
Conversation
…o expand out the packed ref field so it looks like what the prepare tables are expecting
…hat I copied here
…sed ref extract. But, until I sort out a better way to load them in advance, this should work.
… integration used compressed refs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs documentation, unit tests over core logic, WDL fix.
EDIT: Also as discussed in standup 2023-10-10, the Hail integration tests which effectively represent the AoU path should switch this on by default as we're currently planning to use this feature from Echo onward.
src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/CreateVariantIngestFiles.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/ingest/RefRangesAvroWriter.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/gvs/common/SchemaUtils.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A couple of comments.
Also, please put in the code / documentation somewhere the representation of the packed ref_ranges table (the one in the spike: https://broadworkbench.atlassian.net/browse/VS-973
…to lock the bit packing behavior in place
…ate the gatk docker
…plain-er English for easier comprehension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if we're very sure there won't be Docker collisions
flipping defaults back
Adding in the option for using a compressed representation of our reference data. The flag use_compressed_references needs to be passed into ingest and also export, and doing so causes us to use a different internal schema for the ref_ranges table that can save ~40% on space. On the VCF path, the packed data is expanded while we populate the prepare tables before extract. On the VDS path, the packed data is expanded while we extract from the tables to create the Avro files.
I updated integration tests to take use_compressed_refs as an option, and saw what I needed. Integrations tests for the VCF and VDS paths go to completion and fail, but in the expected way. AssertIdenticalOutputs, the important part, passes. AssertTableSizesAreExpected fails because the ref_ranges table is so much smaller (255884464 vs 431805033 expected) and AssertCostIsTrackedAndExpected fails because the PrepareRangesCallsetTask.GvsPrepareRanges.BigQuery Query Scanned is similarly smaller (340787200 vs 515899392). Given the nature of the change and the fact that it is an optional flag that defaults to zero, updating the integration tests to cover this path did not seem sound at this time. But using them to verify that the content of the final vcfs wasn't affected by the change makes more sense.
VCF integration run: https://job-manager.dsde-prod.broadinstitute.org/jobs/e9896786-9d3f-48b9-ab6f-9a76d9fafafe
Hail integration run: https://job-manager.dsde-prod.broadinstitute.org/jobs/dbdd0934-50c3-4477-a191-3282341eacd3
another post-merge integration run with same results: identical outputs, failure due to lower table sizes and query scan costs
https://job-manager.dsde-prod.broadinstitute.org/jobs/edabdac3-1856-4e2a-aadc-3bcd4c353956