Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vs 1063 bit pack ref ranges data into a more efficient representation #8543

Conversation

koncheto-broad
Copy link

@koncheto-broad koncheto-broad commented Oct 6, 2023

Adding in the option for using a compressed representation of our reference data. The flag use_compressed_references needs to be passed into ingest and also export, and doing so causes us to use a different internal schema for the ref_ranges table that can save ~40% on space. On the VCF path, the packed data is expanded while we populate the prepare tables before extract. On the VDS path, the packed data is expanded while we extract from the tables to create the Avro files.

I updated integration tests to take use_compressed_refs as an option, and saw what I needed. Integrations tests for the VCF and VDS paths go to completion and fail, but in the expected way. AssertIdenticalOutputs, the important part, passes. AssertTableSizesAreExpected fails because the ref_ranges table is so much smaller (255884464 vs 431805033 expected) and AssertCostIsTrackedAndExpected fails because the PrepareRangesCallsetTask.GvsPrepareRanges.BigQuery Query Scanned is similarly smaller (340787200 vs 515899392). Given the nature of the change and the fact that it is an optional flag that defaults to zero, updating the integration tests to cover this path did not seem sound at this time. But using them to verify that the content of the final vcfs wasn't affected by the change makes more sense.

VCF integration run: https://job-manager.dsde-prod.broadinstitute.org/jobs/e9896786-9d3f-48b9-ab6f-9a76d9fafafe
Hail integration run: https://job-manager.dsde-prod.broadinstitute.org/jobs/dbdd0934-50c3-4477-a191-3282341eacd3

another post-merge integration run with same results: identical outputs, failure due to lower table sizes and query scan costs
https://job-manager.dsde-prod.broadinstitute.org/jobs/edabdac3-1856-4e2a-aadc-3bcd4c353956

…o expand out the packed ref field so it looks like what the prepare tables are expecting
…sed ref extract. But, until I sort out a better way to load them in advance, this should work.
Copy link
Collaborator

@mcovarr mcovarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs documentation, unit tests over core logic, WDL fix.

EDIT: Also as discussed in standup 2023-10-10, the Hail integration tests which effectively represent the AoU path should switch this on by default as we're currently planning to use this feature from Echo onward.

Copy link
Collaborator

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A couple of comments.
Also, please put in the code / documentation somewhere the representation of the packed ref_ranges table (the one in the spike: https://broadworkbench.atlassian.net/browse/VS-973

Copy link
Collaborator

@mcovarr mcovarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if we're very sure there won't be Docker collisions

scripts/variantstore/wdl/GvsUtils.wdl Show resolved Hide resolved
scripts/variantstore/wdl/GvsImportGenomes.wdl Outdated Show resolved Hide resolved
flipping defaults back
@koncheto-broad koncheto-broad merged commit 0750dc5 into ah_var_store Oct 12, 2023
13 checks passed
@koncheto-broad koncheto-broad deleted the VS-1063-bit-pack-ref-ranges-data-into-a-more-efficient-representation branch October 12, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants