# Add a few additional failures to our notes doc #8980

Merged Jan 2, 2025 (4 commits)
`scripts/variantstore/beta_docs/gvs-troubleshooting.md` (30 additions, 13 deletions)

Generally, if you have started the GVS workflow and it failed after ingestion, …
2. `BadRequestException: 400 Bucket is a requester pays bucket but no user project provided.`
    1. GVS can ingest data from a requester-pays bucket by setting the optional `billing_project_id` input variable to the ID of a Google project that will be charged for the egress of the GVCFs and index files (see the sketch below).
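
As a quick pre-flight check, here is a minimal sketch, using the google-cloud-storage Python client, for confirming that a billing project can read from a requester-pays bucket; the bucket name, object path, and project ID are all placeholders:

```python
# Sketch: confirm a billing project can read from a requester-pays bucket.
# Requires google-cloud-storage; all names below are placeholders.
from google.cloud import storage

billing_project = "my-terra-billing-project"  # project charged for egress
client = storage.Client(project=billing_project)

# user_project plays the same role as GVS's billing_project_id input
bucket = client.bucket("example-requester-pays-bucket", user_project=billing_project)
blob = bucket.blob("cohort/sample1.g.vcf.gz")  # placeholder GVCF path
print("readable:", blob.exists())
```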

## Ingestion-Specific Issues
1. My workflow failed during ingestion; can I restart it?
    1. Yes. If the workflow fails during ingestion, the GvsBeta workflow is restartable and will pick up where it left off.

## Reblocking
1. `htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: file:///cromwell_root/v1_[uuid]`
    1. If you are running ReblockGVCF from a TDR snapshot, you will see this error if you did not check the "Convert DRS URLs to Google Cloud Storage Paths (gs://)" box before exporting the snapshot.
2. GVS is running very slowly!
    1. If your GVS workflow is running very slowly compared to the example runtimes in the workspace, you may have run GVS on GVCFs that have not been reblocked. Confirm your GVCFs are reblocked (see the sketch below).
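
A minimal way to spot-check a single GVCF locally, assuming it is bgzip/gzip-compressed and that (as with GATK-produced reblocked GVCFs) the ReblockGVCF run is recorded in a `##GATKCommandLine` header line:

```python
# Sketch: check whether a GVCF header records a ReblockGVCF run.
# Assumes a bgzip/gzip-compressed GVCF (bgzip output is gzip-readable).
import gzip

def looks_reblocked(gvcf_path: str) -> bool:
    with gzip.open(gvcf_path, "rt") as fh:
        for line in fh:
            if not line.startswith("#"):
                break  # past the header without finding a ReblockGVCF line
            if line.startswith("##GATKCommandLine") and "ReblockGVCF" in line:
                return True
    return False

print(looks_reblocked("sample1.g.vcf.gz"))  # placeholder local GVCF path
```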

## Runtime errors
1. Duplicate sample names error: `ERROR: The input file ~{sample_names_file} contains the following duplicate entries:`
    1. GVS requires that sample names are unique because they are used to name the samples in the output VCF, and the VCF format requires unique sample names.
    1. After deleting or renaming the duplicate samples, you can restart the workflow without any cleanup (see the duplicate-finding sketch after this list).
1. During Ingest: `Required file output '/cromwell_root/gvs_ids.csv' does not exist.`
    1. If you've attempted to run GVS more than once in the same BigQuery dataset, you may see this error. Delete the dataset and create a new one; we recommend giving the new dataset a different name than the one you deleted (see the dataset reset sketch after this list).
1. AssignIds failure with error message: `BigQuery error in mk operation: Not found: Dataset`
    1. This means GVS was unable to find the BigQuery dataset specified in the inputs. If you haven't created a BigQuery dataset prior to running the workflow, you can follow the steps in [the quickstart](./gvs-quickstart.md). If you created it and still see this error, check that the dataset name matches your input and that the Google project in the inputs is correct. Lastly, confirm you have set up the correct permissions for your Terra proxy account following the instructions in the quickstart (see the verification sketch after this list).
1. Ingest failure with error message: `raise ValueError("vcf column not in table")`
    1. This occurs if you have given an incorrect name for the VCF column or the VCF index column.
    1. Restart the workflow with the correct names.
1. Ingest failure with error message: `Invalid resource name projects/gvs_internal; Project id: gvs_internal.`
    1. This occurs if you have given an incorrect name for the Google project.
    1. Restart the workflow with the correct name.
1. Ingest failure with `Max id is 0. Exiting.`
    1. You will want to start over completely: delete your BigQuery dataset and then re-create it (see the dataset reset sketch after this list). It can have the exact same name.
1. Ingest failure: `There is already a list of sample names. This may need manual cleanup. Exiting.`
    1. Clean up the BigQuery dataset manually by deleting it and re-creating it fresh (see the dataset reset sketch after this list).
    1. Make sure to keep call caching on, then run the workflow again.
1. Ingest failure with error message: `A USER ERROR has occurred: Cannot be missing required value for ___` (e.g. `alternate_bases.AS_RAW_MQ`, `RAW_MQandDP`, or `RAW_MQ`)
    1. This means that there is at least one incorrectly formatted sample in your data model. Confirm your GVCFs are reblocked (see the sketch in the Reblocking section). If the incorrectly formatted samples are a small portion of your callset and you wish to ignore them, simply delete them from the data model and restart the workflow without them. There should be no issue with restarting from here, as none of these samples were loaded.
1. Extract failure with `OSError: Is a directory`.
    1. If you point your extract output at a directory that doesn't already exist, the workflow will fail with this error. Simply make the directory and run the workflow again.
1. Ingest failure with: `Lock table error`
    1. This means that the lock table was created, but the ingest failed soon after, or some underlying data was deleted during manual cleanup from another failure.
    1. Simply delete the lock table, `sample_id_assignment_lock`, and kick off the ingest again (see the lock-table sketch after this list).
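
For the duplicate sample names failure above, here is a minimal standard-library sketch for listing the duplicate entries in your sample-names file; the file name is a placeholder, and one sample name per line is assumed:

```python
# Sketch: report duplicate entries in a sample-names file (one name per line).
from collections import Counter

with open("sample_names.txt") as fh:  # placeholder path to your sample-names file
    counts = Counter(line.strip() for line in fh if line.strip())

duplicates = [name for name, n in counts.items() if n > 1]
print(duplicates or "no duplicates found")
```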
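
For the AssignIds `Not found: Dataset` failure, a quick way to verify that the dataset named in your inputs exists and is visible to your credentials is the google-cloud-bigquery client; the project and dataset names below are placeholders:

```python
# Sketch: verify the BigQuery dataset named in the workflow inputs exists.
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

project, dataset = "my-google-project", "my_gvs_dataset"  # placeholders
client = bigquery.Client(project=project)
try:
    client.get_dataset(f"{project}.{dataset}")
    print("dataset found")
except NotFound:
    print("dataset not found: check the name, project, and proxy-account permissions")
```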
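
Several of the failures above (`gvs_ids.csv`, `Max id is 0`, the sample-names list) are resolved by deleting and re-creating the BigQuery dataset. A destructive sketch, again with placeholder names; for the `gvs_ids.csv` failure, prefer a brand-new dataset name, while for `Max id is 0` the same name is fine:

```python
# Sketch: delete a GVS BigQuery dataset and all its tables, then re-create it.
# WARNING: destructive; this removes all ingested data in the dataset.
from google.cloud import bigquery

project, dataset = "my-google-project", "my_gvs_dataset"  # placeholders
client = bigquery.Client(project=project)

# delete_contents drops all tables; not_found_ok tolerates an already-deleted dataset
client.delete_dataset(f"{project}.{dataset}", delete_contents=True, not_found_ok=True)
client.create_dataset(f"{project}.{dataset}")  # or use a new name per the advice above
```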
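
And for the lock table failure, deleting just the lock table (rather than the whole dataset) looks like this, assuming the same placeholder names:

```python
# Sketch: drop the ingest lock table so the ingest can be kicked off again.
from google.cloud import bigquery

project, dataset = "my-google-project", "my_gvs_dataset"  # placeholders
client = bigquery.Client(project=project)
client.delete_table(f"{project}.{dataset}.sample_id_assignment_lock", not_found_ok=True)
```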