This repository contains the code generated and run to systematically compare low-coverage whole-genome sequencing against GWAS array genotyping in terms of accuracy and sensitivity. Starting data consisted of high-coverage whole genomes (target: 30X) from the five study sites included in this project. Reads were then downsampled to a target depth of coverage, or variants were subset to the sites assayed by a GWAS array. Variant call set quality was assessed, and performance was compared against the high-coverage genomes as a "truth" set.
For sequencing data:
- Downsample CRAM files to specific depths (see the downsample-bam.wdl workflow with corresponding example inputs, downsample-bam.json)
- Run GATK HaplotypeCaller per sample with appropriate arguments (see example inputs)
- Create a joint call set following GATK best practices with appropriate arguments (see example inputs); a command-line sketch of all three steps follows this list
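
The sketch below walks through these three steps for a single sample, assuming a 30X input CRAM, a GRCh38 reference, and a target depth of roughly 4X; the sample names, paths, and subsampling fraction are placeholders, and the authoritative logic lives in downsample-bam.wdl and the GATK workflows.

```python
# Hypothetical command-line sketch of the three sequencing steps for one
# sample; paths, sample names, and the subsampling fraction are placeholders.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

ref = "GRCh38.fa"        # assumed reference FASTA
sample = "sample1"       # placeholder sample name

# 1. Downsample the CRAM; samtools view -s takes SEED.FRACTION,
#    so 42.13 keeps ~13% of reads (~4X from a 30X genome) with seed 42.
run(["samtools", "view", "-C", "-T", ref, "-s", "42.13",
     "-o", f"{sample}.4x.cram", f"{sample}.30x.cram"])

# 2. Per-sample calling with HaplotypeCaller in GVCF mode.
run(["gatk", "HaplotypeCaller", "-R", ref, "-I", f"{sample}.4x.cram",
     "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"])

# 3. Joint genotyping across all samples (CombineGVCFs shown for brevity;
#    GenomicsDBImport is the usual best-practices route at cohort scale).
run(["gatk", "CombineGVCFs", "-R", ref,
     "-V", "sample1.g.vcf.gz", "-V", "sample2.g.vcf.gz",
     "-O", "cohort.g.vcf.gz"])
run(["gatk", "GenotypeGVCFs", "-R", ref,
     "-V", "cohort.g.vcf.gz", "-O", "cohort.vcf.gz"])
```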
For array data:
- Subset the full call set to the sites present on specific arrays using Hail, as described in extract_array_sites.py (see the sketch below)
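
A minimal Hail sketch of this subsetting step, assuming an array manifest with hypothetical chrom/pos columns; extract_array_sites.py is the authoritative version, and the matrix table and manifest paths are placeholders.

```python
import hail as hl

hl.init()

mt = hl.read_matrix_table("full_callset.mt")   # assumed full-depth call set

# Load a (hypothetical) array manifest with chromosome/position columns
# and key it by locus so it can be joined against the call set.
sites = hl.import_table("array_manifest.tsv",
                        types={"chrom": hl.tstr, "pos": hl.tint32})
sites = sites.key_by(locus=hl.locus(sites.chrom, sites.pos,
                                    reference_genome="GRCh38"))

# Keep only rows whose locus appears on the array.
array_mt = mt.filter_rows(hl.is_defined(sites[mt.locus]))
hl.export_vcf(array_mt, "array_sites.vcf.bgz")
```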
For both:
- Run BEAGLE using the beagle-refine-impute.wdl workflow with corresponding example inputs (beagle-refine-impute.json). The sequencing and array analyses differ as follows (a sketch of both paths follows this list):
  - Use a two-step approach for the low-coverage data: first refine genotypes using genotype likelihoods rather than hard genotype calls (the GL rather than GT field, in VCF parlance), then use the refined hard calls to run imputation. BEAGLE v4.1 was used for refinement, and BEAGLE v5.1 was used for imputation. Different reference panels were used for these steps, as refinement is much more computationally intensive (refinement reference panel subsetting script here: refine_reference.py).
  - There is no point to refinement for the array analyses, since array genotypes are already hard calls with no likelihoods; just run imputation using BEAGLE v5.1.
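
A sketch of the two BEAGLE invocations, assuming locally downloaded jars and placeholder VCF and panel paths; the actual commands and resource settings are defined in beagle-refine-impute.wdl. BEAGLE 4.1 accepts genotype likelihoods via its gl= argument, which BEAGLE 5.x no longer supports.

```python
# Jar file names, input VCFs, and panel paths below are placeholders.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# Step 1 (sequencing only): genotype refinement from genotype likelihoods.
# BEAGLE 4.1 reads likelihoods via gl=; BEAGLE 5.x dropped this mode.
run(["java", "-jar", "beagle41.jar",
     "gl=lowcov.vcf.gz",
     "ref=refine_panel.vcf.gz",   # subset panel (see refine_reference.py)
     "out=refined"])              # writes refined.vcf.gz with hard GT calls

# Step 2 (both analyses): imputation from hard calls with BEAGLE 5.1.
# Array data starts here, since its genotypes are already hard calls.
run(["java", "-jar", "beagle51.jar",
     "gt=refined.vcf.gz",         # or the array-site VCF for array data
     "ref=full_panel.vcf.gz",
     "out=imputed"])
```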
- In addition to standard variant call metrics from GATK (gathered routinely as part of beagle-refine-impute.wdl and the standard GATK workflows), gather concordance and sensitivity metrics relative to the full-depth dataset, as in variant_info.py (a minimal sketch follows)
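
A minimal Python sketch of what those two metrics measure, comparing an evaluation VCF to the full-depth "truth" VCF with pysam; file names are placeholders, only biallelic records are compared, and everything is held in memory (fine for a single region), whereas variant_info.py is the authoritative implementation.

```python
import pysam

def genotypes(path):
    """Map (chrom, pos, ref, alt) -> {sample: sorted GT tuple}."""
    calls = {}
    for rec in pysam.VariantFile(path):
        if rec.alts is None or len(rec.alts) != 1:
            continue  # skip multi-allelic records for simplicity
        site = {}
        for name, data in rec.samples.items():
            gt = data.get("GT")
            if gt and None not in gt:   # keep fully called genotypes only
                site[name] = tuple(sorted(gt))
        calls[(rec.chrom, rec.pos, rec.ref, rec.alts[0])] = site
    return calls

truth = genotypes("full_depth.vcf.gz")   # 30X "truth" call set
evald = genotypes("evaluated.vcf.gz")    # downsampled or array + imputation

# Sensitivity: fraction of truth sites recovered in the evaluation set.
# Concordance: fraction of matching genotypes at the shared sites.
shared = truth.keys() & evald.keys()
concordant = total = 0
for key in shared:
    for sample, gt in truth[key].items():
        if sample in evald[key]:
            total += 1
            concordant += evald[key][sample] == gt

print(f"site sensitivity: {len(shared) / len(truth):.4f}")
print(f"genotype concordance at shared sites: {concordant / max(total, 1):.4f}")
```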