GitHub - b-kolar/Sample-Code---JRC_Seeker

JRC_seeker is a tool for identifying jointly regulated CpGs (JRCs) in the human methylome. These regions can be classified into a number of genomic phenomenon, such as imprinting regions or methylation quantitative trait loci (mQTLs). Developed by the Genetic Identification Lab at Erasmus Medical Center, a BAM file of WGBS reads can be inputted and the result of this tool is a list of JRC locations and their associated p-values. JRC_seeker is built using a Snakemake pipeline that combines Python scripts with Linux shell commands.

Requirements

Operating system: tested on Ubuntu 18.04.6 LTS (Bionic Beaver)
R: tested on R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Python: Python 3.9.12
RAM requirements: Do not attempt to run without at least 50 GB of RAM.
Runtime: Approximately 1 day for 98 GB BAM file and 3.1 GB reference genome, using 20 cores.

Dependencies

conda

Download and install conda if you do not have it already on your machine.

wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh

mamba

Install Mamba into your Conda-based python distribution

conda install -n base -c conda-forge mamba

Activate the Conda base environment (which now includes Mamba).

conda activate base

snakemake (at least v4.3.1)

Create a new conda environment called jrc_seeker with snakemake and python 3.9 in it.

mamba create -c conda-forge -c bioconda -n jrc_seeker snakemake python=3.9

Option: If conda-forge did not work for you, simply create a conda environment like this and use the pip command options below:

conda create -n jrc_seeker python=3.9 snakemake

Activate the jrc_seeker conda environment.

conda activate jrc_seeker

Check whether Snakemake is succesfully installed by running the following command:

snakemake --help

biopython-1.79

Install the following packages: biopython=1.79 .

conda install -c conda-forge biopython=1.79

or

pip install "biopython==1.79"

tabix

conda install -c bioconda tabix

or

pip install tabix

pandas

conda install -c anaconda pandas

or

pip install pandas

bgzip

conda install -c bioconda bgzip

or

pip install bgzip

ChromHMM

Quick instructions on running ChromHMM:

Install Java 1.5 or later if not already installed.
Download and unzip the ChromHMM.zip file using the following code snippit:

wget http://compbio.mit.edu/ChromHMM/ChromHMM.zip
unzip ChromHMM.zip

bedtools

conda install -c bioconda bedtools

or

pip install bedtools

samtools

Recommended version: 1.14

conda install -c bioconda samtools

or

pip install "samtools==1.14"

samblaster

conda install -c bioconda samblaster

or

pip install samblaster

BISCUIT

conda install -c bioconda biscuit

R

If not installed already, be sure you have R version 4.1.2 or greater.

R dependencies

To install R dependencies, open R by running:

And run the following R commands:

# To install
install.packages('data.table')
install.packages('parallel')
install.packages('MASS')

# To verify that installation was succesful
library(data.table)
library(parallel)
library(MASS)

Downloading JRC_Seeker

Clone the repository with the following command:

mkdir jrc_seeker
cd jrc_seeker
git clone https://github.com/b-kolar/jrc_seeker.git

Process Overview

Generate methylation data - pileup, meth_info, format_meth_data
Binarize methylation data - binarize
ChromHMM genome segmentation - learnmodel
BinPolish - binpolish_assets, label_states, binpolish

Generate methylation data

BISCUIT is used to create a pileup VCF of DNA methylation and genetic information. After generating the VCF file, with both genetic and methylation information, beta values and coverage are extracted using the vcf2bed command to study the methylation levels at sequenced CpGs. The output of this step is a BED file of the format below, where the columns represent chromosome, CpG start base, CpG end base, beta methylation value (proportion of reads methylated), and total read coverage.

chr1    10469   10470   0.625   8
chr1    10471   10472   0.444   9
chr1    10484   10485   0.889   9
chr1    10489   10490   1.000   10
chr1    10493   10494   0.875   8

Binarize methylation data

From the above BED file, the counts of methylated and unmethylated reads at every CpG location are calculated using the format_methylation.py script. Using the format_methylation.py script CpGs counts are combined using 200-bp bins and a binarized track is formed per chromosome for all chromosomes that contain methylation data:

methylated  unmethylated
0           1
0           0
1           1
1           0
1           1

The first column represents if the bin contains methylated reads (0 = False, 1 = True) and the second column represents if the bin contains unmethylated reads (0 = False, 1 = True). Thus: 0 1 is an unmethylated bin, 1 0 is a methylated bin, and 1 1 is an intermediately methylated bin. This bins are binarized from count data using their methylation value. A methylation value less than 0.2 is unmethylated and above 0.8 is methylated. Between these values is intermediately methylated. These binarization thresholds can be changed in the config.json file using the lower_im_methylation_bound and the upper_im_methylation_bound variables. The bin size can also be adjusted in the config.json file using the bin_size variable. ChromHMM, the genome segmentation software used in the following step recommends a bin size of 200. Changing this bin size here will also change it for the ChromHMM step.

A threshold is also set to set bins with low coverage to a no-data state, aka 0 0. If there are less that 3 methylated/unmethylated counts (not CpGs), then this bin is set to the low-coverage state. This threshold can be set by changing the k_threshold variable in the config.json file.

ChromHMM genome segmentation

ChromHMM, a genome segmentation and annotation software tool, is leveraged to identify intermediately methylated regions within the provided dataset. Ultimately, this step summarizes the binarized bins into larger regions throughout the genome.

Finding intermediately methylated regions is necessary for finding JRCs. ChromHMM uses a multivariate hidden Markov Model to infer states from binarized data, allowing these regions to be identified and for the rest of the dataset/genome to be removed (as this is not of interest for the purpose of finding CpGs). ChromHMM segments all chromosomes provided into state regions and annotates them with a number. Each of these numbers corresponds to one of the four states of interest to us (no-data, unmethylated, methylated, intermediately methylated).

The output of ChromHMM is a list of segments that correspond to a given state. Using the emission matrix outputted by ChromHMM, the states can be annotated and the regions corresponding to intermediately methylated regions can be identified.

BinPolish

After the ChromHMM segmentation, there are often a large number of intermediately methylated regions, some of which are very small (200 bp). The goal of BinPolish is to remove some of these sparse, small regions or to summarize them into larger intermediately methylated blocks, thus reducing the large number of small intermediately methylated regions into a set of larger, robust ones. Essentially, we are polishing intermediately methylated segments.

Intermediately methylated regions are ignored that overlap with known regions that have low mappability (source) or are documented blacklisted regions (source). The number of CpGs per region is also calculated using the original data from BISCUIT. Following this, a the main steps employed that merge or remove intermediately methylated (IM) regions are:

Merge IM regions that are separated by 200bp
If small IM regions are within two larger regions, classify as the state of the earlier large region
If two IM regions are separated by a region that has less than one CpG, merge these IM regions
If IM region is classified wrongly, re-classify
Merge IM regions that are separated by 200bp (again)
If a state <=600bp is within two IM states that have higher cpg density, merge
States with CpG density <= 2 are turned to no-data states
If small IM regions are within two larger regions, classify as the state of the earlier large region (again)
Remove small regions (<= 200 bp)

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
scripts		scripts
LICENSE.md		LICENSE.md
README.md		README.md
Snakefile		Snakefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Requirements

Dependencies

conda

mamba

snakemake (at least v4.3.1)

biopython-1.79

tabix

pandas

bgzip

ChromHMM

bedtools

samtools

samblaster

BISCUIT

R

R dependencies

Downloading JRC_Seeker

Process Overview

Generate methylation data

Binarize methylation data

ChromHMM genome segmentation

BinPolish

About

Releases

Packages

Languages

License

b-kolar/Sample-Code---JRC_Seeker

Folders and files

Latest commit

History

Repository files navigation

Requirements

Dependencies

snakemake (at least v4.3.1)

R

R dependencies

Downloading JRC_Seeker

Process Overview

Generate methylation data

Binarize methylation data

ChromHMM genome segmentation

BinPolish

About

Resources

License

Stars

Watchers

Forks

Languages