Cluster Support

Chee-Hong Wong edited this page Dec 31, 2017 · 1 revision

If it is taking too long to process your .fastq file and you have access to a cluster running PBS, you can use the following procedure to tap into your cluster's compute power.

Step 1 : Configure the template file

Picky preparepbs will assume that the template file is named "template.pbs".

You may create "template.pbs" with

./picky.pl preparepbs --init template.pbs

The default template.pbs content is:

#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=01:00:00
#PBS -l mem=18GB

cd "$PBS_O_WORKDIR"
export LASTAL=last-755/src/lastal
export LASTALDB=hg19.lastdb
export LASTALDBFASTA=hg19.fa
export PICKY=./picky.pl
export RUN=
time (${LASTAL} -v -C2 -K2 -r1 -q3 -a2 -b1 -P16 -Q1 ${LASTALDB} ${RUN}.fastq 2>${RUN}.lastal.log | ${PICKY} selectRep --thread 16 --preload 6 1>${RUN}.align 2>${RUN}.selectRep.log)
time (cat ${RUN}.align | ${PICKY} callSV --oprefix ${RUN}.sv --fastq ${RUN}.fastq --genome ${LASTALDBFASTA} --exclude=chrM --sam)

IMPORTANT: You MUST leave the line "export RUN=" as Picky will initialize the corresponding value for each chunk.

Depending on your cluster environment, you will have to set the various resource settings appropriately. The default settings above work for a 1,000-read chunk in our environment, with a 2-fold buffer for execution time and a 1.8GB buffer for memory using 16 cores.

In addition, you should configure the first four export lines according to your installation and project.
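If you prefer to script this step, the export lines can be rewritten in place with sed. The paths below are placeholders, not real installation paths; substitute your own lastal binary, lastdb, genome FASTA, and picky.pl locations:

```shell
# Point the first four export lines at your installation.
# All paths here are hypothetical examples -- adjust for your site.
sed -i \
    -e 's|^export LASTAL=.*|export LASTAL=/opt/last-755/src/lastal|' \
    -e 's|^export LASTALDB=.*|export LASTALDB=/data/ref/hg19.lastdb|' \
    -e 's|^export LASTALDBFASTA=.*|export LASTALDBFASTA=/data/ref/hg19.fa|' \
    -e 's|^export PICKY=.*|export PICKY=/opt/picky/picky.pl|' \
    template.pbs
```

Leave the "export RUN=" line untouched, as noted above.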

Step 2 : Creating the chunked fastq and cluster script

Once your template.pbs has been configured for your cluster environment, you are ready to chunk the fastq (say, "SCP20.fastq") and write the corresponding PBS scripts with:

./picky.pl preparepbs --fastq SCP20.fastq

In the above example, SCP20.fastq was converted from the public ONT Scrappie chr20 FASTA dataset using faToFastq from the UCSC kent utilities, as follows:

# download faToFastq
curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToFastq

# make user executable
chmod u+x faToFastq

# download Scrappie based-called chr20 reads
curl -O http://s3.amazonaws.com/nanopore-human-wgs/na12878.chr20ScrappieFiltered.fasta

# convert fasta to fastq with default base quality 'H'
./faToFastq -qual=H na12878.chr20ScrappieFiltered.fasta SCP20.fastq

The file contains 277,054 reads. Picky preparepbs will generate 278 chunk .fastq files (SCP20-c000001.fastq to SCP20-c000278.fastq) and the corresponding 278 PBS scripts (SCP20-c000001.pbs to SCP20-c000278.pbs).
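You can sanity-check the read count and the expected number of chunks before submitting anything. A FASTQ record is four lines, and (per the resource notes above) each chunk holds 1,000 reads, so 277,054 reads yield 278 chunks. A minimal sketch:

```shell
# Count FASTQ reads (4 lines per record) and the expected number of
# 1,000-read chunks for a given .fastq file.
count_chunks() {
    local reads chunks
    reads=$(( $(wc -l < "$1") / 4 ))
    chunks=$(( (reads + 999) / 1000 ))
    echo "${reads} reads -> ${chunks} chunk(s)"
}
# e.g. count_chunks SCP20.fastq   # 277054 reads -> 278 chunk(s)
```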

Step 3 : Submitting the cluster jobs

You can submit the generated .pbs scripts according to your cluster configuration. If there is no restriction on the number of submitted jobs per user, you can submit all the scripts at once with:

for i in SCP20-c??????.pbs ; do echo ${i}; qsub ${i}; done
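If your scheduler does limit the number of queued jobs per user, a simple throttle can keep your submissions below a cap. This is a sketch, assuming a standard PBS `qstat -u` output with one line per job and a hypothetical cap of 100; adjust both for your site:

```shell
# Submit scripts one by one, sleeping while this user's job count
# (as counted from qstat output) is at or above the given cap.
submit_throttled() {
    local cap="$1" user="${USER:-$(id -un)}" script
    shift
    for script in "$@"; do
        while [ "$(qstat -u "$user" 2>/dev/null | grep -c "$user")" -ge "$cap" ]; do
            sleep 60
        done
        qsub "$script"
    done
}
# e.g. submit_throttled 100 SCP20-c??????.pbs
```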

Step 4 : Combining the results

Each chunk will have produced its own set of results, as outlined in the output documentation. There are numerous ways to combine the individual chunk results into a single result set.

Option 1: Concatenating SV .xls files

If one is interested in deletions, the consolidated result can be generated as follows:

cat SCP20-c??????.sv.profile.DEL.xls > SCP20.all.sv.profile.DEL.xls
./picky.pl xls2vcf \
  --xls SCP20.all.sv.profile.DEL.xls \
  > SCP20.all.sv.profile.DEL.vcf

This can be repeated for all other SV types.
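The repetition can be scripted by discovering the SV type names directly from the chunk output filenames, rather than hard-coding the list:

```shell
# Concatenate each SV type's chunk .xls files and convert the merged
# file to VCF.  $1 is the run prefix (e.g. SCP20); the type names are
# taken from the *-c??????.sv.profile.<TYPE>.xls filenames themselves.
merge_sv_xls() {
    local prefix="$1" type
    for type in $(ls "${prefix}"-c??????.sv.profile.*.xls \
                  | sed 's/.*\.sv\.profile\.\(.*\)\.xls$/\1/' | sort -u); do
        cat "${prefix}"-c??????.sv.profile."${type}".xls \
            > "${prefix}.all.sv.profile.${type}.xls"
        ./picky.pl xls2vcf --xls "${prefix}.all.sv.profile.${type}.xls" \
            > "${prefix}.all.sv.profile.${type}.vcf"
    done
}
# e.g. merge_sv_xls SCP20
```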

Option 2: Concatenating representative alignment .align files

If one would like proper run-level auxiliary files, it is better to concatenate the chunks' .align files and re-run Picky callSV.

cat SCP20-c??????.align > SCP20.all.align
cat SCP20.all.align \
  | ./picky.pl callSV \
    --oprefix SCP20.all \
    --fastq SCP20.fastq \
    --genome hg19.fa \
    --exclude=chrM \
    --sam 2>SCP20.all.callSV.log

Option 3: Concatenating some of the auxiliary files

Most often, the .sam file is the only auxiliary file one needs. As Option 2 takes up additional storage and time, one may simply merge the chunks' .sam files, handling the extraneous SAM headers appropriately.
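A minimal sketch of such a merge: keep the complete header from the first chunk and strip the `@`-prefixed header lines from every subsequent chunk. Since all chunks were aligned against the same reference with the same settings, their headers are interchangeable; `samtools merge` on sorted BAMs is a more robust alternative.

```shell
# Merge chunk .sam files: full header from the first file, then
# alignment records only (no @-prefixed header lines) from the rest.
merge_sam() {
    local out="$1" f
    shift
    cat "$1" > "$out"
    shift
    for f in "$@"; do
        grep -v '^@' "$f" >> "$out"
    done
}
# e.g. merge_sam SCP20.all.sam SCP20-c??????.sv.sam
# (adjust the glob to match your chunks' .sam filenames)
```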