
Welcome to the Picky wiki!

What is Picky?

Picky is a structural variant pipeline for long reads developed by Genome Technologies, The Jackson Laboratory. Picky was initially designed for Oxford Nanopore long reads, but also works for PacBio reads. Picky uses LAST to generate all possible alignments for a read and 'picks' the representative alignments with a greedy algorithm instead of using last-split. Picky's picking neither assumes colinearity nor requires that each read base be aligned to at most a single genomic location.

  1. Why build another structural variant pipeline?

  2. How do I process Oxford Nanopore sequencing data for structural variants?

  3. How do I process PacBio sequencing data for structural variants?

  4. What do I make of the output?

  5. How do I scale up with my cluster?

Download

Picky release

Latest LAST release

Quick start

Let's assume that LAST is installed, and you have your human sample sequencer output in "LongRead.fastq".

Human lastdb construction

last-755/src/lastdb -v -P 1 hg19.lastdb hg19.fa
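
If you do not already have the reference FASTA, one way to obtain it is from the UCSC download server (a sketch; any hg19 FASTA will work):

# download and unpack the hg19 reference (UCSC mirror)
curl -O http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz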

Picky processes

This example uses 4 threads for the compute-intensive read alignment step. For faster turnaround, use more threads if your machine has more compute power.

# generate the bash script for processing
./picky.pl script --fastq LongRead.fastq --thread 4 > LongRead.sh

# change script to be executable
chmod u+x LongRead.sh

# start the script
./LongRead.sh
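
A full dataset can take hours to align, so you may prefer to run the script in the background and follow the alignment log (the log file names come from the generated script shown further below; LongRead.run.log is just a name chosen here):

# optional: run in the background and watch the LAST alignment log
nohup ./LongRead.sh > LongRead.run.log 2>&1 &
tail -f LongRead.lastal.log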

When the script finishes, you will find the SV records in the VCF file "LongRead.allsv.vcf" and the alignments in "LongRead.profile.sam".
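
For a quick tally of the called SVs by type, you can count the SVTYPE INFO tags in the VCF (a sketch assuming the records carry SVTYPE annotations, as is conventional for SV VCFs):

# rough per-type SV counts from the VCF
grep -v '^#' LongRead.allsv.vcf | grep -o 'SVTYPE=[A-Z]*' | sort | uniq -c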

See Oxford Nanopore and PacBio sequence data processing for more details, and see cluster support for scaling up on a PBS cluster.

Performance

Processing the public ONT dataset Scrappie chr20 FASTA (277,054 reads) took 2.5 hours using 16 cores and 20 GB of memory.

Demo data

You can try out the public ONT dataset Scrappie chr20 FASTA as Picky's demonstration data.

# download faToFastq
curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToFastq

# make user executable
chmod u+x faToFastq

# download Scrappie base-called chr20 reads
curl -O http://s3.amazonaws.com/nanopore-human-wgs/na12878.chr20ScrappieFiltered.fasta

# convert fasta to fastq with default base quality 'H'
./faToFastq -qual=H na12878.chr20ScrappieFiltered.fasta LongRead.fastq
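
As a sanity check, you can confirm the converted FASTQ contains the expected 277,054 reads (FASTQ stores 4 lines per record):

# count reads in the converted FASTQ
awk 'END{print NR/4}' LongRead.fastq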

You may download the demo data results for comparison, or to better understand the output without running the pipeline. (The .sam file is excluded due to the file size limit.)

The Picky Script

The generated script LongRead.sh from the previous step is shown below; you may customize it before execution or re-execution. Depending on your installation, you may need to adapt the first three blocks of exports.

# general installation
export LASTAL=last-755/src/lastal
export PICKY=./picky.pl

# general configuration
export LASTALDB=hg19.lastdb
export LASTALDBFASTA=hg19.fa
export NTHREADS=4

# FASTQ file
export RUN=LongRead

# reads alignments
time (${LASTAL} -C2 -K2 -r1 -q3 -a2 -b1 -v -v -P${NTHREADS} -Q1 ${LASTALDB} ${RUN}.fastq 2>${RUN}.lastal.log | ${PICKY} selectRep --thread ${NTHREADS} --preload 6 1>${RUN}.align 2>${RUN}.selectRep.log)

# call SVs
time (cat ${RUN}.align | ${PICKY} callSV --oprefix ${RUN} --fastq ${RUN}.fastq --genome ${LASTALDBFASTA} --exclude=chrM --sam)

# generate VCF format
${PICKY} xls2vcf --xls ${RUN}.profile.DEL.xls --xls ${RUN}.profile.INS.xls --xls ${RUN}.profile.INDEL.xls --xls ${RUN}.profile.INV.xls --xls ${RUN}.profile.TTLC.xls --xls ${RUN}.profile.TDSR.xls --xls ${RUN}.profile.TDC.xls > ${RUN}.allsv.vcf
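
The --sam option writes the alignment profile to ${RUN}.profile.sam. If you want to view it in a genome browser, a minimal sketch of converting it to a sorted, indexed BAM (assuming samtools is installed and the SAM carries @SQ header lines; samtools is not a Picky requirement):

# optional: sorted, indexed BAM of the alignment profile for genome browsers
samtools sort -@ ${NTHREADS} -o ${RUN}.profile.bam ${RUN}.profile.sam
samtools index ${RUN}.profile.bam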