Analyze Unmapped Reads

This script is used for the analysis of unmapped reads from next generation sequencing (NGS) data.

Because this script uses scaffolds, it is most effective when the target DNA exists at a high enough depth to ensure sequence continuity. For example, it has been successfully used to identify:

Integrated HHV6 full sequences (~140kb) in 30x WGS data
Parvovirus B19 (5596bp) in samples with mean viral depth >1
Rickettsia Rickettsii (2.1Mb) fragments in a sample with mean viral depth <1

Overview

Broadly, this script works in three steps:

Collect Unmapped Reads Samtools and Bedtools are used to extract all unmapped reads and their mates from bam/cram input
Generate Scaffolds SPAdes is used to generate scaffolds from the unmapped reads
Blast Scaffolds *Blast is used to compare scaffolds to known sequences and identify likely sources of the

Required Programs

This script uses the following programs; alternate versions may also work:

Flags

Flag	Argument	Default	Description
-b	filename (required unless using -q)	-	BAM/CRAM input file
-o	string (required)	-	the basename for the output file
-a	integer (optional)	5	the number of blast alignments
-c	integer (optional)	8	the number of threads to allow for parallel processing
-d	integer (optional)	20	dustmasker level
-f	no argument (optional)	-	stop script after retreiving fastq files for unaligned reads
-k	no argument (optional)	-	keep unmapped fastq files
-l	blast db (optional)	/n/shared_db/blastdb/202008	full path to blast database
-m	integer (optional)	500	minimum scaffold length to keep
-n	integer (optional)	10	max percentage of Ns in order to keep a scaffold
-q	string (optional)	-	use zipped pair1 fastq file as input (fmt: pair1/2 or read1/2)
-r	no argument (optional)	-	use RNA-Seq data
-s	no argument (optional)	-	skip blast step
-t	temp directory (optional)	.	a temporary directory to store intermediate files
-h	no argument (optional)	-	print usage

Example Usage

To use defaults, but keep unmapped read fastq output:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -k

To only return fastq files for unaligned reads:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -f

To use RNA-Seq data and limit scaffold lengths to 1kb or longer:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -r -m 1000

Output

File	Description
*fasta	A fasta file of the final set of scaffolds
*fastq.gz	Paired fastq files for the samples' unaligned reads
*blast[nx].txt	a text file with blast results

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
AUR-graphic.png		AUR-graphic.png
README.md		README.md
analyze-unmapped-reads.sh		analyze-unmapped-reads.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analyze Unmapped Reads

Overview

Required Programs

Flags

Example Usage

Output

About

Releases

Packages

Languages

jalwillcox/Analyze-Unmapped-Reads

Folders and files

Latest commit

History

Repository files navigation

Analyze Unmapped Reads

Overview

Required Programs

Flags

Example Usage

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages