Influenza virus next-generation sequencing workflows.
- Uses IRMA to iteratively refine reference sequences.
- Analyses variants for standard influenza A and B segments, and influenza A splice variants: PA-X, PB1-F2, M1, M2.
- Reports coding effects of nucleotide changes, taking account of multiple SNPs in a codon, and their phase.
- Support for MinION and MiSeq data.
- Quasispecies and mixed sample sequence reconstruction.
The 'pipeline' is really two snakemake workflows: `preprocess.smk`, which trims reads and runs quality control measures, and `irma.smk`, which runs IRMA and generates summary output.
These workflows call various other programs:
- fastqc is used for generating quality control reports of the raw reads.
- Trimmomatic is used to trim adapters off reads.
- IRMA is used to match reads to flu reference sequences.
- VEP is used to identify the effect of nucleotide changes at the protein level.
- VEP is written in Perl and requires a module called `Bundle::DBI`. Install it with `perl -MCPAN -e 'install Bundle::DBI'`.
- tabix is required to preprocess files for VEP. Install via `perl -MCPAN -e 'install Bio::DB::HTS::Tabix'`.
- bgzip & gunzip are used for (de)compression.
- gffread is used to write transcripts from fasta and GFF files.
- transeq is used to translate transcripts.
- chopper is used for filtering and trimming MinION data.
- MinIONQC is used for generating quality control reports of MinION data.
- TenSQR and aBayesQR are used for quasispecies / mixed sample sequence inference.
- minimap2 is used to map reads to IRMA reference sequences for quasispecies / mixed sample reconstruction.
Versions are listed in `workflow/envs/*.yaml`.
There are several python scripts in `workflow/scripts` which have some dependencies. Make a python virtual environment and install the requirements with:

```shell
pip install -r requirements.txt
```
Read more about what python virtual environments are, why they are useful, and how to set them up here.
You could use the same virtual environment for each analysis. If you have one set up, activate it with:

```shell
source ~/.virtualenvs/flu-ngs-env/bin/activate
```
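Putting the two commands above together, first-time setup might look like the following sketch. The `~/.virtualenvs/flu-ngs-env` path is just the convention used above; any location works:

```shell
# Create the virtual environment once (the path is only a convention)
python3 -m venv ~/.virtualenvs/flu-ngs-env
# Activate it in the current shell session
. ~/.virtualenvs/flu-ngs-env/bin/activate
```

Once activated, install the dependencies with the `pip install -r requirements.txt` command above.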
Each time you have samples to run, I would suggest cloning this repository:

```shell
git clone git@github.com:IRI-UW-Bioinformatics/flu-ngs.git <name>
```

where `<name>` is the name of the directory that you want, then `cd <name>`.
The next step is to put the read files in a structure expected by the workflow.
For MiSeq data, put reads in a directory called `raw` with the following structure:
```
raw/
├── trimlog.fas
├── YK_2832
│   ├── YK_2832_1.fastq
│   └── YK_2832_2.fastq
├── YK_2833
│   ├── YK_2833_1.fastq
│   └── YK_2833_2.fastq
├── YK_2834
│   ├── YK_2834_1.fastq
│   └── YK_2834_2.fastq
...
```
It is fine if the `fastq` files are gzipped (i.e. have a `.gz` suffix). Forward and reverse reads should be in `{sample}_1.fastq` and `{sample}_2.fastq` respectively. `trimlog.fas` should contain the adapters.
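As a sketch, the layout for a single sample can be created like this (the sample name `YK_2832` is taken from the example tree above, and the files here are empty placeholders):

```shell
# Create the per-sample directory the workflow expects
mkdir -p raw/YK_2832
# Forward and reverse reads (placeholders; real runs use actual fastq files)
touch raw/YK_2832/YK_2832_1.fastq raw/YK_2832/YK_2832_2.fastq
# The adapter file sits at the top level of raw/
touch raw/trimlog.fas
```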
For MinION data, put reads in a directory called `raw`. They must be gzipped (end with `.fastq.gz`).
```
raw/
├── barcode05
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_820.fastq.gz
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_821.fastq.gz
...
├── barcode06
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_30.fastq.gz
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_31.fastq.gz
...
```
The names of the subdirectories (`barcode05` and `barcode06`) define the names of samples. All `.fastq.gz` files in each subdirectory are assigned to that sample name.
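To sketch how sample names fall out of this layout: the subdirectory names under `raw/` are exactly the sample IDs you later list in the config (the barcode names and placeholder file below are illustrative):

```shell
# Recreate a minimal MinION-style layout with empty placeholder files
mkdir -p raw/barcode05 raw/barcode06
touch raw/barcode05/reads_0.fastq.gz raw/barcode06/reads_0.fastq.gz
# The sample names the workflow sees are simply the subdirectory names
for d in raw/*/; do basename "$d"; done
```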
Run parameters are passed to the workflow in a file called `config.json` that should have these keys:

- `platform`. Either `minion` or `miseq`.
- `samples`. A list containing the sample IDs to analyse.
- `pair`. A list containing one or more of `paired`, `unpaired` and `combined`. This controls whether to analyse paired, unpaired and/or paired and unpaired reads combined. For MinION data `pair` must be `longread`. The concept of paired and unpaired reads does not apply to MinION data, but for ease of implementation `longread` is used as a placeholder in the workflow wherever `paired`/`unpaired`/`combined` would be used in a MiSeq analysis.
- `order`. A list containing one or both of `primary` and `secondary`. See the IRMA documentation for the distinction between primary and secondary data, and residual and secondary assemblies. If both `primary` and `secondary` are supplied the workflow will do both, and runtime will approximately double.
- `errors`. Either `warn` or `raise`. This controls error handling for some steps in the workflow. `warn` issues warnings if something goes wrong, but will attempt to carry on; `raise` stops the workflow instead.
- `qsr`. A list that can either be empty or contain some combination of `tensqr` and `abayesqr`, to run a particular quasispecies spectrum reconstruction (QSR) algorithm. See below for more details.
MiSeq example:

```json
{
    "platform": "miseq",
    "samples": [
        "YK_2837",
        "YK_2970"
    ],
    "pair": [
        "combined"
    ],
    "order": [
        "primary",
        "secondary"
    ],
    "errors": "warn",
    "qsr": ["tensqr"]
}
```
MinION example:

```json
{
    "platform": "minion",
    "samples": [
        "barcode05",
        "barcode06"
    ],
    "pair": [
        "longread"
    ],
    "order": [
        "primary"
    ],
    "errors": "warn",
    "qsr": ["tensqr"]
}
```
(To format JSON nicely do `python -m json.tool < dirty.json > clean.json`.)
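`json.tool` also rejects malformed JSON with a non-zero exit status, so the same command doubles as a quick check of `config.json` before a run. A small sketch (the inline fragments are illustrative; note that JSON forbids trailing commas):

```shell
# Valid JSON is pretty-printed to stdout
printf '{"errors": "warn"}' | python3 -m json.tool
# A trailing comma makes json.tool exit non-zero
printf '{"order": ["primary",]}' | python3 -m json.tool || echo "config is invalid"
```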
Reads are first preprocessed. For MiSeq data, adapters are trimmed and quality control reports are generated. For MinION data, reads are filtered by a minimum and maximum length. See `workflow/rules/preprocess-{minion,miseq}.smk` for details.

```shell
snakemake -s workflow/preprocess.smk -c all
```

`-c all` tells snakemake to use all available cores; scale this back if need be. HTML QC reports for MinION data are saved in `results/qc-raw` and `results/qc-trimmed`.
Run IRMA and make summary reports of the output:

```shell
snakemake -s workflow/irma.smk -c all
```

Three summary files are generated:

- `results/<order>/xlsx/variants-mcc-by-sample-ordered.xlsx`. In this file, each sample has its own sheet, and each sheet contains all influenza segments found in that sample.
- `results/<order>/xlsx/variants-mcc-by-segment-ordered.xlsx`. Like above, but each segment gets its own sheet.
- `results/<order>/xlsx/variants-mcc-flat-ordered.xlsx`. This contains all variants in one flat sheet.

IRMA consensus sequences and amino acid translations are put in `results/<order>/seq`.
Splice variants of MP, PA and PB1 are all based on the assumption that IRMA finds canonical-length consensus sequences for these segments (see here for more details). If IRMA finds a consensus sequence for one of these segments that is not the expected length, then the behaviour is determined by the config file:

- If `"errors": "warn"` is used, a logfile is produced at the end of the run (`logs/incorrect-splice-vars-<order>.log`) detailing any segments where the consensus was not the expected length. For these segments, it is highly likely that the variant analysis will be incorrect.
- If `"errors": "raise"` is used instead, the workflow stops immediately when a length mismatch occurs.
(For NS, we know to expect more variability in segment length, and the locations of exons are flexibly determined.)
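One quick way to spot a suspect segment yourself is to measure each consensus sequence directly. A minimal sketch with `awk` (the FASTA record below is a made-up fragment, not a real influenza consensus, and the file name is illustrative):

```shell
# Write a tiny illustrative FASTA file
printf '>A_MP_consensus\nATGAGTCTTCTAACCGAGGTCGAAACG\n' > consensus.fasta
# Print each record name with its sequence length
awk '/^>/ {if (name) print name, len; name = substr($1, 2); len = 0; next}
     {len += length($0)}
     END {if (name) print name, len}' consensus.fasta
# prints: A_MP_consensus 27
```

Comparing these lengths against the canonical segment lengths flags the same cases the logfile reports.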
The workflow implements a couple of different programs that try to estimate sequences of segments that may differ slightly from the reference sequences, either because they are part of the viral quasispecies or because they come from mixed samples. See the TenSQR and aBayesQR repos and papers for more details on these methods. Both programs are quite error prone and may fail for unknown reasons on perfectly good input files. I have tried to put in more catches for the TenSQR errors, as it seemed to be slightly better than aBayesQR.
TenSQR and aBayesQR fail to run when passed BAM files generated by IRMA, so minimap2 is used to map all reads (paired and unpaired) to the IRMA combined reference sequence. Therefore, if you want to run these algorithms you must have `combined` in the `pair` list of the config file.
The inferred sequences are put in `results/{order}/qsr/{sample}/tensqr_ViralSeq.fasta` (or `{sample}_abayesqr.fasta`), which also contains their estimated frequency in the sample.
If a segment is missing from that file (or the file does not exist), then check:

- `results/{order}/qsr/{sample}/{segment}/.tensqr_extractmatrix_log.txt`
- `results/{order}/qsr/{sample}/{segment}/.tensqr_log.txt`

The extract-matrix log file comes from the first step in the sequence reconstruction, where single nucleotide variants in BAM files get represented as a binary matrix.
A plot that summarises the variants detected in the QSR inference and by IRMA is saved as `results/{order}/qsr/{sample}/tensqr_irma_snps.pdf` (a `.png` is also generated). Only SNPs are shown; insertions and deletions are ignored. The frequencies of minor variants detected by IRMA are shown as coloured circles. Each sequence that TenSQR found is represented by a horizontal grey line across the plot; the height of the line tells you its frequency. Black crosses on a line indicate the SNPs that that particular reconstructed sequence has. (Lines without any crosses show the reference sequence.)
Most software for looking at read alignments requires sorted BAM files and/or BAM index files. I've written a small workflow that generates these for all BAM files IRMA produces. It requires samtools. `.sorted.bam` and `.sorted.bam.bai` files are saved in the same directory as the original `.bam` files. Do:

```shell
snakemake -s workflow/sort-bam.smk -c all
```