chienlab-rnaseq is a Nextflow pipeline for performing bacterial RNA-Seq analysis.
This pipeline has been heavily inspired by the BactSeq pipeline.
The pipeline will perform the following steps:
- Trim adaptors from reads, perform QC, and filter reads
- Align reads to the reference genome
- Perform read quantification
- Generate BigWig files for visualization in genome browsers
- Size-factor scaling (TMM) and gene length (RPKM) scaling of counts
- Principal component analysis (PCA) of normalised expression values
- Differential gene expression (optional)
You will need to install Nextflow (version 21.10.3+).
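To check whether an installed Nextflow meets this minimum, a version comparison along the following lines may help. This is a minimal sketch, assuming GNU `sort` (for its `-V` flag) is available; `version_ge` is a hypothetical helper name and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Assumes GNU sort, whose -V flag orders version strings naturally.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="21.10.3"
if command -v nextflow >/dev/null 2>&1; then
    # Take the first x.y.z-looking token from the version banner.
    installed="$(nextflow -version 2>/dev/null | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "Nextflow $installed is new enough"
    else
        echo "Nextflow $installed is older than $required" >&2
    fi
else
    echo "nextflow not found on PATH" >&2
fi
```

The comparison sorts the two version strings and checks that the required one comes first, so equal versions also pass.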
nextflow run baldikacti/chienlab-rnaseq --data_dir [dir] --sample_file [file] --ref_genome [file] --ref_ann [file] -profile conda [other_options]
Mandatory arguments:
--data_dir [dir] Path to directory containing FastQ files.
--ref_genome [file] Path to FASTA file containing reference genome sequence (bwa) or multi-FASTA file containing coding gene sequences (kallisto).
--ref_ann [file] Path to GFF file containing reference genome annotation.
--sample_file [file] Path to file containing sample information.
-profile [str] Configuration profile to use.
Available: conda
Other options:
--cont_tabl [file] Path to tsv file containing contrasts to be performed for differential expression.
--l2fc_thresh [str] Absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
--outdir [file] The output directory where the results will be saved (Default: './results').
--p_thresh [str] Adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
--max_memory ['32.GB'] Maximum available memory in the system.
--max_cpus [int] Maximum number of available CPUs in the system.
--max_time ['10.h'] Maximum time requested for the pipeline.
-name [str] Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
Explanation of parameters:
- ref_genome: genome sequence for mapping reads.
- ref_ann: annotation of genes/features in the reference genome.
- sample_file: TSV file containing sample information (see below).
- data_dir: path to directory containing FASTQ files.
- cont_tabl: (optional) table of contrasts to be performed for differential expression.
- p_thresh: adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
- l2fc_thresh: absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
- outdir: the output directory where the results will be saved (Default: ./results).
- -resume: will restart the pipeline if it has been previously run.
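To illustrate how the two thresholds combine, the sketch below filters a toy differential expression table. It is not the pipeline's own code: the column layout (gene, log2FoldChange, padj), the helper name `filter_de`, and the choice of strict versus non-strict comparison are all assumptions made for illustration, and the actual files in the results may be laid out differently.

```shell
#!/usr/bin/env bash
# Hypothetical filter: keep rows with padj < p_thresh and |log2FC| >= l2fc_thresh.
# Assumed tab-separated columns (NOT confirmed by the pipeline):
#   1 = gene, 2 = log2FoldChange, 3 = padj
filter_de() {
    local p_thresh="$1" l2fc_thresh="$2"
    awk -F'\t' -v p="$p_thresh" -v l="$l2fc_thresh" '
        NR == 1 { print; next }             # pass the header through
        $3 == "NA" || $3 == "" { next }     # skip genes without an adjusted p-value
        {
            fc = ($2 < 0) ? -$2 : $2        # absolute log2 fold change
            if ($3 + 0 < p + 0 && fc >= l + 0) print
        }
    '
}

# Toy input, made up for illustration only; uses the default thresholds.
printf 'gene\tlog2FoldChange\tpadj\ngeneA\t2.5\t0.001\ngeneB\t-1.2\t0.03\ngeneC\t0.4\t0.0001\ngeneD\t-3.0\t0.2\ngeneE\t1.8\tNA\n' \
    | filter_de 0.05 1
```

With the defaults, only geneA and geneB survive: geneC changes too little, geneD is not significant, and geneE has no adjusted p-value.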
Genome sequence: FASTA file containing the genome sequence. Can be retrieved from NCBI.
Gene annotation file: GFF file containing the genome annotation. Can be retrieved from NCBI.
Sample file: TSV file containing sample information. Must contain the following columns:
- sample: sample ID.
- file1: name of the first FASTQ file.
- file2: name of the second FASTQ file. (For single-end sequences, leave blank.)
- group: grouping factor for differential expression and exploratory plots.
- rep_no: repeat number (if more than one sample per group).
- paired: are the data paired-end? (0 = single-end, 1 = paired-end).
- strandedness: is the data stranded? Options include unstranded; the example below uses reverse.

If data are single-end, leave the file2 column blank. The sample file can contain a mix of single-end and paired-end samples, and a mix of stranded and unstranded samples.
| sample | file1 | file2 | group | rep_no | paired | strandedness |
|---|---|---|---|---|---|---|
| AS_1 | SRX1607051_T1.fastq.gz | | Artificial_Sputum | 1 | 0 | reverse |
| AS_2 | SRX1607052_T1.fastq.gz | | Artificial_Sputum | 2 | 0 | reverse |
| AS_3 | SRX1607053_T1.fastq.gz | | Artificial_Sputum | 3 | 0 | reverse |
| MB_1 | SRX1607054_T1.fastq.gz | | Middlebrook | 1 | 0 | reverse |
| MB_2 | SRX1607055_T1.fastq.gz | | Middlebrook | 2 | 0 | reverse |
| MB_3 | SRX1607056_T1.fastq.gz | | Middlebrook | 3 | 0 | reverse |
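Because a misplaced tab in the sample file is easy to miss, a quick consistency check before launching a run can be useful. The following is a minimal sketch only: `check_sample_file` is a hypothetical helper, not something the pipeline provides, and it checks just the rules stated above (seven tab-separated columns, paired is 0 or 1, file2 blank for single-end rows). It deliberately does not validate strandedness values.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a sample file; not part of the pipeline.
# Prints one message per problem and returns non-zero if any were found.
check_sample_file() {
    awk -F'\t' '
        NR == 1 {
            if ($0 != "sample\tfile1\tfile2\tgroup\trep_no\tpaired\tstrandedness") {
                print "unexpected header: " $0; bad = 1
            }
            next
        }
        NF != 7 { print "line " NR ": expected 7 columns, found " NF; bad = 1; next }
        $6 != "0" && $6 != "1" { print "line " NR ": paired must be 0 or 1"; bad = 1 }
        $6 == "0" && $3 != ""  { print "line " NR ": file2 should be blank for single-end data"; bad = 1 }
        $6 == "1" && $3 == ""  { print "line " NR ": file2 is required for paired-end data"; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Toy file mirroring the first two rows of the example table.
tmp="$(mktemp)"
printf 'sample\tfile1\tfile2\tgroup\trep_no\tpaired\tstrandedness\n' >  "$tmp"
printf 'AS_1\tSRX1607051_T1.fastq.gz\t\tArtificial_Sputum\t1\t0\treverse\n' >> "$tmp"
printf 'AS_2\tSRX1607052_T1.fastq.gz\t\tArtificial_Sputum\t2\t0\treverse\n' >> "$tmp"
check_sample_file "$tmp" && echo "sample file looks consistent"
rm -f "$tmp"
```

Note the two consecutive tabs in each single-end row: the empty file2 field still has to be present so that the remaining columns line up.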
Optional Contrast Table
| contrast1 | contrast2 |
|---|---|
| Artificial_Sputum | Middlebrook |
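A contrast table like the one above can be written as a tab-separated file with a few lines of shell. This is a sketch under assumptions: `write_contrasts` is a hypothetical helper, the output path is a placeholder, and the header row simply copies the example. In the example, the contrast names match values in the group column of the sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a two-column TSV contrast table.
# Usage: write_contrasts OUT_FILE GROUP_A GROUP_B [GROUP_C GROUP_D ...]
write_contrasts() {
    local out="$1"
    shift
    printf 'contrast1\tcontrast2\n' > "$out"
    # Consume the remaining arguments two at a time, one contrast per row.
    while [ "$#" -ge 2 ]; do
        printf '%s\t%s\n' "$1" "$2" >> "$out"
        shift 2
    done
}

# Placeholder location chosen for illustration.
cont_file="${TMPDIR:-/tmp}/contrast_ref.tsv"
write_contrasts "$cont_file" Artificial_Sputum Middlebrook
cat "$cont_file"
```

The resulting file is what would be passed to `--cont_tabl`.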
- fastp directory containing adaptor-trimmed RNA-Seq files and QC results.
- read_counts directory containing:
  - a table of genes in the annotation.
  - gene_counts.tsv: raw read counts per gene.
  - cpm_counts.tsv: size factor scaled counts per million (CPM).
  - rpkm_counts.tsv: size factor scaled and gene length-scaled counts, expressed as reads per kilobase per million mapped reads (RPKM).
- PCA_samples directory containing principal component analysis results.
- diff_expr directory containing differential expression results.
- bigwig directory containing BigWig files.
- bwa_aln directory containing BAM files.
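For readers comparing the CPM and RPKM tables in read_counts, the conventional definitions are sketched below. This reflects the usual convention for size-factor-scaled counts and is an assumption about, not a statement of, the pipeline's exact computation. The symbols are introduced here for illustration: $c_{ij}$ is the raw count for gene $i$ in sample $j$, $N_j$ is the total count in sample $j$, $f_j$ is the sample's size (normalisation) factor, and $L_i$ is the gene length in base pairs.

```latex
\begin{align}
\mathrm{CPM}_{ij}  &= \frac{c_{ij}}{N_j \, f_j} \times 10^{6} \\
\mathrm{RPKM}_{ij} &= \frac{\mathrm{CPM}_{ij}}{L_i / 10^{3}}
                    = \frac{c_{ij}}{N_j \, f_j \, L_i} \times 10^{9}
\end{align}
```

As a worked example with made-up numbers: a gene with $c_{ij} = 500$ reads in a sample with $N_j = 2 \times 10^{6}$ and $f_j = 1$ has a CPM of 250. If the gene is 2000 bp long, its RPKM is $250 / 2 = 125$.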
#!/bin/bash
#SBATCH --job-name=chienlab-rnaseq-ba # Job name
#SBATCH --partition=cpu # Partition (queue) name
#SBATCH --ntasks=24 # Number of CPUs
#SBATCH --nodes=1 # Number of nodes
#SBATCH --mem=64gb # Job memory request
#SBATCH --time=06:00:00 # Time limit hrs:min:sec
#SBATCH --output=logs/chienlab-rnaseq-ba_%j.log # Standard output and error log
# Load modules
module load nextflow/24.04.3 conda/latest
# Run pipeline
nextflow run baldikacti/chienlab-rnaseq -r v0.1.0 \
--data_dir /path/to/fastq \
--sample_file /path/to/reference.tsv \
--ref_genome /path/to/organism.fasta \
--ref_ann /path/to/annotation.gff \
--cont_tabl /path/to/contrast_ref.tsv \
--outdir /path/to/results \
-profile conda \