Use htseq-count
to generate expression estimates from the SAM/BAM files generated by HISAT2 in the previous session. We will work on them with two short R scripts.
Run htseq-count
on alignment files to produce raw counts for differential expression analysis in the next session. The program examines each read mapping coordinates to see which gene it comes from. Three modes are available:
Refer to the HTSeq documentation for a more detailed explanation:
htseq-count basic usage:
htseq-count [options] <sam_file> <gff_file|gtf_file>
htseq-count
options are specified below:
--format
- specify the input file format one of BAM or SAM. Since we have BAM format files, select 'bam' for this option.--order
- provide the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.--mode
- determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.--stranded
- specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter.--minaqual
- will skip all reads with alignment quality lower than the given minimum value--type
- specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)--idattr
- The feature ID used to identify the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
#!/bin/bash -l
HTSEQ_COUNT_DIR=${RNA_HOME}/expression/htseq_counts
mkdir -p $HTSEQ_COUNT_DIR
cd $RNA_ALIGN_DIR
for bam_base in `ls *_Rep?.bam | rev | cut -c 5- | rev`
do
echo "running htseq-count on $bam_base ..."
htseq-count \
--format bam \
--order pos \
--mode intersection-strict \
--stranded reverse \
--minaqual 1 \
--type exon \
--idattr gene_id \
${bam_base}.bam $RNA_REF_GTF \
> ${HTSEQ_COUNT_DIR}/${bam_base}_gene.tsv
echo "Done."
done
echo "Check the htseq-count output file ${HTSEQ_COUNT_DIR}/${bam_base}_gene.tsv"
Run the bash script above to generate a gene-level read count file,
rc
bash ./htseq_count.sh
A longer gene likely receives more read counts whereas a shorter gene receives fewer reads. Since we use a parameter --type exon
in the run, which considers reads aligned within an exonic region, we count non-overlapped exonic base pairs for each gene before normalization.
Let us launch Rstudio
binf
cd refs/
rstudio gene_id_len.r &
-
Change font style so that cursor aligns with what you are typing in the editor or console. From the rstudio menu:
Tools > Global Options > Appearance > Editor font > Select 'Nimbus Mono L'
These are the contents of gene_id_len.r
:
library(data.table)
source('../tools/readGtf.r')
#load gtf file into data.table class
gtf <- readGtf(gtf_file='chr22_with_ERCC92.gtf')
#convert to GRanges object for better BED operation
library(GenomicRanges)
gtf_gr <- GRanges(seqnames=gtf$chr,
ranges=IRanges(gtf$start,gtf$end),
strand=gtf$strand,
gene_name=gtf$gene_name,
gene_id=gtf$gene_id)
#collapse all overlapped exons to make genomic regions not overlapped
red_gr <- unlist(reduce(split(gtf_gr, elementMetadata(gtf_gr)$gene_id)))
#back to data.table class
gtf_dt <- data.table(gene_id=names(ranges(red_gr)),width=width(red_gr))
#sum exonic regions per gene_id
gtf_dt <- gtf_dt[,sum(width),by=gene_id]
#print out the table to a tab delimited file
fwrite(gtf_dt,file='chr22_with_ERCC92.gtf_len_by_gene.tsv',
quote=FALSE,sep='\t',col.names=FALSE)
To run R scripts:
-
Move your cursor on the first line and hold Ctrl and hit enter to run the line.
-
In order to run more than one line, highlight the lines you want to run, and then hold Ctrl and hit enter.
-
In order to run the entire lines in the script:
- Click
Code > Source
from menu orCtrl+Shift+5
to run the R script from rstudio
- Click
or,
- From ssh terminal, run
gene_id_len.r
at$RNA_REFS_DIR
:
cd $RNA_REFS_DIR
Rscript ./gene_id_len.r
- The command
Rscript
allows you to execute R code.
Check the output file, chr22_with_ERCC92.gtf_len_by_gene.tsv
Merge results files into a single matrix to use in DESeq2 in the next session. The following step joins the results for each replicate together, adds a header, reformats the result as a tab-delimited file, and shows you the first 10 lines of the resulting file :
cd $RNA_HOME/expression/htseq_counts
join UHR_Rep1_gene.tsv UHR_Rep2_gene.tsv | \
join - UHR_Rep3_gene.tsv | \
join - HBR_Rep1_gene.tsv | \
join - HBR_Rep2_gene.tsv | \
join - HBR_Rep3_gene.tsv | \
join - $RNA_HOME/refs/chr22_with_ERCC92.gtf_len_by_gene.tsv > \
gene_read_counts_table_all.tsv
echo "GeneID UHR_Rep1 UHR_Rep2 UHR_Rep3 HBR_Rep1 HBR_Rep2 HBR_Rep3 gene_length" > header.txt
cat header.txt gene_read_counts_table_all.tsv | \
grep -v "__" | \
perl -ne 'chomp $_; $_ =~ s/\s+/\t/g; print "$_\n"' > \
gene_read_counts_table_all_final.tsv
rm -f gene_read_counts_table_all.tsv header.txt
head gene_read_counts_table_all_final.tsv
or you can run the bash script instead and display the first 10 lines of the merged file like this:
rc
bash ./build_read_count_matrix.sh
head -n 10 gene_read_counts_table_all_final.tsv
By normalizing the read count matrix where sequencing depth in library varies, we would like to highlight which genes are expressed more than the others within a sample.
Let Ki be the number of reads aligned to a particular feature (e.g., gene i). Denote the gene length with exons by li. TPM (Transcripts Per Kilobase Million) can be calculated as:
These are the contents of tpm_log2.r
:
library(data.table)
gene_read_count_fn <- "gene_read_counts_table_all_final.tsv"
grc <- fread(gene_read_count_fn)
grc <- grc[!startsWith(GeneID,"ERCC"),]
rc <- as.matrix(grc[,2:7])
#normalize by gene length
rownames(rc) <- grc$GeneID
rcpl <- rc/grc$gene_length
#normalize by sequencing depth
depth_sf <- colSums(rcpl)/1e6
rc_tpm <- log2(rcpl/depth_sf)
rc_tpm[is.infinite(rc_tpm)] <- 0.
log2tpm_file <- sprintf('%s.log2tpm.tsv',gene_read_count_fn)
write.table(rc_tpm,file=log2tpm_file,
quote=F,sep='\t',col.names=NA,row.names=T)
Run this R script at $RNA_HOME/expression/htseq_counts:
rc
Rscript ./tpm_log2.r
less -S gene_read_counts_table_all_final.tsv.log2tpm.tsv
Based on the above read counts, plot the linearity of the ERCC spike-in read counts versus the known concentration of the ERCC spike-in Mix. In this step, we will first prepare a file describing the expected concentrations and fold-change differences for the ERCC spike-in reagent. Next, we will use a Perl script to organize the ERCC expected values and our observed counts for each ERCC sequence. Finally, we will use an R script to produce an x-y scatter plot that compares the expected and observed values.
Run the following bash script $RNA_HOME/expression/htseq_counts/qc_by_ercc.sh
,
rc
less ERCC_Controls_Analysis.txt
perl ./Tutorial_ERCC_expression.pl
# The columns (ID, Subgroup, Concentration) are from a vendor inventory
# information ERCC_Control_Analysis.txt file and
# two columns (Label, Count) are attached from our read count file
# (gene_read_counts_table_all_final.tsv)
less ercc_read_counts.tsv
Rscript ./Tutorial_ERCC_expression.R ercc_read_counts.tsv
To view the resulting figure, open the PDF file:
evince Tutorial_ERCC_expression.pdf &