Skip to content

Latest commit

 

History

History
236 lines (168 loc) · 8.11 KB

06_expression.md

File metadata and controls

236 lines (168 loc) · 8.11 KB

Expression

Use htseq-count to generate expression estimates from the SAM/BAM files generated by HISAT2 in the previous session. We will work on them with two short R scripts.

HTSEQ-COUNT

Run htseq-count on alignment files to produce raw counts for differential expression analysis in the next session. The program examines each read mapping coordinates to see which gene it comes from. Three modes are available:

ercc_mix

Refer to the HTSeq documentation for a more detailed explanation:

htseq-count basic usage:

htseq-count [options] <sam_file> <gff_file|gtf_file>

htseq-count options are specified below:

  • --format - specify the input file format one of BAM or SAM. Since we have BAM format files, select 'bam' for this option.
  • --order - provide the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.
  • --mode - determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.
  • --stranded - specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter.
  • --minaqual - will skip all reads with alignment quality lower than the given minimum value
  • --type - specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)
  • --idattr - The feature ID used to identify the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
#!/bin/bash -l
HTSEQ_COUNT_DIR=${RNA_HOME}/expression/htseq_counts

mkdir -p $HTSEQ_COUNT_DIR

cd $RNA_ALIGN_DIR

for bam_base in `ls *_Rep?.bam | rev | cut -c 5- | rev`
do
        echo "running htseq-count on $bam_base ..."
        
        htseq-count \
          --format bam \
          --order pos \
          --mode intersection-strict \
          --stranded reverse \
          --minaqual 1 \
          --type exon \
          --idattr gene_id \
          ${bam_base}.bam $RNA_REF_GTF \
          > ${HTSEQ_COUNT_DIR}/${bam_base}_gene.tsv

        echo "Done."
done

echo "Check the htseq-count output file  ${HTSEQ_COUNT_DIR}/${bam_base}_gene.tsv"

Run the bash script above to generate a gene-level read count file,

rc
bash ./htseq_count.sh

Calculate gene lengths

A longer gene likely receives more read counts whereas a shorter gene receives fewer reads. Since we use a parameter --type exon in the run, which considers reads aligned within an exonic region, we count non-overlapped exonic base pairs for each gene before normalization.

Let us launch Rstudio

binf
cd refs/
rstudio gene_id_len.r &
  • Change font style so that cursor aligns with what you are typing in the editor or console. From the rstudio menu:

    Tools > Global Options > Appearance > Editor font > Select 'Nimbus Mono L'

These are the contents of gene_id_len.r:

library(data.table)
source('../tools/readGtf.r')

#load gtf file into data.table class
gtf <- readGtf(gtf_file='chr22_with_ERCC92.gtf')

#convert to GRanges object for better BED operation
library(GenomicRanges)
gtf_gr <- GRanges(seqnames=gtf$chr,
	ranges=IRanges(gtf$start,gtf$end),
	strand=gtf$strand,
	gene_name=gtf$gene_name,
	gene_id=gtf$gene_id)

#collapse all overlapped exons to make genomic regions not overlapped
red_gr <- unlist(reduce(split(gtf_gr, elementMetadata(gtf_gr)$gene_id)))

#back to data.table class
gtf_dt <- data.table(gene_id=names(ranges(red_gr)),width=width(red_gr))

#sum exonic regions per gene_id  
gtf_dt <- gtf_dt[,sum(width),by=gene_id]

#print out the table to a tab delimited file
fwrite(gtf_dt,file='chr22_with_ERCC92.gtf_len_by_gene.tsv',
	quote=FALSE,sep='\t',col.names=FALSE)

To run R scripts:

  • Move your cursor on the first line and hold Ctrl and hit enter to run the line.

  • In order to run more than one line, highlight the lines you want to run, and then hold Ctrl and hit enter.

  • In order to run the entire lines in the script:

    • Click Code > Source from menu or Ctrl+Shift+5 to run the R script from rstudio

or,

  • From ssh terminal, run gene_id_len.r at $RNA_REFS_DIR:
cd $RNA_REFS_DIR
Rscript ./gene_id_len.r
  • The command Rscript allows you to execute R code.

Check the output file, chr22_with_ERCC92.gtf_len_by_gene.tsv

Build a read count matrix from htseq-count output files

Merge results files into a single matrix to use in DESeq2 in the next session. The following step joins the results for each replicate together, adds a header, reformats the result as a tab-delimited file, and shows you the first 10 lines of the resulting file :

cd $RNA_HOME/expression/htseq_counts

join UHR_Rep1_gene.tsv UHR_Rep2_gene.tsv | \
	join - UHR_Rep3_gene.tsv | \
	join - HBR_Rep1_gene.tsv | \
	join - HBR_Rep2_gene.tsv | \
	join - HBR_Rep3_gene.tsv | \
	join - $RNA_HOME/refs/chr22_with_ERCC92.gtf_len_by_gene.tsv > \
	 gene_read_counts_table_all.tsv

echo "GeneID UHR_Rep1 UHR_Rep2 UHR_Rep3 HBR_Rep1 HBR_Rep2 HBR_Rep3 gene_length" > header.txt

cat header.txt gene_read_counts_table_all.tsv | \
	grep -v "__" | \
	perl -ne 'chomp $_; $_ =~ s/\s+/\t/g; print "$_\n"' > \
	gene_read_counts_table_all_final.tsv

rm -f gene_read_counts_table_all.tsv header.txt

head gene_read_counts_table_all_final.tsv

or you can run the bash script instead and display the first 10 lines of the merged file like this:

rc
bash ./build_read_count_matrix.sh
head -n 10 gene_read_counts_table_all_final.tsv

Calculate TPM (transcript per millions) in log2 value

By normalizing the read count matrix where sequencing depth in library varies, we would like to highlight which genes are expressed more than the others within a sample.

Let Ki be the number of reads aligned to a particular feature (e.g., gene i). Denote the gene length with exons by li. TPM (Transcripts Per Kilobase Million) can be calculated as:

TPM_i = \frac{K_i}{l_i}\frac{1}{\sum_j\frac{K_j}{l_j}}10^6

These are the contents of tpm_log2.r:

library(data.table)

gene_read_count_fn <- "gene_read_counts_table_all_final.tsv"

grc <- fread(gene_read_count_fn)
grc <- grc[!startsWith(GeneID,"ERCC"),]
rc <- as.matrix(grc[,2:7])

#normalize by gene length
rownames(rc) <- grc$GeneID
rcpl <- rc/grc$gene_length

#normalize by sequencing depth
depth_sf <- colSums(rcpl)/1e6
rc_tpm <- log2(rcpl/depth_sf)
rc_tpm[is.infinite(rc_tpm)] <- 0.

log2tpm_file <- sprintf('%s.log2tpm.tsv',gene_read_count_fn)
write.table(rc_tpm,file=log2tpm_file,
	quote=F,sep='\t',col.names=NA,row.names=T)

Run this R script at $RNA_HOME/expression/htseq_counts:

rc
Rscript ./tpm_log2.r
less -S gene_read_counts_table_all_final.tsv.log2tpm.tsv

ERCC expression analysis

Based on the above read counts, plot the linearity of the ERCC spike-in read counts versus the known concentration of the ERCC spike-in Mix. In this step, we will first prepare a file describing the expected concentrations and fold-change differences for the ERCC spike-in reagent. Next, we will use a Perl script to organize the ERCC expected values and our observed counts for each ERCC sequence. Finally, we will use an R script to produce an x-y scatter plot that compares the expected and observed values.

Run the following bash script $RNA_HOME/expression/htseq_counts/qc_by_ercc.sh,

rc

less ERCC_Controls_Analysis.txt

perl ./Tutorial_ERCC_expression.pl

# The columns (ID, Subgroup, Concentration) are from a vendor inventory 
# information ERCC_Control_Analysis.txt file and 
# two columns (Label, Count) are attached from our read count file 
# (gene_read_counts_table_all_final.tsv)

less ercc_read_counts.tsv

Rscript ./Tutorial_ERCC_expression.R ercc_read_counts.tsv

To view the resulting figure, open the PDF file:

evince Tutorial_ERCC_expression.pdf &

Q6.1 What is the correlation coefficient?

Q6.2 Can you observe any problematic spike-in control?

Up next

Differential Expression