2. Differential Expression Analysis

2.1 Construction of expression matrix

2.1.1 HTSeq

Given a file with aligned sequencing reads(.sam/.bam) and a list of genomic features(.gtf), a common task is to count how many reads map to each feature(gene).

Usage

htseq-count [options] <alignment_files> <gff_file>
  • 3 overlap resolution models
  • We shared our snakemake package used for exRNA-seq expression matrix construction.Github.

Tips

  • --nonunique
    • --nonunique none (default): the read (or read pair) is counted as ambiguous and not counted for any features. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique.
    • --nonunique all: the read (or read pair) is counted as ambiguous and is also counted in all features to which it was assigned. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique and also separately for each location.
    • Notice that when using --nonunique all the sum of all counts will not be equal to the number of reads (or read pairs), because those with multiple alignments or overlaps get scored multiple times.

Notes

  • -m/--mode {mode}
  • --nonunique={none/all}
  • -s/--stranded {yes/no/reverse}.
  • -a {minaqual}.
  • -t/--type {feature type}. (defult: exon)
  • -i/--idattr {id attribute}, GFF attribute to be used as feature ID. (defult: gene_id)

2.1.2 featureCounts

Usage

Summarize a BAM format dataset:

featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt mapping_results_SE.bam

Summarize multiple datasets at the same time:

featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt library1.bam library2.bam library3.bam

Tips
By default, featureCounts does not count reads overlapping with more than one feature. Users can use the -O option to instruct featureCounts to count such reads (they will be assigned to all their overlapping features or meta-features).

2.2 Filter out lower-expressed genes

Purpose
Filtering low-expression genes improved DEG detection sensitivity.
Methods

  • Filter the genes with a total counts < = threshold;
  • Filter genes with maximum normalized counts across all samples < = threshold;
  • Filter genes with mean normalized counts across all samples < = threshold.

Criteria

Retain genes: (Reads >= 1 counts) >= 20% samples.

2.3 Normalization

Systematic biases

Size factor & global-scale normalization

Notes
Marie-Agnès Dillies, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, Volume 14, Issue 6, 1 November 2013, Pages 671–683 link
Catalina A Vallejos, et al. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods volume 14, pages 565–571 link

2.4 Commonly used tools

2.4.1 DESeq2

The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models
Usage

> dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition)
> dds <- DESeq(dds)
> resultsNames(dds) # lists the coefficients
> res <- results(dds, name="condition_trt_vs_untrt")
# or to shrink log fold changes association with condition:
> res <- lfcShrink(dds, coef="condition_trt_vs_untrt")

Tips
Analyzing RNA-seq data with DESeq2

Notes

We shared scripts on github.

  • DESeq2 normalization: R package DESeq2.Github.

2.4.2 edgeR

Usage

> x <- read.delim("TableOfCounts.txt",row.names="Symbol")
> group <- factor(c(1,1,2,2))
> y <- DGEList(counts=x,group=group)
> y <- calcNormFactors(y)
> design <- model.matrix(~group)
> y <- estimateDisp(y,design)
To perform quasi-likelihood F-tests:
> fit <- glmQLFit(y,design)
> qlf <- glmQLFTest(fit,coef=2)
> topTags(qlf)
To perform likelihood ratio tests:
> fit <- glmFit(y,design)
> lrt <- glmLRT(fit,coef=2)
> topTags(lrt)

Tips
edgeR Users Guide

Notes

We shared scripts on github.

  • TMM normalization: R package edgeR.Github.

2.4.3 RPM

Usage

  1. normalize the reads by library size (edgeR)
  2. identify differential expressed gene using wilcoxon.test()

We shared scripts on github.

  • RPM normalization: R package edgeR to calculate RPM, then test by R function wilcoxon.test().Github.

2.5 Optional: rank-based methods

X Li, et al. A rank-based algorithm of differential expression
analysis for small cell line data with statistical control. Briefings in Bioinformatics, 2017, 1–10 link

2.6 Homeworks

Level I: learn how to calculate differential expression

Level II:

  1. compare the difference between HTSeq and featureCounts
  2. compare the results of differential expression, draw venn plot to show the difference.

Level III:

  1. Try to run rank-based method on our data.
  2. Try to implement the methods using R/Python.

results matching ""

    No results matching ""