1. Preprocessing, Mapping and QC

How to process RNA-seq data.

0) Understand your data

Before processing, make sure you have a correct understanding of what data you are analysing.

Basic information

  • Sequencing date
  • Contributor: department or laboratory
  • Accession: GEO/SRA/PRJNA numbers in NCBI, or another dataset ID (clear and brief)
  • Data source: download path or generated in-house

Sample information

  • Organism
  • Disease state: cancer type or healthy
  • Tissue or other source (cell)
  • Molecule: DNA or RNA; total RNA, poly-A RNA, or small RNA?

Sequencing strategy

  • Sequencing platform
  • Library layout: single-end vs. paired-end?
  • Insert length: 50/100/150?
  • Strand specific?
  • Size selection?
  • Enrichment: poly-A enriched or total RNA (ribosomal RNA removed)?
  • Cellular localization: whole cell, nuclear or cytoplasmic?

When analysing published datasets, you can get all of this information from the NCBI website.

Our example HCC exRNA-seq data

Basic information

  • Sequencing date: 2017.11-
  • Contributor: Lulab Tsinghua University
  • Accession: NA
  • Data source: generated in-house

Sample information

  • Organism: Homo sapiens
  • Disease state: hepatocellular carcinoma (HCC)
  • Tissue or other source: plasma
  • Molecule: total RNA (fragments)

Sequencing strategy

  • Sequencing platform: Illumina HiSeq X
  • Library layout: Paired-end 150
  • Insert length (RNA length): ~30 nt
  • Strand specific: Yes
  • Size selection: none
  • Enrichment: total RNA, ribosomal RNA not removed
  • Cellular localization: whole extracellular circulating RNA

Flow chart of exRNA-seq processing method

Input/output of each procedure

| Step | Input | Tool/script | Output | Note |
|---|---|---|---|---|
| 1. Preprocessing | 00.rawdata/*.fastq | - | 02.rRNA/*/*.no_rRNA.fastq | ~/projects/exRNA/hcc_examples/ |
| 1.1 index | ~/genome/human_hg38/gtf/*.gtf | - | ~/genome/human_hg38/index/bowtie2_index/*.bt2 | - |
| 1.1.1 .gtf to .fa | ~/genome/human_hg38/sequence/GRCh38.p12.genome.fa, ~/genome/human_hg38/gtf/*.gtf | bedtools | *.fa | - |
| 1.1.2 .fa to .bt2 | *.fa | bowtie2-build | *.bt2 | - |
| 1.2 fastqc | 00.rawdata/*.fastq | fastqc | 00.rawdata/*_fastqc.html | check raw reads' quality |
| 1.3 Remove 3' adapter and trim reads | 00.rawdata/*.fastq | cutadapt | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq | trim low quality ends (plus a hard cutoff: 50nt) |
| 1.4 2nd fastqc | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq | fastqc | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3_fastqc.html | make sure the low quality reads have been removed and/or trimmed |
| 1.5 Remove ribosomal RNA | 01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq + index file | bowtie2 | 02.rRNA/sampleID/sampleID.rRNA.sam, 02.rRNA/sampleID/sampleID.no_rRNA.fastq | - |
| 2. Mapping | 02.rRNA/sampleID/sampleID.no_rRNA.fastq + index files | bowtie2 | 03.mapping/sampleID/01.miRNA/sampleID.miRNA.sam, 03.mapping/sampleID/01.miRNA/sampleID.no_miRNA.fastq -> 03.mapping/sampleID/02.piRNA/sampleID.piRNA.sam, 03.mapping/sampleID/02.piRNA/sampleID.no_piRNA.fastq -> ... | map to different RNA annotations step by step |

1) Preprocessing

Quality-control the raw reads and extract the clean RNA sequences.

0. File format

0.1 Annotation file format (.gtf or .gff)

The Gene transfer format (GTF) is a tab-delimited file format used to hold information about gene annotation.


chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
  • You can get the Gencode27 annotation files at: /BioII/lulab_b/shared/genomes/human_hg38/gtf/
  • For the download source and processing method of the annotation files (.gtf/.gff), please refer to the README file in this directory.
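
Because every GTF record has nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes), simple subsets can be pulled out with awk. The sketch below selects exon records of one gene_type. The function name `filter_exons` and the three demo records are made up for illustration; the README mentioned above remains the reference for how the shared annotation files were actually produced.

```bash
# filter_exons GTF GENE_TYPE: print exon records whose attribute column
# contains gene_type "GENE_TYPE";
filter_exons() {
    awk -F '\t' -v gt="$2" '
        $3 == "exon" && index($9, "gene_type \"" gt "\";") > 0
    ' "$1"
}

# Tiny made-up GTF (not real annotation): one gene line and two exon lines.
demo_gtf=$(mktemp)
printf 'chr1\tDEMO\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "rRNA";\n' >  "$demo_gtf"
printf 'chr1\tDEMO\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_type "rRNA";\n' >> "$demo_gtf"
printf 'chr1\tDEMO\texon\t500\t700\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding";\n' >> "$demo_gtf"

# Only the rRNA exon line is printed.
filter_exons "$demo_gtf" rRNA
```

Matching on the closing quote and semicolon keeps `rRNA` from also matching longer type names that merely start with the same letters.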

0.2 FASTQ Format (.fastq or .fq)

FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and its quality score are encoded with a single ASCII character for brevity.
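
Two consequences of this layout are handy in practice. Each record occupies four lines (header, sequence, separator, quality), so the read count of an uncompressed file is its line count divided by four, and a quality character can be decoded from its ASCII code. The sketch below assumes unwrapped four-line records and Phred+33 encoding (the usual encoding on recent Illumina instruments). The helper names `count_reads` and `phred_score` and the two demo reads are made up for illustration.

```bash
# count_reads FILE: number of records in an uncompressed FASTQ file,
# assuming the usual four lines per record (no wrapped sequences).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# phred_score CHAR: decode one quality character, assuming Phred+33 encoding.
phred_score() {
    ascii=$(printf '%d' "'$1")   # numeric ASCII code of the character
    echo $((ascii - 33))
}

# Tiny made-up FASTQ file with two reads.
demo_fastq=$(mktemp)
cat > "$demo_fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCA
+
IIII#
EOF

count_reads "$demo_fastq"   # prints 2
phred_score I               # prints 40 (ASCII 73 - 33)
phred_score '#'             # prints 2  (ASCII 35 - 33)
```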






Example file location


Present working directory:

your own project directory: ~/projects/exRNA/hcc_examples/

Create symbolic links:

mkdir 00.rawdata
cd 00.rawdata
ln -s /BioII/lulab_b/shared/projects/exRNA/hcc_examples/fastq/* .

In ~/projects/exRNA/hcc_examples/, create a file named "sample_name" that lists the sample names (the file names without the .fastq suffix, e.g. NC_1), one per line.
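
One possible way to build this file is to strip the .fastq suffix from every file in 00.rawdata. This is a sketch that assumes each raw file is named <sample>.fastq; `list_samples` is a hypothetical helper, and the demonstration runs on empty placeholder files in a scratch directory rather than on the real data.

```bash
# list_samples DIR: print one sample name per line by stripping the .fastq suffix
list_samples() {
    for f in "$1"/*.fastq; do
        [ -e "$f" ] || continue      # skip when the glob matches nothing
        basename "$f" .fastq
    done
}

# Demonstration on a scratch directory with empty placeholder files.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/00.rawdata"
touch "$demo_dir/00.rawdata/NC_1.fastq" "$demo_dir/00.rawdata/AfterSurgery_1.fastq"

list_samples "$demo_dir/00.rawdata" > "$demo_dir/sample_name"
cat "$demo_dir/sample_name"
```

In the tutorial layout the equivalent call would be `list_samples 00.rawdata > sample_name`, run from ~/projects/exRNA/hcc_examples/.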


1.1 Build index

1.1.1 Link genomic annotation file to your home directory

Present working directory: your own home.

mkdir -p genome/human_hg38/index/bowtie2_index
mkdir genome/human_hg38/gtf
mkdir genome/human_hg38/sequence
ln -s /BioII/lulab_b/shared/genomes/human_hg38/gtf/* ./genome/human_hg38/gtf/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/sequence/GRCh38.p12.genome.fa ./genome/human_hg38/sequence/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/index/bowtie2_index/bowtie2_hg38_index/ genome/human_hg38/index/bowtie2_index/

1.1.2 Get RNA sequence(.fa) from annotation file(.gtf)

Use rRNA as example.

  • Tool: bedtools - getfasta
  • Annotation file: rRNA_exon.gff

Present working directory: ~/genome/human_hg38/index/bowtie2_index/

mkdir 00.bowtie2_rRNA_index
cd 00.bowtie2_rRNA_index/
bedtools getfasta -fi ../../../sequence/GRCh38.p12.genome.fa -bed ../../../gtf/rRNA_exon.gff -fo rRNA.fa

1.1.3 Build bowtie2 index

Use rRNA as example.

Present working directory: ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index

bowtie2-build rRNA.fa rRNA

Output files: rRNA.1.bt2, rRNA.2.bt2, rRNA.3.bt2, rRNA.4.bt2, rRNA.rev.1.bt2, rRNA.rev.2.bt2


NOTE: Use exon sequences to build index.

1.2 fastqc

Use FastQC to check the quality of the raw reads.

fastqc AfterSurgery_1.fastq

Processing multiple files.

Write the following command to "fastqc.sh":

#!/bin/bash
# Run from 00.rawdata/; ../sample_name lists one sample name per line.

for i in $(cat ../sample_name); do
    fastqc "${i}.fastq"
done

Run "fastqc.sh"

bash fastqc.sh &

Check the fastqc results

  • Confirm the fastqc results files *_fastqc.html and *_fastqc.zip;
  • Copy *_fastqc.html to your computer and open in a web browser;
  • Check the quality of raw reads;
  • Filter out low-quality samples whose median "per base sequence quality" is below 28 (median line below the green area).

fastqc results.

1.3 Remove adapters and trim reads

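Raw reads usually contain adapter sequences, which must be removed to obtain clean reads.

Use cutadapt to remove adapters and trim low-quality bases. In the command below, -a gives the 3' adapter sequence, -u -100 removes 100 bases from the 3' end of every read (with 150-nt reads this keeps the first 50 nt, the hard cutoff), -q 30,30 trims bases with quality below 30 from both ends, --trim-n removes flanking N bases, and -m 15 discards reads shorter than 15 nt (these are written to the file given by --too-short-output).
Present working directory: ~/projects/exRNA/hcc_examples/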

  mkdir -p 01.cutAdapter/NC_1
  cutadapt -u -100 -q 30,30 --trim-n -m 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --too-short-output=./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.TooShort.fastq -o ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq ./00.rawdata/NC_1.fastq >./01.cutAdapter/NC_1/NC_1.cutAdapt3.log

The command above processes one sample; you can use a bash script to process multiple samples.
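
One common pattern for multiple samples is to generate a file with one command line per sample, inspect it, and then run it. The sketch below reuses the options of the single-sample command and assumes the raw reads are at 00.rawdata/<sample>.fastq, as created by the symbolic-link step. The function name `make_cutadapt_cmds` and the temporary files are illustrative only, and no cutadapt job is started by the sketch itself.

```bash
# make_cutadapt_cmds SAMPLE_LIST: print one "mkdir + cutadapt" command per sample name
make_cutadapt_cmds() {
    adapter=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    while read -r s; do
        [ -n "$s" ] || continue          # ignore blank lines
        out=./01.cutAdapter/$s
        printf '%s\n' "mkdir -p $out && cutadapt -u -100 -q 30,30 --trim-n -m 15 -a $adapter --too-short-output=$out/$s.trimmed.cutAdapt3.TooShort.fastq -o $out/$s.trimmed.cutAdapt3.fastq ./00.rawdata/$s.fastq > $out/$s.cutAdapt3.log"
    done < "$1"
}

# Demonstration with a two-sample list (stand-in for the sample_name file).
demo_list=$(mktemp)
printf 'NC_1\nAfterSurgery_1\n' > "$demo_list"

demo_cmds=$(mktemp)
make_cutadapt_cmds "$demo_list" > "$demo_cmds"
cat "$demo_cmds"
```

In the project directory this could be used as `make_cutadapt_cmds sample_name > cutadapt_cmds.sh`, followed by `bash cutadapt_cmds.sh` once the generated commands look right.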

Check the cutadapt log file (sample.cutAdapt3.log) for statistics on the adapter-trimmed reads.
cutadapt log file.

For more information about cutadapt, see the cutadapt documentation.

1.4 2nd fastQC

A second FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent.

Run fastqc on the adapter-trimmed reads

fastqc ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq

Differences from the first FastQC run:

  • The per-base quality scores should have changed, typically improving towards the 3' end after quality trimming.
  • Sequencing adapters are no longer identified as over-represented.

1.5 Remove ribosomal RNA mapped reads

Why remove ribosomal RNA?

  • Ribosomal RNA (rRNA) makes up roughly 80-90% of total RNA.
  • A large proportion of rRNA reduces the number of usable reads obtained from a sample.
  • Samples with an excessive rRNA fraction should be filtered out.

Toolkit: Bowtie2
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning short reads.

Mapping reads to rRNA

Present working directory: ~/projects/exRNA/hcc_examples

mkdir -p 02.rRNA/NC_1
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index/rRNA -S 02.rRNA/NC_1/NC_1.rRNA.sam --un ./02.rRNA/NC_1/NC_1.no_rRNA.fastq ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq 2> 02.rRNA/NC_1/NC_1.rRNA.log

If you want .bam output, convert the .sam file with samtools:

samtools view -bS in.sam >out.bam

Check the log file (NC_1.rRNA.log) to calculate the rRNA mapping ratio; bowtie2 writes this alignment summary to standard error.
bowtie2 log file.
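
The rRNA mapping ratio and the number of kept (unmapped) reads can be read straight from the bowtie2 summary. This is a minimal sketch for single-end summaries. The helper names `rrna_ratio` and `kept_reads` are hypothetical, and the demo log contains invented numbers in the layout bowtie2 prints for unpaired reads; it is not output from the HCC samples.

```bash
# rrna_ratio LOG: overall alignment rate reported by bowtie2 (e.g. "8.00%")
rrna_ratio() {
    awk '/overall alignment rate/ { print $1 }' "$1"
}

# kept_reads LOG: number of reads that did not align, i.e. the reads written by --un
kept_reads() {
    awk '/aligned 0 times/ { print $1 }' "$1"
}

# Made-up example summary in bowtie2's single-end layout.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    9200 (92.00%) aligned 0 times
    500 (5.00%) aligned exactly 1 time
    300 (3.00%) aligned >1 times
8.00% overall alignment rate
EOF

rrna_ratio "$demo_log"   # prints 8.00%
kept_reads "$demo_log"   # prints 9200
```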

The commands above process one sample; you can use a bash script to process multiple samples.

2) Mapping

Map reads to the different human RNA types.

  • Input: 02.rRNA/*/*.no_rRNA.fastq
  • Tool: bowtie2.
  • Working directory: ~/projects/exRNA/hcc_examples/
mkdir -p 03.mapping/NC_1
  • Map in a fixed order so that each read is assigned to only one RNA type:

    • 01.miRNA
    • 02.piRNA
    • 03.Y RNA
    • 04.snRNA
    • 05.srpRNA
    • 06.tRNA
    • 07.lncRNA
    • 08.mRNA
    • 09.other human genome
    • 10.non-human genome
  • Use the unmapped .fastq file from the previous step as input to the next step of the ordered mapping.

  • Methods: follow "1.5 Remove ribosomal RNA mapped reads" above.

01. Use the rRNA-removed reads as input for mapping to miRNA
mkdir -p 03.mapping/NC_1/01.miRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/01.bowtie2_miRNA_index/miRNA -S 03.mapping/NC_1/01.miRNA/NC_1.miRNA.sam --un ./03.mapping/NC_1/01.miRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/NC_1.no_rRNA.fastq 2> 03.mapping/NC_1/01.miRNA/NC_1.miRNA.log
02. Use the miRNA-removed reads as input for mapping to piRNA
mkdir -p 03.mapping/NC_1/02.piRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/02.bowtie2_piRNA_index/piRNA -S 03.mapping/NC_1/02.piRNA/NC_1.piRNA.sam --un ./03.mapping/NC_1/02.piRNA/NC_1.no_piRNA.fastq ./03.mapping/NC_1/01.miRNA/NC_1.no_miRNA.fastq 2> 03.mapping/NC_1/02.piRNA/NC_1.piRNA.log
03.Y_RNA, 04.snRNA, 05.srpRNA, 06.tRNA, 07.lncRNA, 08.mRNA, 09.hg38
  • Follow the methods above.
  • Reads that map to none of these 8 RNA types but do map to the human genome (09.hg38) are assigned to "other human genome".
  • Reads that cannot be mapped to the human genome are assigned to "non-human".
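
The chaining rule (each step takes the unmapped reads of the previous step as input) can be made explicit before running anything. The sketch below only prints the planned input and unmapped-output file for each step of one sample. The function name `plan_mapping` is hypothetical, and the output names for steps 03 to 09 are an assumption extrapolated from the miRNA and piRNA naming pattern.

```bash
# plan_mapping SAMPLE: print "step<TAB>input<TAB>unmapped output" for the ordered mapping.
plan_mapping() {
    s=$1
    input=./02.rRNA/$s/$s.no_rRNA.fastq          # start from the rRNA-removed reads
    for step in 01.miRNA 02.piRNA 03.Y_RNA 04.snRNA 05.srpRNA 06.tRNA 07.lncRNA 08.mRNA 09.hg38; do
        rna=${step#*.}                           # e.g. 01.miRNA -> miRNA
        unmapped=./03.mapping/$s/$step/$s.no_$rna.fastq
        printf '%s\t%s\t%s\n' "$step" "$input" "$unmapped"
        input=$unmapped                          # the next step starts from these reads
    done
}

plan_mapping NC_1
```

Each printed line corresponds to one bowtie2 call in which the second column is the read file and the third column is the --un target. The index path for each step has to be filled in by hand, since only the miRNA and piRNA index locations are given above.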

3) QC (Quality Control)

When processing real experimental data, many factors can affect the RNA-seq results, so it is important to take all possibilities into consideration and to filter out low-quality samples before downstream analyses.

After preprocessing and mapping, you can compute some basic statistics on the processed datasets.

3.1 Data summary

3.1.1 Summarize the pre-processing and mapping results before downstream analyses.

The summary should contain at least the following:


  • sample_id
  • sequenced date or batch number
  • Raw reads number (raw fastq file or cutadapt log file)


  • Clean reads number and percentage of raw reads (cutadapt log file)
  • rRNA number and percentage of clean reads (rRNA mapping log file of bowtie2)
  • kept reads (rRNA mapping log file of bowtie2)

Mapping ratio

  • human genome (human genome mapping log file of bowtie2)
  • miRNA
  • piRNA
  • Y RNA
  • snRNA
  • srpRNA
  • tRNA
  • lncRNA
  • mRNA
  • Other human genome region (difference between whole genome and mapped RNA)
  • Non-human genome (human genome mapping log file of bowtie2)

You can combine bash commands or use your own code to summarize the data processing information. We share a simple bash script on GitHub for your reference.

3.1.2 Visualize the mapping ratios with a bar chart or scatter plot.


3.1.3 Check whether the mapping ratios are consistent with prior knowledge or published reports. If not, the sample may be biased and should be filtered out.


3.2 Reads length distribution

3.2.1 Calculate the read lengths of:

  • raw reads (raw fastq file).
  • clean reads (low-quality trimmed and rRNA-removed reads).
  • mapped reads of miRNA, piRNA, Y RNA, snRNA, srpRNA, tRNA, lncRNA, mRNA (mapped result sam file).

    NOTE: The sam/bam files record mapping information for all reads, both mapped and unmapped. Pay attention to the "flag" in the 2nd column of the sam/bam file (see the SAM file format specification); flag bit 0x4 marks an unmapped read.

You can combine bash commands or use your own code to summarize read lengths. We share a simple bash script on GitHub for your reference.
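
As one possible starting point, both tables can be produced with awk alone. This is a sketch, not the shared script: it assumes four-line FASTQ records and one alignment line per read in the SAM file. The function names and the tiny demo files are made up for illustration.

```bash
# fastq_lengths FASTQ: print "length<TAB>count", sorted by length
fastq_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# sam_mapped_lengths SAM: the same table for mapped reads only.
# Header lines start with "@"; flag bit 4 (0x4) set means the read is unmapped.
sam_mapped_lengths() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n[length($10)]++ }
         END { for (l in n) print l "\t" n[l] }' "$1" | sort -n
}

# Made-up FASTQ with read lengths 10, 5 and 10.
demo_fq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nTTGCA\n+\nIIIII\n@r3\nGGGGGCCCCC\n+\nIIIIIIIIII\n' > "$demo_fq"

# Made-up SAM: one header line, two mapped reads (flags 0 and 16), one unmapped (flag 4).
demo_sam=$(mktemp)
printf '@SQ\tSN:demo\tLN:100\n' > "$demo_sam"
printf 'r1\t0\tdemo\t1\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n' >> "$demo_sam"
printf 'r2\t16\tdemo\t9\t42\t5M\t*\t0\t0\tTTGCA\tIIIII\n' >> "$demo_sam"
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACG\tIIIIIII\n' >> "$demo_sam"

fastq_lengths "$demo_fq"        # two rows: 5 -> 1 read, 10 -> 2 reads
sam_mapped_lengths "$demo_sam"  # one row: 5 -> 2 reads (the unmapped read is skipped)
```

The resulting two-column tables can be loaded directly into a plotting tool for the bar chart in 3.2.2.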

3.2.2 Visualize the reads length distribution by scatter plot/bar chart.


3.2.3 Check whether the read length distribution is consistent with prior knowledge or published reports. If not, the sample may be biased and should be filtered out.

3.3 Quality control criteria:

You can define a set of criteria to QC and filter your data.
Here we shared our QC criteria for your reference.

| Check point | Threshold | Input |
|---|---|---|
| Raw reads quality | reads quality > 28 (median lines in green area) | Check fastqc results (*.html) |
| Clean reads number | > 10 million | cutadapt log file |
| rRNA ratio | < 10% | bowtie2 rRNA mapping log file |
| human genome mapped ratio | > 65% | bowtie2 human genome mapping log file |
| other genome region ratio | < 10% | difference between whole genome and mapped RNA |
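
Once the summary table from 3.1.1 exists, the numeric criteria can be applied automatically. The sketch below assumes a tab-separated file with a header line and the hypothetical columns sample_id, clean_reads, rRNA_ratio, genome_ratio and other_ratio (ratios as percentages without the % sign). The two demo rows are invented, and the FastQC criterion still has to be checked by eye.

```bash
# qc_filter TABLE: append a "qc" column with pass/fail according to the thresholds above
qc_filter() {
    awk -F '\t' '
        NR == 1 { print $0 "\tqc"; next }          # header line
        {
            pass = ($2 > 10000000 && $3 < 10 && $4 > 65 && $5 < 10)
            print $0 "\t" (pass ? "pass" : "fail")
        }' "$1"
}

# Made-up summary table with two samples.
demo_summary=$(mktemp)
printf 'sample_id\tclean_reads\trRNA_ratio\tgenome_ratio\tother_ratio\n' > "$demo_summary"
printf 'sample_A\t12000000\t5.2\t80.1\t3.0\n' >> "$demo_summary"
printf 'sample_B\t8000000\t15.0\t70.0\t4.0\n' >> "$demo_summary"

qc_filter "$demo_summary"   # sample_A passes; sample_B fails (too few reads, rRNA too high)
```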

4) Tips

  • Terminal theme: grey background, Monaco 18 (a monospaced font, suitable for programming)

  • Snakemake is a workflow management system for creating reproducible and scalable data analyses.

    • Here is a Snakemake example for identifying circRNA, location: /BioII/lulab_b/shared/shared_scripts/Snakemake/examples.Identify_circRNA/
    • We share our Snakemake package for basic exRNA-seq analyses on GitHub.

5) Homework

Level 1
  1. Learn more about Illumina RNA-seq types, library preparation and analysis methods.
  2. Process the hcc_example samples by following this tutorial step by step, and check your results (sam or bam files) against ours (/home/wangsiqi/project/hcc_example) using the diff command.
Level 2
  • Summarize real data (61 Lulab HCC cell-free samples; see details on GitHub):
    1. Data summary table (ref: 3.1.1).
    2. Mapping ratio plot (ref: 3.1.2).
    3. Reads length distribution plot (ref: 3.2.2).
    4. Filter low quality samples (ref: 3.3).

Level 3

  • Compare the fasta file you just generated from our gtf file with the miRBase 22 fasta file (download link). Explain the difference. For example, I got this:
grep  hsa  ~/Downloads/hairpin.fa |wc -l
grep -c ">" RNAs_fasta/miRNA.fa
  • Remove the 5' adapter from the sample_R2.fastq file and compare the result with the R1 clean reads. Explain the potential reasons if they are inconsistent.

Here is a figure illustrating the difference between Read 1 and Read 2 in paired-end sequencing.
