1. Preprocessing, Mapping and QC

How to process RNA-seq data.

0) Understand your data

Before processing, make sure you have a correct understanding of what data you are analysing.

Basic information

Sequencing date

Contributor: department or labratory

Accession: GEO/SRA/PRJNA numbers in ncbi or other dataset ID(clear and brief)

Data source: download path or home-made

sample information

Organism

Disease state: cancer type or healthy

Tissue or other source (cell)

Molecular: DNA or RNA, total RNA , polyA RNA, or small RNA ?

Sequencing strategy

Sequencing platform

Library layout: Single-end V.S. Paired-end ?

Insert length: 50/100/150 ?

Strand specific ?

Size selection ?

Enrichment: Poly-A enriched or total RNA(ribosomal RNA removed) ?

Cellular localization: whole cell or intranuclear of cytoplasmic ?

When analyses published datasets, you can get all these information from ncbi website.

Our example HCC exRNA-seq data

Basic information

Sequencing date: 2017.11-
Contributor: Lulab Tsinghua University
Accession: NA
Data source: home-made

sample information

Organism: Homo sapiens
Disease state: hepatocellular carcinoma(HCC)
Tissue or other source: plasma
Molecular: total RNA(fragment)

Sequencing strategy

Sequencing platform: Illunima HiSeq X
Library layout: Paired-end 150
Insert length(RNA length): ~30 nt
Strand specific: Yes
Size selection: none
Enrichment: total RNA without ribosomal RNA removed
Cellular localization: whole extracellular circulating RNA

Flow chart of exRNA-seq processing method

Input/output of each procedure

Step	Input	Tool/script	Output	Note
1.Preprocessing	00.rawdata/*.fastq	-	02.rRNA//.no_rRNA.fastq	~/projects/exRNA/ hcc_examples/
1.1 index	~/genome/human_hg38/gtf/*.gtf	-	~/genome/human_hg38/index/bowtie2_index/*.bt2	-
1.1.1 .gtf to .fa	~/genome/human_hg38/sequence/GRCh38.p12.genome.fa ~/genome/human_hg38/gtf/*.gtf	bedtools	*.fa	-
1.1.2 .fa to .bt2	*.fa	bowtie2-build	*.bt2	-
1.2 fastqc	00.rawdata/*.fastq	fastqc	00.rawdata/*_fastqc.html	check raw reads' quality
1.3 Remove 3' adapter and trim reads	00.rawdata/*.fastq	cutadapt	01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq	trim low quality ends (plus a hard cutoff: 50nt)
1.4 2nd fastqc	01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq	fastqc	01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3_fastqc.html	make sure the low quality reads have been removed and/or trimmed
1.5 Remove ribosomal RNA	01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq + index file	bowtie2	02.rRNA/sampleID/sampleID.rRNA.sam 02.rRNA/sampleID/sampleID.no_rRNA.fastq	-
2.Mapping	02.rRNA/sampleID/sampleID.no_rRNA.fastq + index files	bowtie2	03.mapping/sampleID/01.miRNA/sampleID.miRNA.sam 03.mapping/sampleID/01.miRNA/sampleID.no_miRNA.fastq -> 03.mapping/sampleID/02.piRNA/sampleID.piRNA.sam 03.mapping/sampleID/02.piRNA/sampleID.no_piRNA.fastq -> ...	map to different RNA annotations step by step

1) Preprocessing

Quality control of raw reads, and extract the clean RNA sequence.

0. File format

0.1 Annotation file format (.gtf or .gff)

The Gene transfer format (GTF) is a tab-delimited file format used to hold information about gene annotation.

Example

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

You can get Gencode27 annotation files at: /BioII/lulab_b/shared/genomes/human_hg38/gtf/
Annotation file(.gtf/.gff) download source or prossed method, please refer to README file under this directory.

0.2 FASTQ Format (.fastq or .fq)

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

Example

@EAS54_6_R1_2_1_413_324

CCCTTCTTGTCTTCAGCGTTTCTCC

+

;;3;;;;;;;;;;;;7;;;;;;;88

Example file location

/BioII/lulab_b/shared/projects/exRNA/hcc_examples/

Present working directory:

your own project directory: ~/projects/exRNA/hcc_examples/

Create symbolic links:

mkdir 00.rawdata
cd 00.rawdata
ln -s /BioII/lulab_b/shared/projects/exRNA/hcc_examples/fastq/* .

Create a file under ~/projects/exRNA/hcc_examples/ to record name of these files, named as "sample_name", like:

AfterSurgery_1
AfterSurgery_2
AfterSurgery_3
BeforeSurgery_1
BeforeSurgery_2
BeforeSurgery_3
NC_1
NC_2
NC_3

1.1 build index

1.1.1 Link genomic annotation file to your home directory

Present working directory: your own home.

mkdir -p genome/human_hg38/index/bowtie2_index
mkdir genome/human_hg38/gtf
mkdir genome/human_hg38/sequence
ln -s /BioII/lulab_b/shared/genomes/human_hg38/gtf/* ./genome/human_hg38/gtf/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/sequence/GRCh38.p12.genome.fa ./genome/human_hg38/sequence/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/index/bowtie2_index/bowtie2_hg38_index/ genome/human_hg38/index/bowtie2_index/

1.1.2 Get RNA sequence(.fa) from annotation file(.gtf)

Use rRNA as example.

Tool: bedtools - getfasta
Annotation file: rRNA_exon.gtf

Present working directory: ~/genome/human_hg38/index/bowtie2_index/

mkdir 00.bowtie2_rRNA_index
cd 00.bowtie2_rRNA_index/
bedtools getfasta -fi ../../../sequence/GRCh38.p12.genome.fa -bed ../../../gtf/rRNA_exon.gff -fo rRNA.fa

1.1.3 Build bowtie2 index

Use rRNA as example.

Present working directory: ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index

bowtie2-build rRNA.fa rRNA

Output files:

rRNA.1.bt2 
rRNA.2.bt2  
rRNA.3.bt2  
rRNA.4.bt2 
rRNA.rev.1.bt2  
rRNA.rev.2.bt2

NOTE: Use exon sequences to build index.

1.2 fastqc

Use tool fastqc to test the quality of raw reads.

fastqc AfterSurgery_1.fastq

Processing multiple files.

Write the following command to "fastqc.sh":

#! /bin/bash

for i in `cat ../sample_name`
do
fastqc ${i}.fastq;
done

Run "fastqc.sh"

bash fastqc.sh &

Check the fastqc results

Confirm the fastqc results files *_fastqc.html and *_fastqc.zip;
Copy *_fastqc.html to your computer and open in a web browser;
Check the quality of raw reads;
Filter the low quality samples, which median of "per base sequence quality" less than 28(median line under green area).

fastqc results.

1.3 Remove adapters and trim reads

The raw reads always with adapters, which should be removed to get clean reads.

Use cutadapt toolkit to remove adapters and trim some low quality reads.
_Present working directory: ~/projects/exRNA/hcc_examples/

  mkdir -p 01.cutAdapter/NC_1
  cutadapt -u -100 -q 30,30 --trim-n -m 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --too-short-output=./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.TooShort.fastq -o ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq ./00.rawdata/NC_1/NC_1.fastq >./01.cutAdapter/NC_1/NC_1.cutAdapt3.log

Above is for one sample, you can use bash script to process multiple samples.

You should check the cutadapt log file (sample.cutAdapt3.log) to get information about adapter removed reads.
cutadapt log file.

Get more information about cutadapt toolkit, click here: [cutadapt.]
(http://cutadapt.readthedocs.io/en/stable/index.html)

1.4 2nd fastQC

A second FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent.

Run fastqc of the adapter removed reads

fastqc ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq

Differences in results with 1st fastQC:

The per-base quality scores should be different.
The Sequencing adapters are no longer identified as over-represented.

1.5 Remove ribosomal RNA mapped reads

Why remove ribosomal RNA ?

Ribosomal RNA(rRNA) hold more than 80-90% of total RNA.
Large proportions of rRNA will have an effect on the usable number of effective reads obtained from the samples.
rRNA over-expressed samples should be filtered.

Toolkit: Bowtie2
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning short reads.

Mapping reads to rRNA

Present working directory: ~/projects/exRNA/hcc_examples

mkdir -p 02.rRNA/NC_1
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index/rRNA -S 02.rRNA/NC_1/NC_1.rRNA.sam --un ./02.rRNA/NC_1/NC_1.no_rRNA.fastq ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq > 02.rRNA/NC_1/NC_1.rRNA.log

If want get a .bam file output, use samtools:

samtools view -bS in.sam >out.bam

Check the log file (NC_1.rRNA.log) to calculate the rRNA mapping ratio.
bowtie2 log file.

Above is for one sample, you can use bash script to process multiple samples.

2) Mapping

Mapping reads to human different RNA types.

Input: 02.rRNA/*/*.no_rRNA.fastq
Tool: bowtie2.
Working directory:~/projects/exRNA/hcc_examples/

mkdir -p 03.mapping/NC_1

Ordered mapping to avoid multiple assign reads to different RNA types:
- 01.miRNA
- 02.piRNA
- 03.Y RNA
- 04.snRNA
- 05.srpRNA
- 06.tRNA
- 07.lncRNA
- 08.mRNA
- 09.other human genome
- 10.non-human genome
Use the unmapped .fastq file from last step as input to the ordered mapping.
Methods: you can follow "1.4 mapping reads to rRNA" above.

01. Use rRNA removed reads as input to mapping miRNA

mkdir 03.mapping/NC_1/01.miRNA
bowtie2 --sensitive-local  -x ~/genome/human_hg38/index/bowtie2_index/01.bowtie2_miRNA_index/miRNA -S 03.mapping/NC_1/01.miRNA/NC_1.miRNA.sam --un ./03.mapping/NC_1/01.miRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/NC_1.no_rRNA.fastq  > 03.mapping/NC_1/01.miRNA/NC_1.miRNA.log

02. Use miRNA removed reads as input to mapping piRNA

mkdir 03.mapping/NC_1/02.piRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/02.bowtie2_piRNA_index/piRNA -S 03.mapping/NC_1/02.piRNA/NC_1.piRNA.sam --un ./03.mapping/NC_1/02.piRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/02.piRNA/NC_1.no_rRNA.fastq > 03.mapping/NC_1/02.piRNA/NC_1.miRNA.log

03.Y_RNA, 04.snRNA, 05.srpRNA, 06.tRNA, 07.lncRNA, 08.mRNA, 09.hg38

Follow the methods above.
Reads can't mapped to these 8 types RNA would be assigned as "other human genome".
Reads can't mapped to human genome would be assigned as "non-human".

3) QC (Quality Control)

When processing the real data from some experiments, there are many things can affect the RNA-seq results. So it is important to take all possibilities into consideration, and filter the low quality sample in downstream analyses.

After preprocessing and mapping, you can do some basic statistics of the prossesed datasets.

3.1 Data summary

3.1.1 Summarize the pre-processing and mapping results before downstream analyses.

The summary should contain the following part at least :

Sample_info

sample_id
sequenced date or batch number
Raw reads number (raw fastq file or cutadapt log file)

Preprocessing

Clean reads number and percentage of raw reads (cutadapt log file)
rRNA number and percentage of clean reads (rRNA mapping log file of bowtie2)
kept reads (rRNA mapping log file of bowtie2)

Mapping ratio

human genome (human genome mapping log file of bowtie2)
miRNA
piRNA
Y RNA
snRNA
srpRNA
tRNA
lncRNA
mRNA
Other human genome region (difference between whole genome and mapped RNA)
Non-human genome (human genome mapping log file of bowtie2)

Reference. You can combine some bash command or use your own code to summarize the data processing information. Here we shared a simple bash script for your reference. Github link.

3.1.2 Visualize the reads length distribution by scatter plot/bar chart.

Reference.

3.1.3 Check whether the mapping ratio are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.

You can combine some bash command or use your own code to summarise reads length. Here we shared a simple bash script for your reference.

3.2 Reads length distribution

3.2.1 Calculate the reads length :

raw reads (raw fastq file).
clean reads (rRNA removed and low-quality trimmed reads).
mapped reads of miRNA, piRNA, Y RNA, snRNA, srpRNA, tRNA, lncRNA, mRNA (mapped result sam file).

NOTE: The sam/bam files record mapping information of all reads, include mapped and unmapped. Take care of the "flag" in 2nd column of sam/bam file. file format.

You can combine some bash command or use your own code to summarize reads length. Here we shared a simple bash script for your reference. Github link.

3.2.2 Visualize the reads length distribution by scatter plot/bar chart.

Reference.

3.2.3 Check whether the reads length distribution are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.

3.3 Quality control criteria:

You can define a set of criteria to QC and filter your data.
Here we shared our QC criteria for your reference.

Check point	Threshold	Input
Raw reads quality	reads quality >28 (median lines in green area)	Check fastqc results(*.html)
Clean reads number	> 10 million	cutadapt log file
rRNA ratio	< 10%	bowtie2 rRNA mapping log file
human genome mapped ratio	> 65%	bowtie2 human genome mapping log file
other genome region ratio	< 10%	difference between whole genome and mapped RNA

4) Tips

Terminal theme: grey background Monaco 18 (aequilate font, suitable for programming)
Snakemake a good workflow management system to create reproducible and scalable data analyses. Snakemake.
- Here is a Snakemake example to identify circRNA,location: /BioII/lulab_b/shared/shared_scripts/Snakemake/examples.Identify_circRNA/
- We shared our snakemake package used to exRNA-seq basic analyses.Github link.

5) Homework

Level 1

Learn more about Illunima RNA-seq types, library preparation and analysis methods.
Run hcc_example samples data by following this tutorial step by step, and check your results (sam or bam files) by comparing with our results (/home/wangsiqi/project/hcc_example) using diff command.

Level 2

Summarize real data (61 samples of Lulab HCC cell-free see details in github) :
1. Data summary table (ref: 3.1.1).
2. Mapping ratio plot (ref: 3.1.2).
3. Reads length distribution plot (ref: 3.2.2).
4. Filter low quality samples (ref: 3.3).

Level 3

Comparing the fasta file you just generated from our gtf file with miRbase 22 fasta file ( downloading link). Explain the difference. For example, I got this:

grep  hsa  ~/Downloads/hairpin.fa |wc -l
1917
grep -c ">" RNAs_fasta/miRNA.fa
1869

Remove 5` adapter from sample_R2.fastq file, compare with R1 clean reads. Explain the potential reason if they are inconsistent.

Here is a figure illustrating the difference between Read 1 and Read 2 in a paired sequencing reads.

1.Preprocessing, Mapping and QC