bash: How to set up multiple jobs as a pipeline

1. Creating a script

You may want to create a directory to hold your scripts:

mkdir ~/scripts

and add this path to PATH in your .bashrc or .bash_profile.
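As a minimal sketch of this step, the lines below create the directory and prepend it to PATH for the current session only. The commented line shows one way to make the change permanent, assuming your shell reads ~/.bashrc or ~/.bash_profile at startup.

```bash
# Create the scripts directory (no error if it already exists)
mkdir -p "$HOME/scripts"

# Make scripts in that directory callable by name in the current session
PATH="$HOME/scripts:$PATH"
export PATH

# To make this permanent, add the following line to ~/.bashrc or ~/.bash_profile:
#   export PATH="$HOME/scripts:$PATH"
echo "$PATH"
```

After this, a script saved as ~/scripts/myBash1.sh with execute permission can be started by typing myBash1.sh from any directory.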

To avoid confusion with other file types, bash script names often end in ".sh".

Create an example bash script:

vi ~/scripts/myBash1.sh

#!/bin/bash
clear
echo "This is information provided by mysystem.sh.  Program starts now."

echo "Hello, $USER"

echo "Today's date is `date`, this is week `date +"%V"`."
echo "This is `uname -s` running on a `uname -m` processor."

echo "This is the uptime information:"
uptime

echo "I'm creating two variables"
USERS=`uptime | cut -d "," -f 3`
VALUE="4"
echo "There are$USERS have used this computer."
echo "This is the number: $VALUE"

(Tip: if you use vim, you may want to turn on syntax highlighting by typing ":syntax enable" in vim. Add this setting to your .vimrc file to make it permanent.)

2. Running a script

The script should have execute permissions for the correct owners in order to be runnable.

chmod u+x ~/scripts/myBash1.sh

Run the script with ~/scripts/myBash1.sh, bash ~/scripts/myBash1.sh, or bash -x ~/scripts/myBash1.sh (the -x option prints each command before it is executed, which helps with debugging).
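The difference between a normal run and a traced run can be sketched as follows. The throwaway script and its content are hypothetical, written to a temporary file so that nothing in ~/scripts is touched.

```bash
# Write a tiny throwaway script to a temporary file (hypothetical content)
script=$(mktemp)
printf '%s\n' '#!/bin/bash' 'echo "Hello from a script"' > "$script"

# Give the owner execute permission, as in the chmod step above
chmod u+x "$script"

# Normal run: only the script's own output appears
plain=$(bash "$script")

# Traced run: bash -x writes each command to stderr, prefixed with "+"
trace=$(bash -x "$script" 2>&1 >/dev/null)

echo "plain: $plain"
echo "trace: $trace"

# Clean up the temporary file
rm -f "$script"
```

The trace goes to standard error, so it can be redirected separately from the script's regular output.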

> ~/scripts/myBash1.sh
This is information provided by mysystem.sh. Program starts now.

Hello, liyang

Today's date is Wed Mar 21 22:07:26 CST 2018, this is week 12.
This is Darwin running on a x86_64 processor.

This is the uptime information:
22:07  up 30 days, 12:42, 14 users, load averages: 1.41 1.67 1.74

I'm creating two variables
There are 14 users have used this computer.
This is the number: 4

3. Bash basics

3.1 Variables (page 299)

To set a variable in the shell, use

VARNAME="value"

Setting and exporting is usually done in one step:

export VARNAME="value"
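The effect of export can be seen by asking a child shell what it inherits. This is a small sketch; the variable names LOCAL_ONLY and SHARED are made up for illustration.

```bash
# A plain shell variable: visible in this shell only
LOCAL_ONLY="visible here"

# An exported variable: also passed to child processes
export SHARED="visible in children"

# Ask a child bash process what it can see; ${VAR:-unset} prints "unset" if VAR is empty or unset
child_local=$(bash -c 'echo "${LOCAL_ONLY:-unset}"')
child_shared=$(bash -c 'echo "${SHARED:-unset}"')

echo "child sees LOCAL_ONLY: $child_local"
echo "child sees SHARED: $child_shared"
```

Scripts started from the shell are child processes too, so a script only sees variables that were exported.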

3.2 Quoting characters (page 327)

Escape characters:

echo $date
echo \$date

Single quotes:

echo '$date'

Double quotes:

echo "$date"
echo "`date`"
echo "I said: \"Hello World!\""
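To compare the three quoting mechanisms side by side, the sketch below first gives the variable a value. The value assigned to date is hypothetical and only serves the illustration.

```bash
date="2018-03-21"              # hypothetical value, used only for illustration

unquoted=$(echo $date)         # no quoting: the shell expands $date
escaped=$(echo \$date)         # backslash: the $ is taken literally
single=$(echo '$date')         # single quotes: everything is literal
double=$(echo "$date")         # double quotes: $ still expands
speech=$(echo "I said: \"Hello World!\"")   # \" gives a literal double quote

echo "unquoted: $unquoted"
echo "escaped:  $escaped"
echo "single:   $single"
echo "double:   $double"
echo "speech:   $speech"
```

Only the escaped and single-quoted forms print the text $date itself; the other two print the stored value.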

3.3 Shell expansion (page 325)

Brace expansion {}:

echo sp{el,il,al}l

Variable expansion $:

echo $SHELL
echo ${FRANKY:=Franky}

Command substitution:

echo $(date)
echo `date`

Arithmetic expansion:

echo $((365*24))
echo $[365*24]
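A few of these expansions have side effects or nesting rules that are easier to see in a short sketch. The path /a/b/c is a made-up string; no such directory needs to exist.

```bash
# ${VAR:=default} expands to the default AND assigns it when VAR is unset or empty
unset FRANKY
first=${FRANKY:=Franky}     # FRANKY was unset, so it becomes "Franky"
second=${FRANKY:=Ignored}   # FRANKY is already set, so the default is ignored

# $(...) can be nested without extra escaping, unlike backticks
parent=$(basename "$(dirname /a/b/c)")

# Arithmetic expansion works on integers
hours=$((365*24))

echo "first=$first second=$second FRANKY=$FRANKY"
echo "parent=$parent"
echo "hours=$hours"
```

The $((...)) form is the one defined by POSIX; $[...] is an older bash spelling of the same expansion.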

3.4 Regular expressions (page 346)

Operator	Effect
.	Matches any single character.
?	The preceding item is optional and will be matched, at most, once.
*	The preceding item will be matched zero or more times.
+	The preceding item will be matched one or more times.
{N}	The preceding item is matched exactly N times.
{N,}	The preceding item is matched N or more times.
{N,M}	The preceding item is matched at least N times, but not more than M times.
-	Represents a range if it's not first or last in a list or the ending point of a range in a list.
^	Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.
$	Matches the empty string at the end of a line.
\b	Matches the empty string at the edge of a word.
\B	Matches the empty string provided it's not at the edge of a word.
\<	Matches the empty string at the beginning of a word.
\>	Matches the empty string at the end of a word.
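Some of the operators in the table can be tried with grep -E on a small word list. This is a sketch only; the five words and the helper name count_matches are made up for the illustration.

```bash
# Count how many of five sample words match an extended regular expression
count_matches () {
   printf '%s\n' cat cart caat ct concat | grep -cE "$1"
}

optional=$(count_matches '^ca?t$')     # ? : "a" is optional        -> cat, ct
one_plus=$(count_matches '^ca+t$')     # + : one or more "a"        -> cat, caat
exactly2=$(count_matches '^ca{2}t$')   # {N} : exactly two "a"      -> caat
any_char=$(count_matches '^c.t$')      # . : any single character   -> cat
at_end=$(count_matches 'cat$')         # $ : anchored at line end   -> cat, concat

echo "$optional $one_plus $exactly2 $any_char $at_end"
```

The ^ and $ anchors matter here: without them, a pattern such as ca?t would also match inside longer words like concat.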

3.5 grep, awk, sed, pipe(|), cut, sort, uniq, join, cat, paste (Week 1)

3.6 Conditional statements (page 379)

general:

Primary	Meaning
[ -a FILE ]	True if FILE exists.
[ -b FILE ]	True if FILE exists and is a block-special file.
[ -c FILE ]	True if FILE exists and is a character-special file.
[ -d FILE ]	True if FILE exists and is a directory.
[ -e FILE ]	True if FILE exists.
[ -f FILE ]	True if FILE exists and is a regular file.
[ -g FILE ]	True if FILE exists and its SGID bit is set.
[ -h FILE ]	True if FILE exists and is a symbolic link.
[ -k FILE ]	True if FILE exists and its sticky bit is set.
[ -p FILE ]	True if FILE exists and is a named pipe (FIFO).
[ -r FILE ]	True if FILE exists and is readable.
[ -s FILE ]	True if FILE exists and has a size greater than zero.
[ -t FD ]	True if file descriptor FD is open and refers to a terminal.
[ -u FILE ]	True if FILE exists and its SUID (set user ID) bit is set.
[ -w FILE ]	True if FILE exists and is writable.
[ -x FILE ]	True if FILE exists and is executable.
[ -O FILE ]	True if FILE exists and is owned by the effective user ID.
[ -G FILE ]	True if FILE exists and is owned by the effective group ID.
[ -L FILE ]	True if FILE exists and is a symbolic link.
[ -N FILE ]	True if FILE exists and has been modified since it was last read.
[ -S FILE ]	True if FILE exists and is a socket.
[ FILE1 -nt FILE2 ]	True if FILE1 has been changed more recently than FILE2, or if FILE1 exists and FILE2 does not.
[ FILE1 -ot FILE2 ]	True if FILE1 is older than FILE2, or if FILE2 exists and FILE1 does not.
[ FILE1 -ef FILE2 ]	True if FILE1 and FILE2 refer to the same device and inode numbers.
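These primaries are normally used inside an if statement. The sketch below combines -d, -s and -f in an if/elif/else chain; the helper name describe and the temporary files are made up for the illustration. Note the spaces after "[" and before "]", which are required.

```bash
# Set up a temporary directory with one empty and one non-empty file
workdir=$(mktemp -d)
touch "$workdir/empty.txt"
echo "some data" > "$workdir/full.txt"

# Classify a path using file test primaries
describe () {
   if [ -d "$1" ]; then
      echo "directory"
   elif [ -s "$1" ]; then
      echo "non-empty file"
   elif [ -f "$1" ]; then
      echo "empty file"
   else
      echo "missing"
   fi
}

r_dir=$(describe "$workdir")
r_full=$(describe "$workdir/full.txt")
r_empty=$(describe "$workdir/empty.txt")
r_missing=$(describe "$workdir/nope.txt")
echo "$r_dir | $r_full | $r_empty | $r_missing"

# Clean up
rm -rf "$workdir"
```

The order of the tests matters: a directory also has a size greater than zero on most systems, so -d is checked before -s.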

for loop:

for NAME [in LIST ]; do COMMANDS; done

while loop:

while CONTROL-COMMAND; do CONSEQUENT-COMMANDS; done

until loop:

until TEST-COMMAND; do CONSEQUENT-COMMANDS; done

break and continue
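The three loop forms, together with break and continue, can be sketched as follows. The sample names mirror the example later in this document; the numbers are arbitrary.

```bash
# for loop with continue: skip one item and keep the rest
kept=""
for s in sample1 sample2 sample3 sample4 sample5; do
   if [ "$s" = "sample3" ]; then
      continue                 # jump to the next item
   fi
   kept="${kept:+$kept }$s"    # append, separated by a space
done

# while loop: runs as long as the test succeeds
n=0
while [ "$n" -lt 3 ]; do
   n=$((n+1))
done

# until loop with break: runs until the test succeeds, unless break ends it first
m=10
until [ "$m" -le 0 ]; do
   m=$((m-3))
   if [ "$m" -lt 5 ]; then
      break                    # leave the loop early
   fi
done

echo "kept: $kept"
echo "n=$n m=$m"
```

Here the until loop never reaches its own end condition, because break stops it as soon as m drops below 5.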

3.7 Functions

FUNCTION () { COMMANDS; }
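Inside a function, arguments are available as $1, $2 and so on, and the exit status of the last command becomes the function's status. A small sketch, with the function names percent and is_positive made up for the illustration:

```bash
# Print NUMERATOR / DENOMINATOR as a percentage with two decimals
percent () {
   awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}

# Succeed (status 0) only if the integer argument is greater than zero
is_positive () {
   [ "$1" -gt 0 ]
}

quarter=$(percent 1 4)
echo "1 of 4 is $quarter%"

if is_positive 5; then
   echo "5 is positive"
fi
```

A function must be defined before the first line that calls it, which is why helper functions usually sit at the top of a script.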

4. Example

Suppose you have sequencing data for 5 samples and want to check the mapping quality of each sample based on its output log. The directory layout and an output log look like this:

> ll
drwxr-xr-x@ 7 liyang  staff  238 Mar 22 01:08 01.input
drwxr-xr-x@ 3 liyang  staff  102 Mar 22 01:24 02.output
-rw-r--r--  1 liyang  staff  145 Mar 22 01:27 README
drwxr-xr-x  3 liyang  staff  102 Mar 22 01:26 bin

> cd 01.input
> ls
sample1 sample2 sample3 sample4 sample5

> cd sample1

> cat Log.final.out
                                 Started job on |       Dec 22 13:14:21
                             Started mapping on |       Dec 22 14:10:01
                                    Finished on |       Dec 22 14:49:53
       Mapping speed, Million of reads per hour |       29.16

                          Number of input reads |       19372132
                      Average input read length |       51
                                    UNIQUE READS:
                   Uniquely mapped reads number |       17395896
                        Uniquely mapped reads % |       89.80%
                          Average mapped length |       50.93
                       Number of splices: Total |       5038
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       3677
                       Number of splices: GC/AG |       158
                       Number of splices: AT/AC |       8
               Number of splices: Non-canonical |       1195
                      Mismatch rate per base, % |       0.25%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.48
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.36
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1080034
             % of reads mapped to multiple loci |       5.58%
        Number of reads mapped to too many loci |       189582
             % of reads mapped to too many loci |       0.98%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       2.54%
                     % of reads unmapped: other |       1.10%

Based on this information, write a bash script that extracts the number of input reads, the number of uniquely mapped reads and the number of multi-mapped reads, and generates a summary file. The uniquely mapped ratio is usually used to measure mapping quality. Samples that do not pass the cutoff should be labeled.

#!/bin/bash
# Abort on unset variables and on any command that fails
set -o nounset
set -o errexit

#echo "$OPTIND start at $OPTIND"
# Parse the options; the leading ":" lets the case branches handle errors
while getopts ":i:o:n:p:" optname; do
   case $optname in
      i)
         input="$OPTARG";;
      o)
         outputDir="$OPTARG";;
      n)
         cutoff="$OPTARG";;
      p)
         prefix="$OPTARG";;
      \?)
         # Unknown option; "?" must be escaped, otherwise it matches any single character
         echo "Usage: $(basename "$0") -i input -o outputDir -n cutoff -p prefix";;
      :)
         echo "No argument value for option $OPTARG";;
   esac
#   echo "$OPTIND is now $OPTIND"
#   echo $#
done

# Continue only when all four options and their values were given (8 arguments)
if [ $# -eq 8 ]; then
   # Strip a trailing slash from the output directory
   outputDir="${outputDir%/}"
   echo "The input file is $(basename "${input}")"
   echo "The output directory is ${outputDir}"
   echo "The cutoff is ${cutoff}"
   echo "The prefix for the output file is ${prefix}"

   # Get values from the input file (the value is the second tab-separated field)
   totalN=$(grep 'Number of input reads' "${input}" | cut -f 2)
   uniqN=$(grep 'Uniquely mapped reads number' "${input}" | cut -f 2)
   ratio=$(bc <<< "scale=4; $uniqN/$totalN")
   multiN=$(grep 'Number of reads mapped to multiple loci' "${input}" | cut -f 2)

   # Label the sample according to the uniquely mapped ratio
   if (( $(echo "$ratio > $cutoff" | bc -l) )); then
      echo "The mapping result passes the cutoff"
      label="pass"
   else
      echo "The mapping result does NOT pass the cutoff"
      label="fail"
   fi

   # Append one line per sample: total, unique, multi, unique ratio, mapped, mapped ratio, label
   echo -e "${totalN}\t${uniqN}\t${multiN}" | awk -v label="$label" 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,label}' >> "$outputDir/$prefix.sta"

   echo "Job finished!"
fi
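The extraction step of the script can be tried in isolation. The sketch below rebuilds two lines of the log in memory, assuming (as the use of cut -f 2 implies) that the label and the value are separated by a tab. It uses awk rather than bc for the division, which is a substitution made here so the sketch does not depend on bc being installed.

```bash
# Two log lines rebuilt in memory; label and value are assumed to be tab-separated
log=$(printf '%s |\t%s\n' \
   'Number of input reads' 19372132 \
   'Uniquely mapped reads number' 17395896)

# grep selects the line, cut -f 2 takes the second tab-separated field
totalN=$(printf '%s\n' "$log" | grep 'Number of input reads' | cut -f 2)
uniqN=$(printf '%s\n' "$log" | grep 'Uniquely mapped reads number' | cut -f 2)

# Uniquely mapped ratio, rounded to four decimals
ratio=$(awk -v u="$uniqN" -v t="$totalN" 'BEGIN { printf "%.4f\n", u / t }')

echo "total=$totalN unique=$uniqN ratio=$ratio"
```

With these numbers the ratio is just below 0.9, so sample1 would be labeled as failing a cutoff of 0.9.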

Run the script:

for i in `ls 01.input/`; do
   echo $i
   bash bin/sta.sh -i 01.input/$i/Log.final.out -o 02.output/ -n 0.9 -p summary
done

Homework

level 1: type the code on your computer and make sure you understand what each command does.

download link: Week_2_files/bash_example.zip

level 2: write a bash script that checks the md5 of the files in a folder and outputs the name of the truncated file.

> cd homework/checkMD5/
> ls
file1      file2      file3      file4      file5      md5sum.txt

> cat md5sum.txt
MD5 (./file1) = 15e8d15469ba992caac192b8684396a3
MD5 (./file2) = 85dc26fde0917b4dbba6339607b63d31
MD5 (./file3) = 62fdd313368e8125115ef53a3caaf95b
MD5 (./file4) = d09c83de50bb2da6c0c824fe44ba887a
MD5 (./file5) = db648585dd242f899818247643e599dd

download link: Week_2_files/homework/checkMD5.zip

level 3: write a bash script of your own.

Reference

https://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html
