bash: How to set up multiple jobs as a pipeline

1. Creating a script

You may like to create one directory like to hold your scripts

mkdir ~/scripts, and to put this path to .bashrc or .bash_profile

In order to ensure that no confusion can rise, bash script names often end in ".sh".

create one example bash scripts

vi ~/scripts/myBash1.sh

#!/bin/bash
clear
echo "This is information provided by mysystem.sh.  Program starts now."

echo "Hello, $USER"

echo "Today's date is `date`, this is week `date +"%V"`."
echo "This is `uname -s` running on a `uname -m` processor."

echo "This is the uptime information:"
uptime

echo "I'm creating two variables"
USERS=`uptime | cut -d "," -f 3`
VALUE="4"
echo "There are$USERS have used this computer."
echo "This is the number: $VALUE"

(Tips: if you use vim, you may like to activate syntax highlighting, type ":syntax enable" in vim, you can add this setting to your .vimrc file to make it permanent.)

2. Running a script

The script should have execute permissions for the correct owners in order to be runnable.

chmod u+x ~/scripts/myBash1.sh

type ~/scripts/myBash1.sh, bash ~/scripts/myBash1.sh or bash -x ~/scripts/myBash1.sh to run the script.

> ~/scripts/myBash1.sh
This is information provided by mysystem.sh. Program starts now.

Hello, liyang

Today's date is Wed Mar 21 22:07:26 CST 2018, this is week 12.
This is Darwin running on a x86_64 processor.

This is the uptime information:
22:07  up 30 days, 12:42, 14 users, load averages: 1.41 1.67 1.74

I'm creating two variables
There are 14 users have used this computer.
This is the number: 4

3. Bash basics

3.1 Variables (page 299)

To set a variable in the shell, use

VARNAME="value"

Setting and exporting is usually done in one step:

export VARNAME="value"

3.2 Quoting characters (page 327)

  • Escape characters:
echo $date
echo \$date
  • Single quotes:
echo '$date'
  • Double quotes:
echo "$date"
echo "date"
echo "I said: \"Hello World!\""

3.3 Shell expansion (page 325)

  • Brace expansion {}:
echo sp{el,il,al}l
  • Variable expansion $:
echo $SHELL
echo ${FRANKY:=Franky}
  • Command substitution:

    echo $(date)
    echo date
    echo date

  • Arithmetic expansion:

echo $((365*24))
echo $[365*24]

3.4 Regular expressions (page 346)

Operator Effect
. Matches any single character.
? The preceding item is optional and will be matched, at most, once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{N} The preceding item is matched exactly N times.
{N,} The preceding item is matched N or more times.
{N,M} The preceding item is matched at least N times, but not more than M times.
- represents the range if it's not first or last in a list or the ending point of a range in a list.
^ Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.
$ Matches the empty string at the end of a line.
\b Matches the empty string at the edge of a word.
\B Matches the empty string provided it's not at the edge of a word.
\< Match the empty string at the beginning of word.
\> Match the empty string at the end of word.

3.5 grep, awk, sed, pipe(|), cut, sort, uniq, join, cat, paste (Week 1)

3.6 Conditional statements (page 379)

  • general:
Primary Meaning
[-a FILE] True if FILE exists.
[-b FILE] True if FILE exists and is a block-special file.
[-c FILE] True if FILE exists and is a character-special file.
[-d FILE] True if FILE exists and is a directory.
[-e FILE] True if FILE exists.
[-f FILE] True if FILE exists and is a regular file.
[-g FILE] True if FILE exists and its SGID bit is set.
[-h FILE] True if FILE exists and is a symbolic link.
[-k FILE] True if FILE exists and its sticky bit is set.
[-p FILE] True if FILE exists and is a named pipe (FIFO).
[-r FILE] True if FILE exists and is readable.
[-s FILE] True if FILE exists and has a size greater than zero.
[-t FD] True if file descriptor FD is open and refers to a terminal.
[-u FILE] True if FILE exists and its SUID (set user ID) bit is set.
[-w FILE] True if FILE exists and is writable.
[-x FILE] True if FILE exists and is executable.
[-O FILE] True if FILE exists and is owned by the effective user ID.
[-G FILE] True if FILE exists and is owned by the effective group ID.
[-L FILE] True if FILE exists and is a symbolic link.
[-N FILE] True if FILE exists and has been modified since it was last read.
[-S FILE] True if FILE exists and is a socket.
[FILE1 -nt FILE2] True if FILE1 has been changed more recently than FILE2, or if FILE1 exists and FILE2 does not.
[FILE1 -ot FILE2] True if FILE1 is older than FILE2, or is FILE2 exists and FILE1 does not.
[FILE1 -ef FILE2] True if FILE1 and FILE2 refer to the same device and inode numbers.
  • for loop:

for NAME [in LIST ]; do COMMANDS; done

  • while loop:

while CONTROL-COMMAND; do CONSEQUENT-COMMANDS; done

  • until loop:

until TEST-COMMAND; do CONSEQUENT-COMMANDS; done

  • break and continue

3.7 Functions

FUNCTION () { COMMANDS; }

4. Example

Assuming that you have 5 sequencing data, you are trying to check the mapping quality for each sample based on the output log. The output log is looks like:

> ll
drwxr-xr-x@ 7 liyang  staff  238 Mar 22 01:08 01.input
drwxr-xr-x@ 3 liyang  staff  102 Mar 22 01:24 02.output
-rw-r--r--  1 liyang  staff  145 Mar 22 01:27 README
drwxr-xr-x  3 liyang  staff  102 Mar 22 01:26 bin

> cd 01.input
> ls
sample1 sample2 sample3 sample4 sample5

> cd sample1

> cat Log.final.out
                                 Started job on |       Dec 22 13:14:21
                             Started mapping on |       Dec 22 14:10:01
                                    Finished on |       Dec 22 14:49:53
       Mapping speed, Million of reads per hour |       29.16

                          Number of input reads |       19372132
                      Average input read length |       51
                                    UNIQUE READS:
                   Uniquely mapped reads number |       17395896
                        Uniquely mapped reads % |       89.80%
                          Average mapped length |       50.93
                       Number of splices: Total |       5038
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       3677
                       Number of splices: GC/AG |       158
                       Number of splices: AT/AC |       8
               Number of splices: Non-canonical |       1195
                      Mismatch rate per base, % |       0.25%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.48
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.36
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1080034
             % of reads mapped to multiple loci |       5.58%
        Number of reads mapped to too many loci |       189582
             % of reads mapped to too many loci |       0.98%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       2.54%
                     % of reads unmapped: other |       1.10%

Based on these information, you need to write one bash script to extract the number of input reads, uniquely mapped reads number and multi-mapped reads number, and generate one summary file. Usually, the uniquely mapped ratio were used to measure the mapping quality. The sample do not pass the criteria should be labeled.

#!/usr/bin/bash
set -o nounset
set -o errexit

#echo "$OPTIND start at $OPTIND"
while getopts ":i:o:n:p:" optname; do
   case $optname in
      i)
         input="$OPTARG";;
      o)
         outputDir="$OPTARG";;
      n)
         cutoff="$OPTARG";;
      p)
         prefix="$OPTARG";;
      ?)
         echo "Usage: `basename $0` -i input -o outputDir -n cutoff -p prefix";;
      :)
         echo "No argument value for option $OPTARG";;
   esac
#   echo "$OPTIND is now $OPTIND"
#   echo $#
done;

# Initialize variables
if [ $# -eq 8 ]; then
outputDir="${outputDir%*/}"
echo "The input file is "`basename ${input}`
echo "The output directory in "${outputDir}
echo "The cutoff is "${cutoff}
echo "the prefix for output file is "${prefix}

# get values from input file
totalN=`cat ${input} | grep 'Number of input reads' | cut -f 2`
uniqN=`cat ${input} | grep 'Uniquely mapped reads number' | cut -f 2`
ratio=`bc <<< "scale=4; $uniqN/$totalN"`
multiN=`cat ${input} | grep 'Number of reads mapped to multiple loci' | cut -f 2`

if (( $(echo "$ratio > $cutoff" | bc -l) )); then
   echo "The mapping result is pass the cutoff"
   echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"pass"}' >> $outputDir/$prefix.sta
else
   echo "The mapping result is NOT pass the cutoff"
   echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"fail"}' >> $outputDir/$prefix.sta
fi

echo "Job finished!"
fi

Run the script:

for i in `ls 01.input/`;

do echo $i;

bash bin/sta.sh -i 01.input/$i/Log.final.out -o 02.output/ -n 0.9 -p summary;

done

Homework

level 1: type the code in your computer and understand the meaning for each command.

download link: Week_2_files/bash_example.zip

level 2: try to write one bash script to check the md5 of files in folder and output the file name of the truncated file.

> cd homework/checkMD5/
> ls
file1      file2      file3      file4      file5      md5sum.txt

> cat md5sum.txt
MD5 (./file1) = 15e8d15469ba992caac192b8684396a3
MD5 (./file2) = 85dc26fde0917b4dbba6339607b63d31
MD5 (./file3) = 62fdd313368e8125115ef53a3caaf95b
MD5 (./file4) = d09c83de50bb2da6c0c824fe44ba887a
MD5 (./file5) = db648585dd242f899818247643e599dd

download link: Week_2_files/homework/checkMD5.zip

level 3: try to write one bash script your own.

Reference

https://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html

results matching ""

    No results matching ""