bash: How to set up multiple jobs as a pipeline
1. Creating a script
You may like to create one directory like to hold your scripts
mkdir ~/scripts
, and to put this path to .bashrc
or .bash_profile
In order to ensure that no confusion can rise, bash script names often end in ".sh".
create one example bash scripts
vi ~/scripts/myBash1.sh
#!/bin/bash
clear
echo "This is information provided by mysystem.sh. Program starts now."
echo "Hello, $USER"
echo "Today's date is `date`, this is week `date +"%V"`."
echo "This is `uname -s` running on a `uname -m` processor."
echo "This is the uptime information:"
uptime
echo "I'm creating two variables"
USERS=`uptime | cut -d "," -f 3`
VALUE="4"
echo "There are$USERS have used this computer."
echo "This is the number: $VALUE"
(Tips: if you use vim, you may like to activate syntax highlighting, type ":syntax enable" in vim, you can add this setting to your .vimrc
file to make it permanent.)
2. Running a script
The script should have execute permissions for the correct owners in order to be runnable.
chmod u+x ~/scripts/myBash1.sh
type ~/scripts/myBash1.sh
, bash ~/scripts/myBash1.sh
or bash -x ~/scripts/myBash1.sh
to run the script.
> ~/scripts/myBash1.sh
This is information provided by mysystem.sh. Program starts now.
Hello, liyang
Today's date is Wed Mar 21 22:07:26 CST 2018, this is week 12.
This is Darwin running on a x86_64 processor.
This is the uptime information:
22:07 up 30 days, 12:42, 14 users, load averages: 1.41 1.67 1.74
I'm creating two variables
There are 14 users have used this computer.
This is the number: 4
3. Bash basics
3.1 Variables (page 299)
To set a variable in the shell, use
VARNAME="value"
Setting and exporting is usually done in one step:
export VARNAME="value"
3.2 Quoting characters (page 327)
- Escape characters:
echo $date
echo \$date
- Single quotes:
echo '$date'
- Double quotes:
echo "$date"
echo "date"
echo "I said: \"Hello World!\""
3.3 Shell expansion (page 325)
- Brace expansion {}:
echo sp{el,il,al}l
- Variable expansion $:
echo $SHELL
echo ${FRANKY:=Franky}
Command substitution:
echo $(date)
echodate
echo dateArithmetic expansion:
echo $((365*24))
echo $[365*24]
3.4 Regular expressions (page 346)
Operator | Effect |
---|---|
. | Matches any single character. |
? | The preceding item is optional and will be matched, at most, once. |
* | The preceding item will be matched zero or more times. |
+ | The preceding item will be matched one or more times. |
{N} | The preceding item is matched exactly N times. |
{N,} | The preceding item is matched N or more times. |
{N,M} | The preceding item is matched at least N times, but not more than M times. |
- | represents the range if it's not first or last in a list or the ending point of a range in a list. |
^ | Matches the empty string at the beginning of a line; also represents the characters not in the range of a list. |
$ | Matches the empty string at the end of a line. |
\b | Matches the empty string at the edge of a word. |
\B | Matches the empty string provided it's not at the edge of a word. |
\< | Match the empty string at the beginning of word. |
\> | Match the empty string at the end of word. |
3.5 grep, awk, sed, pipe(|), cut, sort, uniq, join, cat, paste (Week 1)
3.6 Conditional statements (page 379)
- general:
Primary | Meaning |
---|---|
[-a FILE] | True if FILE exists. |
[-b FILE] | True if FILE exists and is a block-special file. |
[-c FILE] | True if FILE exists and is a character-special file. |
[-d FILE] | True if FILE exists and is a directory. |
[-e FILE] | True if FILE exists. |
[-f FILE] | True if FILE exists and is a regular file. |
[-g FILE] | True if FILE exists and its SGID bit is set. |
[-h FILE] | True if FILE exists and is a symbolic link. |
[-k FILE] | True if FILE exists and its sticky bit is set. |
[-p FILE] | True if FILE exists and is a named pipe (FIFO). |
[-r FILE] | True if FILE exists and is readable. |
[-s FILE] | True if FILE exists and has a size greater than zero. |
[-t FD] | True if file descriptor FD is open and refers to a terminal. |
[-u FILE] | True if FILE exists and its SUID (set user ID) bit is set. |
[-w FILE] | True if FILE exists and is writable. |
[-x FILE] | True if FILE exists and is executable. |
[-O FILE] | True if FILE exists and is owned by the effective user ID. |
[-G FILE] | True if FILE exists and is owned by the effective group ID. |
[-L FILE] | True if FILE exists and is a symbolic link. |
[-N FILE] | True if FILE exists and has been modified since it was last read. |
[-S FILE] | True if FILE exists and is a socket. |
[FILE1 -nt FILE2] | True if FILE1 has been changed more recently than FILE2, or if FILE1 exists and FILE2 does not. |
[FILE1 -ot FILE2] | True if FILE1 is older than FILE2, or is FILE2 exists and FILE1 does not. |
[FILE1 -ef FILE2] | True if FILE1 and FILE2 refer to the same device and inode numbers. |
- for loop:
for NAME [in LIST ]; do COMMANDS; done
- while loop:
while CONTROL-COMMAND; do CONSEQUENT-COMMANDS; done
- until loop:
until TEST-COMMAND; do CONSEQUENT-COMMANDS; done
- break and continue
3.7 Functions
FUNCTION () { COMMANDS; }
4. Example
Assuming that you have 5 sequencing data, you are trying to check the mapping quality for each sample based on the output log. The output log is looks like:
> ll
drwxr-xr-x@ 7 liyang staff 238 Mar 22 01:08 01.input
drwxr-xr-x@ 3 liyang staff 102 Mar 22 01:24 02.output
-rw-r--r-- 1 liyang staff 145 Mar 22 01:27 README
drwxr-xr-x 3 liyang staff 102 Mar 22 01:26 bin
> cd 01.input
> ls
sample1 sample2 sample3 sample4 sample5
> cd sample1
> cat Log.final.out
Started job on | Dec 22 13:14:21
Started mapping on | Dec 22 14:10:01
Finished on | Dec 22 14:49:53
Mapping speed, Million of reads per hour | 29.16
Number of input reads | 19372132
Average input read length | 51
UNIQUE READS:
Uniquely mapped reads number | 17395896
Uniquely mapped reads % | 89.80%
Average mapped length | 50.93
Number of splices: Total | 5038
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 3677
Number of splices: GC/AG | 158
Number of splices: AT/AC | 8
Number of splices: Non-canonical | 1195
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.00%
Deletion average length | 1.48
Insertion rate per base | 0.00%
Insertion average length | 1.36
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1080034
% of reads mapped to multiple loci | 5.58%
Number of reads mapped to too many loci | 189582
% of reads mapped to too many loci | 0.98%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 2.54%
% of reads unmapped: other | 1.10%
Based on these information, you need to write one bash script to extract the number of input reads, uniquely mapped reads number and multi-mapped reads number, and generate one summary file. Usually, the uniquely mapped ratio were used to measure the mapping quality. The sample do not pass the criteria should be labeled.
#!/usr/bin/bash
set -o nounset
set -o errexit
#echo "$OPTIND start at $OPTIND"
while getopts ":i:o:n:p:" optname; do
case $optname in
i)
input="$OPTARG";;
o)
outputDir="$OPTARG";;
n)
cutoff="$OPTARG";;
p)
prefix="$OPTARG";;
?)
echo "Usage: `basename $0` -i input -o outputDir -n cutoff -p prefix";;
:)
echo "No argument value for option $OPTARG";;
esac
# echo "$OPTIND is now $OPTIND"
# echo $#
done;
# Initialize variables
if [ $# -eq 8 ]; then
outputDir="${outputDir%*/}"
echo "The input file is "`basename ${input}`
echo "The output directory in "${outputDir}
echo "The cutoff is "${cutoff}
echo "the prefix for output file is "${prefix}
# get values from input file
totalN=`cat ${input} | grep 'Number of input reads' | cut -f 2`
uniqN=`cat ${input} | grep 'Uniquely mapped reads number' | cut -f 2`
ratio=`bc <<< "scale=4; $uniqN/$totalN"`
multiN=`cat ${input} | grep 'Number of reads mapped to multiple loci' | cut -f 2`
if (( $(echo "$ratio > $cutoff" | bc -l) )); then
echo "The mapping result is pass the cutoff"
echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"pass"}' >> $outputDir/$prefix.sta
else
echo "The mapping result is NOT pass the cutoff"
echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"fail"}' >> $outputDir/$prefix.sta
fi
echo "Job finished!"
fi
Run the script:
for i in `ls 01.input/`;
do echo $i;
bash bin/sta.sh -i 01.input/$i/Log.final.out -o 02.output/ -n 0.9 -p summary;
done
Homework
level 1: type the code in your computer and understand the meaning for each command.
download link: Week_2_files/bash_example.zip
level 2: try to write one bash script to check the md5 of files in folder and output the file name of the truncated file.
> cd homework/checkMD5/
> ls
file1 file2 file3 file4 file5 md5sum.txt
> cat md5sum.txt
MD5 (./file1) = 15e8d15469ba992caac192b8684396a3
MD5 (./file2) = 85dc26fde0917b4dbba6339607b63d31
MD5 (./file3) = 62fdd313368e8125115ef53a3caaf95b
MD5 (./file4) = d09c83de50bb2da6c0c824fe44ba887a
MD5 (./file5) = db648585dd242f899818247643e599dd
download link: Week_2_files/homework/checkMD5.zip
level 3: try to write one bash script your own.
Reference
https://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html