This presentation is available under the Creative Commons
Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts
hereof.
Introduction to Linux
for Bioinformatics
Working the command line
Joachim Jacob
5 and 12 May 2014
2 of 81
Short recapitulation of last week
Bash only looks in certain directories for
commands/software/programs …
You can find the location of a tool with 'which'.
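A minimal sketch of looking up a tool's location. It uses 'ls' only because it exists on practically every system; 'command -v' is the shell built-in that answers the same question as 'which'.

```shell
# Ask where the 'ls' executable is found (searches the PATH directories)
which ls

# The built-in equivalent, stored in a variable for later use
ls_location=$(command -v ls)
echo "ls lives at: $ls_location"
```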
3 of 81
We can install and run software
● E.g. commands for mapping NGS data on the wiki:
http://wiki.bits.vib.be/index.php/GenomeView_Workshop:_Mapping_excercises#Mapping
4 of 81
Software installation directories
Contain the commands we can execute in the terminal
5 of 81
Software installation directories
How does the terminal know where to look for
executables? (e.g. how does it know bowtie is
in /usr/bin?)
6 of 81
Environment variables
A variable is a name that represents/contains a
value or string. Environment variables describe
your system. Fictitious example:
Environment variable: TREES=green
→ For a program that draws trees, TREES are green on my system.
7 of 81
Programs use env variables
Environment variable: TREES=purple
→ For the same program that draws trees, TREES are now purple on my system.
Depending on how environment variables are set,
programs can change their behaviour.
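The idea can be sketched as follows. TREES is the fictitious variable from the slide, and the 'bash -c' child process merely stands in for the "program that draws trees"; neither is a real tool.

```shell
# Export TREES so that child processes inherit it
export TREES=green
bash -c 'echo "Drawing $TREES trees"'

# Override the variable for one invocation only
TREES=purple bash -c 'echo "Drawing $TREES trees"'

# The exported value in the current shell is unchanged
echo "$TREES"
```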
8 of 81
'env' displays environment vars
9 of 81
Programs need to be in the PATH
● Environment variables contain
values that apply system-wide.
https://help.ubuntu.com/community/EnvironmentVariables
10 of 81
The PATH environment variable
PATH contains a set of directories, separated by ':'
$ echo $PATH
/home/joachim/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
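Because the directories are separated by ':', the list is easier to read when each entry is put on its own line. A small sketch:

```shell
# Print each PATH directory on its own line, in search order
echo "$PATH" | tr ':' '\n'

# Count how many directories bash will search
n_dirs=$(echo "$PATH" | tr ':' '\n' | wc -l)
echo "PATH holds $n_dirs entries"
```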
11 of 81
Installing is just placing the executable
1. You copy the executable to one of the folders
in PATH
2. You create a sym(bolic) link to an executable
in one of the folders in PATH
(see previous week)
3. You add a directory to the PATH variable
Option 1, for example:
$ sudo cp /home/joachim/Downloads/tSNE /usr/local/bin
$ sudo chmod +x /usr/local/bin/tSNE
12 of 81
3. Add a directory to the PATH
export <environment_variable_name>=<value>
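Options 2 and 3 can be tried without sudo in a scratch directory. This is only a sketch: 'mytool' and the directory names are made up for the illustration, and the PATH change lasts only for the current shell.

```shell
# Scratch directories so nothing system-wide is touched
workdir=$(mktemp -d)
mkdir "$workdir/tools" "$workdir/bin"

# A tiny stand-in executable
printf '%s\n' '#!/bin/bash' 'echo "hello from mytool"' > "$workdir/tools/mytool"
chmod +x "$workdir/tools/mytool"

# Option 2: symbolic link in a directory that will be on the PATH
ln -s "$workdir/tools/mytool" "$workdir/bin/mytool"

# Option 3: add that directory to PATH for this shell and its children
export PATH="$workdir/bin:$PATH"

mytool               # now found via PATH
command -v mytool    # shows where it was found
```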
13 of 81
Env variables are stored in a text file
$ cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
/etc is the directory that contains configuration text
files. It is owned by root and holds system-wide settings.
A 'normal' user can create the
file ~/.pam_environment to set variables for their own sessions.
14 of 81
Recap: editing files
Create a text file with the name .pam_environment
and open in an editor:
$ nano .pam_environment
→ quit by pressing ctrl-x
$ gedit .pam_environment
→ graphical
$ vim .pam_environment
→ quit without saving by typing :q! followed by Enter
15 of 81
Create .pam_environment
In ~/.pam_environment, type:
TREES DEFAULT=green
Save the file. Log out and log back in.
16 of 81
Bash variables are limited in scope
You can assign any variable you like in bash.
The name of the variable may contain letters, digits and
underscores, and must not start with a digit.
This variable exists only in this terminal. The
command echo can display the value assigned to that
variable. The value of a variable is referred to by
${varname} or $varname.
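A short sketch of assigning and reading a variable; the name 'organism' is arbitrary. The last step illustrates the limited scope: a child process does not see the variable because it was not exported.

```shell
# Assign: no spaces around '='
organism="Arabidopsis thaliana"

# Read the value back with $name or ${name}
echo "$organism"
echo "${organism}_genome.fa"     # braces mark where the name ends

# A child process does not see the variable unless it is exported
in_child=$(bash -c 'echo "${organism:-unset}"')
echo "child sees: $in_child"
```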
17 of 81
It can be used in scripts!
All the commands you type can also be put one after the
other in a text file, which bash can then execute.
Let's try!
Make a file in your ~ called 'space_left':
Enter the following two bash commands in this file:
df -h .
du -sh */
18 of 81
Running our first bash script
19 of 81
The shebang
A plain text file becomes a script when you add
a shebang line as its first line, stating which program
should read and execute the file.
#!/bin/bash
#!/usr/bin/perl
#!/usr/bin/python
(see our other trainings for perl and python)
20 of 81
Things to remember
● Linux determines file types based on
content (not extension).
● Change the permissions of scripts to read and
execute to allow running them on the command line:
$ chmod +x filename
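Putting the shebang and the permission change together, a minimal sketch. It works in a scratch directory rather than in ~ (an assumption made so that nothing in your home directory is touched) and uses only the 'df' line of the 'space_left' script.

```shell
cd "$(mktemp -d)"                  # scratch directory

# Write a script whose first line is the shebang
printf '%s\n' '#!/bin/bash' 'df -h .' > space_left

# Without execute permission the file cannot be run directly
chmod +x space_left

# Run it: the kernel reads the shebang and hands the file to /bin/bash
./space_left
```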
21 of 81
Exercise
→ A simple bash script
22 of 81
Can you reconstruct the script?
.......................................
.......................................
.......................................
.......................................
23 of 81
One slide on permissions
$ chown user:group filename
$ chmod [ugo][+-][rwx] filename
or
$ chmod [0-7][0-7][0-7] filename
1 stands for execute
2 stands for write
4 stands for read
→ any number from 0 to 7 is a unique
combination of 1, 2 and 4.
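As a worked example of the numeric notation: 754 means 7 = 4+2+1 (read, write, execute) for the user, 5 = 4+1 (read, execute) for the group and 4 (read) for others. The file names below are made up.

```shell
cd "$(mktemp -d)"
touch script.sh other.sh

# Numeric form
chmod 754 script.sh
perms=$(ls -l script.sh | cut -c 1-10)
echo "$perms"

# The same permissions in symbolic form
chmod u=rwx,g=rx,o=r other.sh
perms_symbolic=$(ls -l other.sh | cut -c 1-10)
echo "$perms_symbolic"
```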
24 of 81
Passing arguments to bash scripts
We can pass arguments to our scripts: they are
stored, in order, in variables called $1, $2, $3,...
Make a file called 'arguments.sh' with the following
contents (copy-paste is fine, but make sure the quotes are plain double quotes "):
#!/bin/bash
firstarg=$1
secondarg=$2
echo "You have entered $firstarg and $secondarg"
25 of 81
Passing arguments to bash scripts
Make your script executable.
$ chmod +x arguments.sh
26 of 81
Passing arguments to bash scripts
Let's try to look at it, and run it.
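A sketch of the whole round trip in a scratch directory: create the script (written here with plain double quotes), make it executable, look at it and run it. The arguments 'apple' and 'banana' are arbitrary.

```shell
cd "$(mktemp -d)"

# Create the script; the quoted 'EOF' keeps $1 and $2 from being expanded now
cat > arguments.sh <<'EOF'
#!/bin/bash
firstarg=$1
secondarg=$2
echo "You have entered $firstarg and $secondarg"
EOF

chmod +x arguments.sh
cat arguments.sh                 # look at it
./arguments.sh apple banana      # run it with two arguments
```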
27 of 81
Arguments are separated by white spaces
The string after the command is split on
white space. Different cases (note the " and \):
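The different cases can be sketched with a small helper function that reports how many arguments it received. 'count_args' is a hypothetical name, not an existing command.

```shell
# Helper that reports how many arguments it received
count_args() {
    echo "$# argument(s), first: $1"
}

count_args one two three      # split on spaces: 3 arguments
count_args "one two" three    # quotes keep 'one two' together: 2 arguments
count_args one\ two three     # backslash escapes the space: 2 arguments
```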
29 of 81
Useful example in bioinformatics
For example, look at the script on our wiki:
http://wiki.bits.vib.be/index.php/Bfast
Lines starting with # are ignored.
30 of 81
Chaining command line tools
This is the ultimate power of Unix-like OSes. The
philosophy is that every tool should do one
small specific task. By combining tools we can
create a bigger piece of software fulfilling our
needs.
How can we combine different tools?
1. writing scripts
2. pipes
31 of 81
Chaining the output to input
What the programs take in, and what they print out...
32 of 81
Chaining the output to input
We can take the output of one program, store it,
and use it as input for another program
~ Assembly line
33 of 81
Deliverance through channels
When a program is executed, 3 channels are opened:
● stdin: an input channel – what is read by the program
● stdout: channel used for functional output
● stderr: channel used for error reporting
In UNIX, every open file has an identification number
called a file descriptor. By convention:
● stdin is 0
● stdout is 1
● stderr is 2
http://www.linux-tutorial.info/modules.php?name=MContent&pageid=21
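To see that stdout and stderr really are separate channels, here is a sketch with a made-up function 'report' that writes one line to each. Each channel is then captured on its own.

```shell
# Writes one line to stdout (1) and one line to stderr (2)
report() {
    echo "result on stdout"
    echo "problem on stderr" >&2
}

# Keep only channel 1: channel 2 is discarded
out=$(report 2>/dev/null)

# Keep only channel 2: first point 2 at the capture, then discard 1
err=$(report 2>&1 >/dev/null)

echo "captured from stdout: $out"
echo "captured from stderr: $err"
```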
34 of 81
I/O redirection of terminal programs
$ cat --help
Usage: cat [OPTION]... [FILE]...
Concatenate FILE(s), or standard input,
to standard output.
“When cat is run it waits for input. As soon as an enter is entered output is written to STDOUT.”
[Diagram: cat with STDIN (channel 0), STDOUT (channel 1) and STDERR (channel 2)]
35 of 81
I/O redirection
When cat is launched without any arguments, the
program reads from stdin (keyboard) and writes to
stdout (terminal).
Example:
$ cat
type:
DNA: National Dyslexia Association↵
result:
DNA: National Dyslexia Association
You can stop the program using the 'End Of Input'
character CTRL-D
36 of 81
I/O redirection of terminal programs
Takes input from the keyboard, attached to
STDIN.
[Diagram: cat with its channels 0, 1 and 2, labelled STDIN (channel 0), STDOUT (channel 1) and STDERR (channel 2)]
37 of 81
I/O redirection of terminal programs
Takes input from a file, which is attached to
STDIN.
[Diagram: cat with its channels 0, 1 and 2; STDOUT (channel 1) and STDERR (channel 2) labelled]
38 of 81
I/O redirection of terminal programs
Connect a file to STDIN:
$ cat 0< file
or shorter:
$ cat < file
or, most commonly (what we know already), with the file
name as an argument, so that cat opens the file itself:
$ cat file
Example:
$ cat ~/arguments.sh
Try also:
$ cat 0<arguments.sh
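A quick check that the three forms print the same text, using a temporary file with two made-up lines:

```shell
demo=$(mktemp)
printf 'first line\nsecond line\n' > "$demo"

a=$(cat 0< "$demo")    # explicit channel number
b=$(cat < "$demo")     # channel 0 is the default for '<'
c=$(cat "$demo")       # file name as argument

[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "all three forms print the same text"
```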
39 of 81
Output redirection
Output can be written to files instead of the
terminal.
[Diagram: cat with its channels 0, 1 and 2; STDOUT (channel 1) and STDERR (channel 2) labelled]
40 of 81
Output redirection
● The stdout output of a program can be
saved to a file (or device):
$ cat 1> file
or short:
$ cat > file
● Examples:
$ ls -lR / > /tmp/ls-lR
$ less /tmp/ls-lR
41 of 81
Chaining the output to input
You may have noticed that running:
$ ls -lR / > /tmp/ls-lR
still prints some warnings/errors on the screen: these
are all written to STDERR (note: channel 1 is
redirected to a file, leaving only channel 2 connected to the
terminal)
42 of 81
Chaining the output to input
Redirect the errors to a file called 'error.txt'.
$ ls -lR /
?
43 of 81
Chaining the output to input
Redirect the error channel to a file error.txt (no space between 2 and >).
$ ls -lR / 2> error.txt
$ less error.txt
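A smaller sketch of the same idea that finishes quickly: listing one existing and one non-existing path produces both regular output and an error message. The path /no/such/dir is assumed not to exist; '|| true' only keeps a strict shell from stopping on the expected failure.

```shell
cd "$(mktemp -d)"

# Channel 1 goes to listing.txt, channel 2 to error.txt
ls . /no/such/dir > listing.txt 2> error.txt || true
cat error.txt

# Send both channels to the same file: redirect 1 first, then point 2 at 1
ls . /no/such/dir > both.txt 2>&1 || true
```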
44 of 81
Beware of overwriting output
IMPORTANT: if you redirect output to an existing file with >, its contents
are replaced by the output.
To append to a file, use:
$ cat 1>> file
or short
$ cat >> file
Example:
$ echo "Hello" >> append.txt
$ echo "World" >> append.txt
$ cat append.txt
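The difference between replacing and appending, side by side, in a scratch directory (the file name greet.txt is made up):

```shell
cd "$(mktemp -d)"

echo "Hello" > greet.txt
echo "World" > greet.txt      # '>' replaces: the file now holds only "World"
overwritten=$(cat greet.txt)

echo "Hello" > append.txt
echo "World" >> append.txt    # '>>' appends: the file holds both lines
appended=$(cat append.txt)

echo "$overwritten"
echo "$appended"
```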
45 of 81
Chaining the output to input
INPUT                                                    OUTPUT
< filename                   Input from file             > filename     Output to a file
<< EOF                       Input until string EOF      >> filename    Append output to a file
<<< "This string is read"    Input directly from string
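The two less familiar input forms can be sketched as follows. Note that '<<<' (the here-string) is a bash feature and is not available in every shell; the file names are made up.

```shell
cd "$(mktemp -d)"

# Here-document: sort reads everything up to the line 'EOF'
sort > sorted.txt <<EOF
chr2
chr1
chr3
EOF
cat sorted.txt

# Here-string (bash): the string itself is the input
tr 'a-z' 'A-Z' <<< "This string is read" > upper.txt
cat upper.txt
```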
46 of 81
Special devices
● For input:
/dev/zero all zeros
/dev/urandom (pseudo) random numbers
● For output:
/dev/null 'bit-heaven'
Example:
You are not interested in the errors from a
certain command: send them to /dev/null !
$ ls -alh / 2> /dev/null
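A small sketch touching all three devices. It assumes 'head -c' is available (it is on GNU and BSD systems) and that /no/such/dir does not exist.

```shell
# Discard an expected error message
ls /no/such/dir 2> /dev/null || true

# /dev/zero and /dev/urandom never end, so read a fixed number of bytes
n_zero=$(head -c 16 /dev/zero | wc -c)
n_rand=$(head -c 16 /dev/urandom | wc -c)
echo "read $n_zero zero bytes and $n_rand random bytes"
```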
47 of 81
Summary of output redirection
● Errors redirected to /dev/null
● The program can run on its
own: input from file, output to
file.
[Diagram: cat with its channels 0, 1 and 2; /dev/null attached to the error channel]
49 of 81
Plumbing with in- and outputs
● Example:
$ ls -lR ~ > /tmp/ls-lR
$ less < /tmp/ls-lR ('Q' to quit)
$ rm -rf /tmp/ls-lR
can be shortened to:
$ ls -lR ~ | less
(Formally speaking: the stdout channel of ls is connected to the
stdin channel of less)
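The equivalence can be checked in a non-interactive way by replacing 'less' with 'wc -l'. A sketch in a scratch directory with three made-up files:

```shell
cd "$(mktemp -d)"
touch a.txt b.txt c.txt

# Two steps with a temporary file in between
tmpfile=$(mktemp)
ls > "$tmpfile"
via_file=$(wc -l < "$tmpfile")
rm -f "$tmpfile"

# One step with a pipe: stdout of ls is connected to stdin of wc
via_pipe=$(ls | wc -l)

echo "via file: $via_file, via pipe: $via_pipe"
```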
50 of 81
Combining pipe after pipe
Pipes pass the output
of one command to another
command, which in
turn can pipe it on,
until the final output is reached.
$ history | awk '{ print $2 }' \
| sort | uniq -c | sort -nr | head -3
237 ls
180 cd
103 ll
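Since 'history' is only filled in interactive shells, the same counting pattern is sketched here on a fixed, made-up list of commands that stands in for the second column of the history output.

```shell
commands='ls
cd
ls
ll
ls
cd'

# sort groups identical lines, uniq -c counts them, sort -nr ranks by count
printf '%s\n' "$commands" | sort | uniq -c | sort -nr | head -n 3

# The most frequent command only
top=$(printf '%s\n' "$commands" | sort | uniq -c | sort -nr | head -n 1 | awk '{ print $2 }')
echo "most used: $top"
```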
51 of 81
Text tools combine well with pipes
UNIX has an extensive toolkit for text analysis:
● Extraction: head, tail, grep, awk, uniq
● Reporting: wc
● Manipulation: dos2unix, sort, tr, sed
But, the UNIX tool for heavy text parsing is perl (see
https://www.bits.vib.be/index.php/training/175-perl)
52 of 81
Grep: filter lines from text
grep extracts lines that match a string.
Syntax:
$ grep [options] regex [file(s)]
The file(s) are read line by line. If the line matches
the given criteria, the entire line is written to
stdout.
53 of 81
Grep example
A GFF file contains genome annotation
information. Different types of annotations are
mixed: gene, mRNA, exons, …
Extracting one type of annotation is very easy
with grep.
Task:
Extract all lines for locus Os01g01070 from
all.gff3 (should be somewhere in your Rice folder).
54 of 81
Grep example
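A sketch of the task on a tiny stand-in file. The three annotation lines below are made up for the illustration and are not taken from all.gff3; on the real data you would run the same grep on all.gff3.

```shell
gff=$(mktemp)

# Three made-up, tab-separated annotation lines
printf 'Chr1\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=LOC_Os01g01070\n'     > "$gff"
printf 'Chr1\tdemo\tmRNA\t1000\t2000\t.\t+\t.\tParent=LOC_Os01g01070\n' >> "$gff"
printf 'Chr1\tdemo\tgene\t5000\t6000\t.\t-\t.\tID=LOC_Os01g01080\n'     >> "$gff"

# Print every line that mentions the locus
grep 'Os01g01070' "$gff"

# Count the matching lines instead of printing them
n_hits=$(grep -c 'Os01g01070' "$gff")
echo "$n_hits matching lines"
```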
55 of 81
Grep
-i: ignore case
matches the regex case insensitively
-v: invert
shows all lines that do not match the regex
-l: list
shows only the names of the files that contain a
match
-C n: context
shows n lines around the match (note: -n on its own
prefixes each matching line with its line number)
--color:
highlights the match
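The options can be tried on three made-up lines of text:

```shell
text='Gene alpha
gene beta
exon gamma'

printf '%s\n' "$text" | grep -i 'gene'      # 'Gene alpha' and 'gene beta'
printf '%s\n' "$text" | grep -v 'gene'      # lines without lowercase 'gene'
printf '%s\n' "$text" | grep -n 'exon'      # match prefixed with its line number
printf '%s\n' "$text" | grep -C 1 'beta'    # the match plus one line on each side
```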
56 of 81
Finetuning filtering with regexes
A regular expression, aka regex, is a formal way of
describing sets of strings, used by many tools: grep,
sed, awk, ...
It resembles wildcard functionality (e.g. ls *) (also
called globbing), but is more extensive.
http://www.faqs.org/docs/abs/HTML/regexp.html
http://www.slideshare.net/wvcrieki/bioinformatics-p2p3perlregexes-v2013wimvancriekinge?utm_source=slideshow&utm_medium=ssemail&u
57 of 81
Basics of regexes
A plain character in a regex matches itself.
. = any character
^ = beginning of the line
$ = end of the line
[ ] = a set of characters
Example:
$ grep 'chr[1-5]' all.gff3
Regex                        Matching string set
chr                          chr1, chr2, chr3, chr4, chr5
chr[1-5]                     chr1, chr2, chr3, chr4, chr5
chr.                         chr1, chr2, chr3, chr4, chr5
AAF12.[1-3]                  AAF12.1, AAF12.2, AAF12.3
AT[15]G[[:digit:]]+\.[12]    AT5G08160.1, AT5G08160.2, AT5G10245.1, AT1G14525.1
(the + in the last regex needs extended regexes: grep -E or egrep)
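A sketch for testing such regexes on a few strings before using them on a real file. The four AT... identifiers come from the table; AT3G00000.3 is a made-up identifier added as a non-matching case, and the regex is written in the form GNU grep accepts.

```shell
ids='AT5G08160.1
AT5G08160.2
AT5G10245.1
AT1G14525.1
AT3G00000.3'

# -E enables extended regexes (needed for +), -c counts matching lines
n_match=$(printf '%s\n' "$ids" | grep -E -c '^AT[15]G[[:digit:]]+\.[12]$')
echo "$n_match identifiers match"

# A character set: chr1 and chr2 match, chr9 and chrC do not
n_chr=$(printf 'chr1\nchr2\nchr9\nchrC\n' | grep -c 'chr[1-5]')
echo "$n_chr chromosome names match"
```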
58 of 81
Basics of regexes
Example: from TAIR9_mRNA.bed, extract the mRNA
structures on chr1 that lie on the + strand.
$ egrep '^chr1.+\+' TAIR9_mRNA.bed > out.txt
● ^ matches the start of a line
● ^chr1 matches lines with 'chr1' appearing at the beginning
● . matches any character
● .+ matches any string of one or more characters
● \+ matches a '+' symbol as such: since + is a special character
(standing for a repeat of one or more), we need to escape it
Together, in this order, the regex
selects the lines of chr1 on the + strand.
59 of 81
Finetuning filtering with regexes
$ egrep '^chr1.+\+' TAIR9_mRNA.bed > out.txt
chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3 541,322,4
chr1 16269074 16270513 AT1G43171.1 0 + 16269988 16270327 0
chr1 28251959 28253619 AT1G75280.1 0 + 28252029 28253355 0
chr1 693479 696382 AT1G03010.1 0 + 693479 696188 0 5 92,67,119
http://www.faqs.org/docs/abs/HTML/regexp.html
http://www.slideshare.net/wvcrieki/bioinformatics-p2p3perlregexes-v2013wimvancriekinge?utm_source=slideshow&utm_medium=ssemail&u
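A sketch of the filter on a three-line stand-in for the BED file. The first line is abbreviated from the slide; the other two are made up, and the column layout (score in field 5, strand in field 6) is an assumption based on the standard BED format. The awk variant is a stricter alternative that tests the columns directly rather than relying on a '+' appearing somewhere in the line.

```shell
bed=$(mktemp)
printf 'chr1\t2025600\t2027271\tAT1G06620.1\t0\t+\n' >  "$bed"
printf 'chr1\t100\t200\tDEMO_MINUS\t0\t-\n'          >> "$bed"
printf 'chr2\t300\t400\tDEMO_CHR2\t0\t+\n'           >> "$bed"

# Regex approach: starts with chr1 and has a literal + further on
hits_grep=$(grep -E -c '^chr1.+\+' "$bed")

# Column approach: field 1 is chr1 and field 6 is +
hits_awk=$(awk -F '\t' '$1 == "chr1" && $6 == "+"' "$bed" | wc -l)

echo "grep: $hits_grep, awk: $hits_awk"
```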
60 of 81
Wc – count words in files
A general tool for counting lines, words and characters:
wc [options] file(s)
-c: show number of bytes (equal to characters in plain ASCII text)
-w: show number of words
-l: show number of lines
How many mRNA entries are on chr1 of A. thaliana?
$ wc -l chr1_TAIR9_mRNA.bed
or
$ grep chr1 TAIR9_mRNA.bed | wc -l
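The counting options on three made-up lines, including the grep-plus-wc combination:

```shell
text='chr1 100 200
chr1 300 400
chr2 500 600'

n_lines=$(printf '%s\n' "$text" | wc -l)
n_words=$(printf '%s\n' "$text" | wc -w)

# Count only the lines that start with chr1
n_chr1=$(printf '%s\n' "$text" | grep '^chr1' | wc -l)

echo "lines: $n_lines, words: $n_words, chr1 lines: $n_chr1"
```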
61 of 81
Translate
To replace characters:
$ tr 's1' 's2'
! tr always reads from stdin – you cannot specify any
files as command line arguments. Characters in s1
are replaced by characters in s2.
Example:
$ echo 'James Watson' | tr '[a-z]' '[A-Z]'
JAMES WATSON
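Because tr maps characters one by one, it can also be sketched for a bioinformatics task such as complementing a DNA sequence. The sequence below is made up.

```shell
# Each character of the first set is replaced by the character
# at the same position in the second set: A->T, C->G, G->C, T->A
complement=$(echo 'ATGCCG' | tr 'ACGT' 'TGCA')
echo "$complement"

# The slide's example, written with plain ranges
upper=$(echo 'James Watson' | tr 'a-z' 'A-Z')
echo "$upper"
```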
62 of 81
Delete characters
To remove a particular set of characters:
$ tr -d 's1'
Deletes all characters in s1
Example:
$ tr -d '\r' < DOStext > UNIXtext
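A sketch that makes the effect visible: a two-line file with DOS line endings (carriage return plus newline) loses exactly one byte per line.

```shell
cd "$(mktemp -d)"

# Two lines ending in \r\n, as written by many Windows programs
printf 'line one\r\nline two\r\n' > DOStext

# Delete every carriage return
tr -d '\r' < DOStext > UNIXtext

dos_bytes=$(wc -c < DOStext)
unix_bytes=$(wc -c < UNIXtext)
echo "before: $dos_bytes bytes, after: $unix_bytes bytes"
```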
63 of 81
awk can extract and rearrange
… specific fields and do calculations and
manipulations on them.
awk -F delim '{ print $x }'
● -F delim: the field separator (default is white space)
● $x the field number:
$0: the complete line
$1: first field
$2: second field
…
NF is the number of fields ($NF refers to the last
field).
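A few one-liners on a single line of text (the first four columns of the BED line used on these slides) as a sketch of field selection, NF and a simple calculation:

```shell
line='chr1 2025600 2027271 AT1G06620.1'

fields=$(echo "$line" | awk '{ print $2, $3 }')      # second and third field
last=$(echo "$line" | awk '{ print NF, $NF }')       # number of fields, last field
length=$(echo "$line" | awk '{ print $3 - $2 }')     # end minus start
second=$(echo 'a:b:c' | awk -F ':' '{ print $2 }')   # custom field separator

echo "$fields"
echo "$last"
echo "$length"
echo "$second"
```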
64 of 81
awk can extract and rearrange
For example: TAIR9_mRNA.bed needs to be converted to
.gff (general feature format). See the .gff format
http://wiki.bits.vib.be/index.php/.gff
With awk this can easily be done! One line of .bed looks
like:
chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3 541,322,429, 0,
→ this needs to become one line of .gff
65 of 81
awk can extract and rearrange
$ awk '{print $1"\tawk\tmRNA\t"$2"\t"$3"\t" \
$5"\t"$6"\t0\t"$4 }' TAIR9_mRNA.bed
chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3 541,322,429, 0
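A sketch of the conversion on one abbreviated BED line, so the rearrangement of the fields can be checked by eye. It assumes the standard BED layout (name in field 4, score in field 5, strand in field 6); the source name 'awk' and the phase '0' are simply the constants from the command above.

```shell
# First ten tab-separated fields of one BED line
bed_line=$(printf 'chr1\t2025600\t2027271\tAT1G06620.1\t0\t+\t2025617\t2027094\t0\t3')

# chrom, source, type, start, end, score, strand, phase, name
gff_line=$(printf '%s\n' "$bed_line" | awk '{ print $1"\tawk\tmRNA\t"$2"\t"$3"\t"$5"\t"$6"\t0\t"$4 }')

echo "$gff_line"
```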
66 of 81
Awk has also a filtering option
Extraction of one or more fields from a tabular data
stream, for lines that match a given regex:
awk -F delim '/regex/ { print $x }'
Here:
● delim: the delimiter in the file
● regex: a regular expression
The awk script is executed only if the line matches
regex; lines that do not match regex are removed
from the stream.
Further excellent documentation on awk:
http://www.grymoire.com/Unix/Awk.html
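The filtering form on three made-up, tab-separated lines: only lines containing 'gene' reach the print statement.

```shell
genes=$(printf 'chr1\tgene\t100\nchr1\tmRNA\t150\nchr2\tgene\t900\n' \
    | awk -F '\t' '/gene/ { print $1, $3 }')

echo "$genes"
```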
67 of 81
Cut selects columns
Cut extracts fields from text files:
● Using fixed delimiter
$ cut [-d delim] -f <fields> [file]
● chopping on fixed width
$ cut -c <fields> [file]
For <fields>:
N      the Nth element
N-M    the Nth through the Mth element
N-     from the Nth element on
-M     up to the Mth element
The first element is 1.
68 of 81
Cutting columns from text files
Fixed width example:
Suppose there is a file fixed.txt with content
12345ABCDE67890FGHIJ
To extract a range of characters:
$ cut -c 6-10 fixed.txt
ABCDE
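Both modes of cut in one sketch: the fixed-width example from the slide, and delimiter-based selection on a made-up colon-separated line.

```shell
# Fixed width: characters 6 through 10
by_width=$(echo '12345ABCDE67890FGHIJ' | cut -c 6-10)

# Delimiter: fields 1 and 4, then field 2 to the end
by_fields=$(echo 'chr1:2025600:2027271:+' | cut -d ':' -f 1,4)
from_second=$(echo 'chr1:2025600:2027271:+' | cut -d ':' -f 2-)

echo "$by_width"
echo "$by_fields"
echo "$from_second"
```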
69 of 81
Sorting output
To sort alphabetically or numerically lines of text:
$ sort [options] file(s)
When more files are specified, they are read one by
one, but all lines together are sorted.
70 of 81
Sorting options
● -n sort numerically
● -f fold – case-insensitive
● -r reverse sort order
● -t s use s as field separator (instead of white space)
● -k n,n sort on the n-th field only (1 being the first field);
a plain -k n sorts from the n-th field to the end of the line
Example: sort mRNA by chromosome and
next by number of exons.
$ sort -k1,1 -k10,10n TAIR9_mRNA.bed > \
out.bed
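A sketch of sorting on two keys with made-up data: first on the chromosome name as text, then on the second column as a number (so that 3 comes before 12).

```shell
data='chr2 5
chr1 12
chr1 3
chr2 1'

# -k1,1 restricts the first key to field 1; -k2,2n sorts field 2 numerically
sorted=$(printf '%s\n' "$data" | sort -k1,1 -k2,2n)
echo "$sorted"
```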
71 of 81
Detecting unique records with uniq
● eliminate duplicate lines in a set of files
● display unique lines
● display and count duplicate lines
Very important: uniq only compares adjacent lines, so it
always needs sorted input.
Useful option:
-c count the number of occurrences of each line.
72 of 81
Eliminate duplicates
● Example:
$ who
root tty1 Oct 16 23:20
james tty2 Oct 16 23:20
james pts/0 Oct 16 23:21
james pts/1 Oct 16 23:22
james pts/2 Oct 16 23:22
$ who | awk '{print $1}' | sort | uniq
james
root
73 of 81
Display unique or duplicate lines
● To display lines that occur only once:
$ uniq -u file(s)
● To display lines that occur more than once:
$ uniq -d file(s)
Example:
$ who|awk '{print $1}'|sort|uniq -d
james
● To display the counts of the lines
$ uniq -c file(s)
Example
$ who | awk '{print $1}' | sort | uniq -c
4 james
1 root
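Since the output of 'who' differs per machine, this sketch uses a fixed list of user names in an unsorted order. It also shows why the sort step matters: without it, non-adjacent duplicates survive.

```shell
users='james
root
james
james
james'

# Without sorting, the two groups of 'james' are not adjacent
n_unsorted=$(printf '%s\n' "$users" | uniq | wc -l)

# With sorting, every duplicate is removed
n_sorted=$(printf '%s\n' "$users" | sort | uniq | wc -l)

# Count per user; awk only strips the leading spaces from uniq -c
counts=$(printf '%s\n' "$users" | sort | uniq -c | awk '{ print $1, $2 }')

echo "unsorted: $n_unsorted lines, sorted: $n_sorted lines"
echo "$counts"
```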
74 of 81
Edit per line with sed
Sed (the stream editor) can make changes in text
per line. It works on files or on STDIN.
See http://www.grymoire.com/Unix/Sed.html
This is also a very big tool, but we will only look at
the substitute function (the most used one).
$ sed -e 's/r1/s1/' file(s)
s: the substitute command
/: separator
r1: regex to be replaced
s1: text that will replace the regex match
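The substitute command on short strings. By default only the first match on each line is replaced; the trailing 'g' flag (not mentioned on the slide, but a standard part of the substitute command) replaces every match.

```shell
first_only=$(echo 'chr1 chr2' | sed -e 's/chr/Chr/')
every=$(echo 'chr1 chr2' | sed -e 's/chr/Chr/g')

# A regex anchor: replace @ only at the start of the line
header=$(echo '@read1' | sed -e 's/^@/>/')

echo "$first_only"
echo "$every"
echo "$header"
```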
75 of 81
Paste several lines together
Paste allows you to concatenate every n lines into one
line, ideal for manipulating fastq files.
We can combine paste with cut, sed and tr to convert fastq to fasta:
$ paste - - - - < in.fq | \
cut -f 1,2 | \
sed 's/^@/>/' | \
tr "\t" "\n" > out.fa
http://thegenomefactory.blogspot.be/2012/05/cool-use-of-unix-paste-with-ngs.html
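The pipeline can be tried on a two-read fastq file. The reads below are made up; each record has the usual four lines (header, sequence, '+', qualities).

```shell
cd "$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > in.fq

# paste joins every four lines with tabs, cut keeps header and sequence,
# sed turns the leading @ into >, tr puts the sequence back on its own line
paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa

cat out.fa
```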
76 of 81
Transpose is not a standard tool
It is extremely useful, though: it transposes tabular text
files, exchanging columns for rows and vice versa.
http://sourceforge.net/projects/transpose/
77 of 81
Building your pipelines
[Diagram: pipeline building blocks: grep, tr, sed, awk, sort, wc, uniq, transpose, paste]
78 of 81
Fill in the tools
Filter on lines and select different columns: …
Merge identical fields: …
Filter lines: …
Sort lines: …
Select columns: …
Replace characters: …
Replace strings: …
Transpose: …
Numerical summary: …
79 of 81
Exercise
→ Text manipulation exercises
80 of 81
Keywords
Environment variable
PATH
shebang
script
argument
STDIN
pipe
comment
Write in your own words what the terms mean
81 of 81
Break

Part 5 of "Introduction to Linux for Bioinformatics": Working the command line's text tools

  • 1. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. Introduction to Linux for Bioinformatics Working the command line Joachim Jacob 5 and 12 May 2014
  • 2. 2 of 81 Short recapitulation of last week Bash only looks at certain directories for commands/software/programs … The location of a tool you can find with 'which'.
  • 3. 3 of 81 We can install and run software ● E.g. commands for mapping NGS data on the wiki: http://wiki.bits.vib.be/index.php/GenomeView_Workshop:_Mapping_excerc ises#Mapping
  • 4. 4 of 81 Software installation directories Contain the commands we can execute in the terminal
  • 5. 5 of 81 Software installation directories How does the terminal know where to look for executables? (e.g. how does it know bowtie is in /usr/bin?)
  • 6. 6 of 81 Environment variables A variable is a word that represents/contains a value or string. Environment variables describe your system. Fictive example: TREES are green on my system Program that Draws trees. TREES=green Environment variable
  • 7. 7 of 81 Programs use env variables TREES=purple TREES = purple on my system Program that Draws trees. Depending on how environment variables are set, programs can change their behaviour.
  • 8. 8 of 81 'env' displays environment vars
  • 9. 9 of 81 Programs need to be in the PATH ● Environment variables: they contain values, applicable system wide. ● ● https://help.ubuntu.com/community/EnvironmentVariables
  • 10. 10 of 81 The PATH environment variable PATH contains a set of directories, separated by ':' $ echo $PATH /home/joachim/bin:/usr/local/sbin:/usr/local/bin:/usr/ sbin:/usr/bin:/sbin:/bin:/usr/games
  • 11. 11 of 81 Installing is just placing the executable 1. You copy the executable to one of the folders in PATH 2. You create a sym(bolic) link to an executable in the one of the folders in PATH (see previous week) 3. You add a directory to the PATH variable $ sudo cp /home/joachim/Downloads/tSNE /usr/local/bin $ sudo chmod +x /usr/local/bin/tSNE
  • 12. 12 of 81 3. Add a directory to the PATH Export <environment_variable_name>=<value>
  • 13. 13 of 81 Env variables are stored in a text file $ cat /etc/environment PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin :/bin:/usr/games" /etc is the directory that contains configuration text files. It is only owned by root: system-wide settings. A 'normal' user (session-wide settings) can create the file ~/.pam_environment to set the vars with content
  • 14. 14 of 81 Recap: editing files Create a text file with the name .pam_environment and open in an editor: $ nano .pam_environment → quit by pressing ctrl-x $ gedit .pam_environment → graphical $ vim .pam_environment → quit by pressing one after the other :q!
  • 15. 15 of 81 Create .pam_environment In ~/.pam_environment, type: TREES DEFAULT=green Save the file. Log out and log back in.
  • 16. 16 of 81 Bash variables are limited in scope You can assign any variable you like in bash, like: The name of the variable can be any normal string. This variable exists only in this terminal. The command echo can display the value assigned to that variable. The value of a variable is referred to by $ {varname} or $varname.
  • 17. 17 of 81 It can be used in scripts! All commands you type, you can put one after the other in a text file, and let bash execute it. Let's try! Make a file in your ~ called 'space_left': Enter two following bash commands in this file: df -h . du -sh */
  • 18. 18 of 81 Running our first bash script
  • 19. 19 of 81 The shebang Simple text files become Bash scripts when adding a shebang line as first line, saying which program should read and execute this text file. #!/bin/bash #!/usr/bin/perl #!/usr/bin/python (see our other trainings for perl and python)
  • 20. 20 of 81 Things to remember ● Linux determines files types based on its content (not extension). ● Change permissions of scripts to read and execute to allow running in the command line: $ chmod +x filename
  • 21. 21 of 81 Exercise → A simple bash script
  • 22. 22 of 81 Can you reconstruct the script? ....................................... ....................................... ....................................... .......................................
  • 23. 23 of 81 One slide on permissions $ chown user:group filename $ chmod [ugo][+-][rwx] filename or $ chmod [0-7][0-7][0-7] filename 1 stands for execute 2 stands for write 4 stands for read → any number from 0 to 7 is a unique combination of 1, 2 and 4.
  • 24. 24 of 81 Passing arguments to bash scripts We can pass on arguments to our scripts: they are subsequently stored in variables called $1, $2, $3,... Make a file called 'arguments.sh' with following contents (copy paste is fine – be aware of the “): #!/bin/bash firstarg=$1 secondarg=$2 echo “You have entered ”$firstarg” and ”$secondarg””
  • 25. 25 of 81 Passing arguments to bash scripts Make your script executable. $ chmod +x arguments.sh
  • 26. 26 of 81 Passing arguments to bash scripts Let's try to look at it, and run it.
  • 27. 27 of 81 Arguments are separated by white spaces The string after the command is chopped on the white spaces. Different cases (note the “ and ):
  • 28. 28 of 81 Arguments are separated by white spaces The string after the command is chopped on the white spaces. Different cases (note the “ and ):
  • 29. 29 of 81 Useful example in bioinformatics For example, look at the script on our wiki: http://wiki.bits.vib.be/index.php/Bfast Lines starting with # are ignored.
  • 30. 30 of 81 Chaining command line tools This is the ultimate power of Unix-like OSes. The philosophy is that every tool should do one small specific task. By combining tools we can create a bigger piece of software fulfilling our needs. How to combine different tools? 1. writing scripts 2. pipes
  • 31. 31 of 81 Chaining the output to input What the programs take in, and what they print out...
  • 32. 32 of 81 Chaining the output to input We can take the output of one program, store it, and use it as input for another program ~ Assembly line
  • 33. 33 of 81 Deliverance through channels When a program is executed, 3 channels are opened: ● stdin: an input channel – what is read by the program ● stdout: channel used for functional output ● stderr: channel used for error reporting In UNIX, open files have an identification number called a file descriptor: ● stdin called by 0 ● stdout called by 1 ● stderr called by 2 (*) by convention http://www.linux-tutorial.info/modules.php?name=MContent&pageid=21
  • 34. 34 of 81 I/O redirection of terminal programs $ cat --help Usage: cat [OPTION]... [FILE]... Concatenate FILE(s), or standard input, to standard output. “When cat is run it waits for input. As soon as an enter is entered output is written to STDOUT.” cat STDIN or channel 0 STDOUT or channel 1 STDERR or channel 2
  • 35. 35 of 81 I/O redirection When cat is launched without any arguments, the program reads from stdin (keyboard) and writes to stdout (terminal). Example: $ cat type: DNA: National Dyslexia Association↵ result: DNA: National Dyslexia Association You can stop the program using the 'End Of Input' character CTRL-D
  • 36. 36 of 81 I/O redirection of terminal programs Takes input from the keyboard, attached to STDIN. cat 1 2 0 STDIN or channel 0 STDOUT or channel 1 STDERR or channel 2
  • 37. 37 of 81 I/O redirection of terminal programs Takes input from files, which is attached to STDIN cat 1 2 0 STDERR or channel 2 STDOUT or channel 1
  • 38. 38 of 81 I/O redirection of terminal programs Connect a file to STDIN: $ cat 0< file or shorter: $ cat < file or even shorter (and most used – what we know already) $ cat file Example: $ cat ~/arguments.sh Try also: $ cat 0<arguments.sh
  • 39. 39 of 81 Output redirection Can write output to files, instead of the terminal cat 1 2 0 STDERR or channel 2 STDOUT or channel 1
  • 40. 40 of 81 Output redirection  The stdout output of a program can be saved to a file (or device): $ cat 1> file or short: $ cat > file  Examples: $ ls -lR / > /tmp/ls-lR $ less /tmp/ls-lR
  • 41. 41 of 81 Chaining the output to input You have noticed that running: $ ls -lR / > /tmp/ls-lR outputs some warnings/errors on the screen: this is all output of STDERR (note: channel 1 is redirected to a file, leaving only channel 2 to the terminal)
  • 42. 42 of 81 Chaining the output to input Redirect the errors to a file called 'error.txt'. $ ls -lR / ?
  • 43. 43 of 81 Chaining the output to input Redirect the error channel to a file error.txt (no space between the 2 and the >, otherwise 2 is passed to ls as an argument). $ ls -lR / 2> error.txt $ less error.txt
  • 44. 44 of 81 Beware of overwriting output IMPORTANT, if you write to a file, the contents are being replaced by the output. To append to file, you use: $ cat 1>> file or short $ cat >> file Example: $ echo “Hello” >> append.txt $ echo “World” >> append.txt $ cat append.txt
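The redirections of channel 1 and channel 2, and the difference between > and >>, can be reproduced without scanning the whole file system. In this sketch the function report is a hypothetical stand-in for a real tool; it writes one line to stdout and one to stderr.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a real tool: one line on stdout, one on stderr.
report() {
    echo "result line"
    echo "warning line" >&2
}

report 1> out.txt 2> error.txt    # channel 1 and channel 2 go to separate files
report > kept.txt 2> /dev/null    # keep the result, discard the warnings

echo "Hello" > append.txt         # > replaces the contents of the file
echo "World" >> append.txt        # >> appends to it
cat append.txt
```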
  • 45. 45 of 81 Chaining the output to input INPUT: < filename (input from a file); << EOF (input until the string EOF); <<< "This string is read" (input directly from a string). OUTPUT: > filename (output to a file); >> filename (append output to a file).
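The two less familiar input forms, << and <<<, are easiest to understand by running them. A minimal sketch in Bash (the here-string <<< is a Bash feature and is not available in every shell); the sample lines are made up.

```shell
#!/bin/bash
# Here-document: sort reads everything up to the line that says EOF.
sorted=$(sort <<EOF
chr2
chr1
chr3
EOF
)
echo "$sorted"

# Here-string: the quoted string itself becomes stdin.
upper=$(tr 'a-z' 'A-Z' <<< "this string is read")
echo "$upper"
```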
  • 46. 46 of 81 Special devices ● For input: /dev/zero all zeros /dev/urandom (pseudo) random numbers ● For output: /dev/null 'bit-heaven' Example: You are not interested in the errors from a certain command: send them to /dev/null ! $ ls -alh / 2> /dev/null
  • 47. 47 of 81 Summary of output redirection ● Error direction to /dev/null ● The program can run on its own: input from file, output to file. cat 1 2 0 /dev/null
  • 48. 48 of 81 Plumbing with in- and outputs ● Example: $ ls -lR ~ > /tmp/ls-lR $ less < /tmp/ls-lR ('Q' to quit) $ rm -rf /tmp/ls-lR
  • 49. 49 of 81 Plumbing with in- and outputs ● Example: $ ls -lR ~ > /tmp/ls-lR $ less < /tmp/ls-lR ('Q' to quit) $ rm -rf /tmp/ls-lR can be shortened to: $ ls -lR ~ | less (Formally speaking: the stdout channel of ls is connected to the stdin channel of less)
  • 50. 50 of 81 Combining pipe after pipe Pipes can pass the output of a command to another command, which in turn can pipe it through, until the final output is reached. $ history | awk '{ print $2 }' | sort | uniq -c | sort -nr | head -3 237 ls 180 cd 103 ll
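The history pipeline depends on your own shell history, so its output differs per user. The same chain can be followed on a fixed, made-up list of commands, which stands in for the second column of `history` in this sketch.

```shell
#!/bin/bash
# Made-up command log: ls three times, cd twice, ll once.
top_command=$(printf '%s\n' ls cd ls ll ls cd \
    | sort | uniq -c | sort -nr | head -1 | awk '{ print $2 }')
echo "$top_command"
```

Each stage does one job: sort groups identical lines, uniq -c counts them, sort -nr puts the highest count first, head keeps the top line and awk strips the count.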
  • 51. 51 of 81 Text tools combine well with pipes UNIX has an extensive toolkit for text analysis: ● Extraction: head, tail, grep, awk, uniq ● Reporting: wc ● Manipulation: dos2unix, sort, tr, sed But, the UNIX tool for heavy text parsing is perl (see https://www.bits.vib.be/index.php/training/175-perl)
  • 52. 52 of 81 Grep: filter lines from text grep extracts lines that match a string. Syntax: $ grep [options] regex [file(s)] The file(s) are read line by line. If the line matches the given criteria, the entire line is written to stdout.
  • 53. 53 of 81 Grep example A GFF file contains genome annotation information. Different types of annotations are mixed: gene, mRNA, exons, … Filtering out one type of annotation is very easy with grep. Task: Filter out all lines from locus Os01g01070 in all.gff3 (should be somewhere in your Rice folder).
  • 54. 54 of 81 Grep example
  • 55. 55 of 81 Grep -i: ignore case, matches the regex case-insensitively -v: inverse, shows all lines that do not match the regex -l: list, shows only the names of the files that contain a match -C n: context, shows n lines around the match (-n on its own prints line numbers) --color: highlights the match
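The grep options can be tried on a tiny annotation file. The three lines below are made up for this sketch and only mimic a few columns of the real all.gff3.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three made-up, tab-separated annotation lines.
printf 'Chr1\tgene\tLOC_Os01g01010\n' > mini.gff3
printf 'Chr1\tmRNA\tLOC_Os01g01070\n' >> mini.gff3
printf 'Chr1\texon\tLOC_Os01g01070\n' >> mini.gff3

locus_lines=$(grep -i 'os01g01070' mini.gff3 | wc -l)   # case-insensitive match
other_lines=$(grep -v 'Os01g01070' mini.gff3)           # lines that do NOT match
files_with_mrna=$(grep -l 'mRNA' mini.gff3)             # only the file name

echo "$locus_lines"
echo "$other_lines"
echo "$files_with_mrna"
```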
  • 56. 56 of 81 Finetuning filtering with regexes A regular expression, aka regex, is a formal way of describing sets of strings, used by many tools: grep, sed, awk, ... It resembles wild card functionality (e.g. ls *) (also called globbing), but is more extensive. http://www.faqs.org/docs/abs/HTML/regexp.html http://www.slideshare.net/wvcrieki/bioinformatics-p2p3perlregexes-v2013wimvancriekinge?utm_source=slideshow&utm_medium=ssemail&u
  • 57. 57 of 81 Basics of regexes A plain character in a regex matches itself. . = any character ^ = beginning of the line $ = end of the line [ ] = a set of characters Example: $ grep 'chr[1-5]' all.gff3 Regexes and the strings they match: chr matches chr1, chr2, chr3, chr4, chr5; chr[1-5] matches chr1, chr2, chr3, chr4, chr5; chr. matches chr1, chr2, chr3, chr4, chr5; AAF12.[1-3] matches AAF12.1, AAF12.2, AAF12.3; AT[1,5]G[[:digit:]]+\.[1,2] matches AT5G08160.1, AT5G08160.2, AT5G10245.1, AT1G14525.1
  • 58. 58 of 81 Basics of regexes Example: from TAIR9_mRNA.bed, filter out the mRNA structures from chr1 and only on the + strand. $ egrep '^chr1.+\+' TAIR9_mRNA.bed > out.txt ^ matches the start of a line; ^chr1 matches lines with 'chr1' appearing at the beginning. . matches any char; .+ matches any string of one or more characters. Since + is a special character (standing for a repeat of one or more), we need to escape it: \+ matches a '+' symbol as such. Together in this order, the regex filters out lines of chr1 on the + strand
  • 59. 59 of 81 Finetuning filtering with regexes $ egrep '^chr1.+\+' TAIR9_mRNA.bed > out.txt chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3541,322,4 chr1 16269074 16270513 AT1G43171.1 0 + 16269988 16270327 0 chr1 28251959 28253619 AT1G75280.1 0 + 28252029 28253355 0 chr1 693479 696382 AT1G03010.1 0 + 693479 696188 0 592,67,119 http://www.faqs.org/docs/abs/HTML/regexp.html http://www.slideshare.net/wvcrieki/bioinformatics-p2p3perlregexes-v2013wimvancriekinge?utm_source=slideshow&utm_medium=ssemail&u
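The effect of escaping the + can be verified on a three-line BED file. The coordinates and identifiers below are made up for this sketch; `grep -E` is used as the modern spelling of egrep. The awk line is an added, stricter alternative that tests the chromosome and strand columns directly instead of relying on a '+' appearing somewhere in the line.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up entries: chr1 on +, chr1 on -, chr2 on +.
printf 'chr1\t100\t200\tAT1G00001.1\t0\t+\n' > mini.bed
printf 'chr1\t300\t400\tAT1G00002.1\t0\t-\n' >> mini.bed
printf 'chr2\t100\t200\tAT2G00001.1\t0\t+\n' >> mini.bed

# \+ is a literal plus sign; an unescaped + means "one or more".
plus_chr1=$(grep -E '^chr1.+\+' mini.bed)

# Column-based alternative: field 1 is the chromosome, field 6 the strand.
strict=$(awk -F '\t' '$1 == "chr1" && $6 == "+"' mini.bed)

echo "$plus_chr1"
```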
  • 60. 60 of 81 Wc – count words in files A general tool for counting lines, words and characters: $ wc [options] file(s) -c: show number of characters -w: show number of words -l: show number of lines How many mRNA entries are on chr1 of A. thaliana? $ wc -l chr1_TAIR9_mRNA.bed or $ grep chr1 TAIR9_mRNA.bed | wc -l
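The three counters of wc, and the grep plus wc combination, can be checked on a file small enough to count by hand. The file content is made up for this sketch.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three lines, two words each, seven bytes per line including the newline.
printf 'chr1 a\nchr1 b\nchr2 c\n' > mini.txt

lines=$(wc -l < mini.txt)
words=$(wc -w < mini.txt)
chars=$(wc -c < mini.txt)
chr1_entries=$(grep chr1 mini.txt | wc -l)

echo "$lines $words $chars $chr1_entries"
```

Reading the file through < keeps the file name out of the output, which makes the numbers easier to reuse in a script.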
  • 61. 61 of 81 Translate To replace characters: $ tr 's1' 's2' ! tr always reads from stdin – you cannot specify any files as command line arguments. Characters in s1 are replaced by characters in s2. Example: $ echo 'James Watson' | tr '[a-z]' '[A-Z]' JAMES WATSON
  • 62. 62 of 81 Delete characters To remove a particular set of characters: $ tr -d 's1' Deletes all characters in s1 Example (remove the carriage returns from a DOS text file): $ tr -d '\r' < DOStext > UNIXtext
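Both uses of tr, translating and deleting, fit in a few lines. The file names dos.txt and unix.txt are placeholders chosen for this sketch.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Translate: every lower-case letter is replaced by its upper-case counterpart.
upper=$(echo 'James Watson' | tr 'a-z' 'A-Z')
echo "$upper"

# Delete: strip the carriage returns that DOS/Windows line endings carry.
printf 'line1\r\nline2\r\n' > dos.txt
tr -d '\r' < dos.txt > unix.txt
dos_bytes=$(wc -c < dos.txt)
unix_bytes=$(wc -c < unix.txt)
echo "$dos_bytes $unix_bytes"
```

The converted file is two bytes shorter: one carriage return per line has been removed.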
  • 63. 63 of 81 awk can extract and rearrange … specific fields and do calculations and manipulations on it. awk -F delim '{ print $x }' ● -F delim: the field separator (default is white space) ● $x the field number: $0: the complete line $1: first field $2: second field … NF is the number of fields ($NF refers to the last field).
  • 64. 64 of 81 awk can extract and rearrange For example: TAIR9_mRNA.bed needs to be converted to .gff (general feature format). See the .gff format http://wiki.bits.vib.be/index.php/.gff With AWK this can easily be done! One line of .bed looks like: → needs to be one line of .gff chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3541,322,429, 0,
  • 65. 65 of 81 awk can extract and rearrange $ awk '{print $1"\tawk\tmRNA\t"$2"\t"$3"\t"$5"\t"$6"\t0\t"$4 }' TAIR9_mRNA.bed chr1 2025600 2027271 AT1G06620.1 0 + 2025617 2027094 0 3541,322,429, 0
  • 66. 66 of 81 Awk has also a filtering option Extraction of one or more fields from a tabular data stream of lines that match a given regex: awk -F delim '/regex/ { print $x }' Here is: ● Delim: the delimiter in the file ● regex: a regular expression The awk script is executed only if the line matches regex lines that do not match regex are removed from the stream Further excellent documentation on awk: http://www.grymoire.com/Unix/Awk.html
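Both awk uses from the last slides, rearranging columns and filtering on a regex, can be followed on a two-line BED file. The second entry is made up, and only the first six BED columns are included in this sketch. Setting OFS to a tab is an alternative to typing "\t" between every field; note also that BED start positions are 0-based while GFF is 1-based, which this simple rearrangement (like the one on the slide) does not correct for.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Columns: chromosome, start, end, name, score, strand.
printf 'chr1\t2025600\t2027271\tAT1G06620.1\t0\t+\n' > mini.bed
printf 'chr2\t100\t900\tAT2G00001.1\t0\t-\n' >> mini.bed   # made-up entry

# Rearrange into GFF order: seqid, source, type, start, end, score, strand, phase, attributes.
gff=$(awk 'BEGIN { FS = OFS = "\t" } { print $1, "awk", "mRNA", $2, $3, $5, $6, 0, $4 }' mini.bed)
echo "$gff"

# Filter: print the name column only for lines that start with chr1.
chr1_names=$(awk -F '\t' '/^chr1/ { print $4 }' mini.bed)
echo "$chr1_names"
```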
  • 67. 67 of 81 Cut selects columns Cut extracts fields from text files: ● Using a fixed delimiter (default is a tab) $ cut [-d delim] -f <fields> [file] ● chopping on fixed width $ cut -c <fields> [file] For <fields>: N the Nth element N-M the Nth till the Mth element N- from the Nth element on -M till the Mth element The first element is 1.
  • 68. 68 of 81 Cutting columns from text files Fixed width example: Suppose there is a file fixed.txt with content 12345ABCDE67890FGHIJ To extract a range of characters: $ cut -c 6-10 fixed.txt ABCDE
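The field notations N, N-M and N- can be compared directly. This sketch recreates fixed.txt from the slide in a scratch directory and adds two made-up delimited lines.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Fixed width: characters 6 to 10.
printf '12345ABCDE67890FGHIJ\n' > fixed.txt
by_width=$(cut -c 6-10 fixed.txt)

# Tab is the default delimiter: keep fields 1 and 3.
by_tab=$(printf 'chr1\t100\t200\n' | cut -f 1,3)

# Custom delimiter: from the second field to the end of the line.
by_colon=$(echo 'a:b:c' | cut -d ':' -f 2-)

echo "$by_width"
echo "$by_tab"
echo "$by_colon"
```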
  • 69. 69 of 81 Sorting output To sort alphabetically or numerically lines of text: $ sort [options] file(s) When more files are specified, they are read one by one, but all lines together are sorted.
  • 70. 70 of 81 Sorting options ● -n sort numerically ● -f fold – case-insensitive ● -r reverse sort order ● -t s use s as field separator (instead of white space) ● -k n sort on the n-th field (1 being the first field) Example: sort mRNA by chromosome and next by number of exons. Restrict each key to a single field with -kN,N, and make only the exon count numeric: $ sort -k1,1 -k10,10n TAIR9_mRNA.bed > out.bed
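Sorting on two keys is easier to follow on a two-column file. In this sketch the made-up file counts.txt holds a chromosome name and an exon count; in the full BED file the exon count sits in column 10, so the second key would be -k10,10n there.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'chr2\t3\nchr1\t10\nchr1\t2\n' > counts.txt

# First key: field 1 only, as text. Second key: field 2 only, as a number.
sorted=$(sort -k1,1 -k2,2n counts.txt)
echo "$sorted"

# Without the n modifier the second key is compared as text, so 10 sorts before 2.
as_text=$(sort -k1,1 -k2,2 counts.txt | head -1)
echo "$as_text"
```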
  • 71. 71 of 81 Detecting unique records with uniq ● eliminate duplicate lines in a set of files ● display unique lines ● display and count duplicate lines Very important: uniq always needs sorted input, because it only compares adjacent lines. Useful option: -c count the number of occurrences of each line.
  • 72. 72 of 81 Eliminate duplicates ● Example: $ who root tty1 Oct 16 23:20 james tty2 Oct 16 23:20 james pts/0 Oct 16 23:21 james pts/1 Oct 16 23:22 james pts/2 Oct 16 23:22 $ who | awk '{print $1}' | sort | uniq james root
  • 73. 73 of 81 Display unique or duplicate lines ● To display lines that occur only once: $ uniq -u file(s) ● To display lines that occur more than once: $ uniq -d file(s) Example: $ who|awk '{print $1}'|sort|uniq -d james ● To display the counts of the lines $ uniq -c file(s) Example $ who | awk '{print $1}' | sort | uniq -c 4 james 1 root
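The output of who depends on who is logged in, so this sketch replays the uniq examples on a fixed list of user names taken from the slide.

```shell
#!/bin/bash
# First column of the `who` output from the example: root once, james four times.
users=$(printf '%s\n' root james james james james)

unique=$(printf '%s\n' "$users" | sort | uniq)        # every name once
repeated=$(printf '%s\n' "$users" | sort | uniq -d)   # names occurring more than once
single=$(printf '%s\n' "$users" | sort | uniq -u)     # names occurring exactly once
counts=$(printf '%s\n' "$users" | sort | uniq -c)     # count in front of each name

echo "$unique"
echo "$repeated"
echo "$single"
echo "$counts"
```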
  • 74. 74 of 81 Edit per line with sed Sed (the stream editor) can make changes in text per line. It works on files or on STDIN. See http://www.grymoire.com/Unix/Sed.html This is also a very big tool, but we will only look at the substitute function (the most used one). $ sed -e 's/r1/s1/' file(s) s: the substitute command /: separator r1: regex to be replaced s1: text that will replace the regex match
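A short sketch of the substitute command on made-up input. It also shows the g flag, which is not on the slide: without it, only the first match on each line is replaced.

```shell
#!/bin/bash
# Only the first match on the line is replaced.
first_only=$(echo 'chr1 chr1' | sed -e 's/chr/Chr/')

# The g flag replaces every match on the line.
every_match=$(echo 'chr1 chr1' | sed -e 's/chr/Chr/g')

# A regex anchor: replace @ only at the start of the line.
anchored=$(echo '@read1@' | sed -e 's/^@/>/')

echo "$first_only"
echo "$every_match"
echo "$anchored"
```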
  • 75. 75 of 81 Paste several lines together Paste allows you to concatenate every n lines into one line, ideal for manipulating fastq files. We can use sed for this together with paste. $ paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > out.fa http://thegenomefactory.blogspot.be/2012/05/cool-use-of-unix-paste-with-ngs.html
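The fastq to fasta pipeline can be followed stage by stage on two made-up reads. This sketch assumes the simple four-lines-per-record fastq layout.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up reads, four lines each: header, sequence, separator, qualities.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > in.fq

# paste - - - -   : join every 4 lines into one tab-separated line
# cut -f 1,2      : keep header and sequence
# sed 's/^@/>/'   : turn the fastq header into a fasta header
# tr '\t' '\n'    : put header and sequence back on separate lines
paste - - - - < in.fq | cut -f 1,2 | sed 's/^@/>/' | tr '\t' '\n' > out.fa
cat out.fa
```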
  • 76. 76 of 81 Transpose is not a standard tool http://sourceforge.net/projects/transpose/ But it is extremely useful. It transposes tabular text files, exchanging columns for rows and vice versa.
  • 77. 77 of 81 Building your pipelines grep tr sed awk sort wc uniq uniq transpose paste
  • 78. 78 of 81 Fill in the tools Filter on lines ánd select different columns: ….................... Merge identical fields: …........... Filter lines: …............ Sort lines: …..... Select columns: …............ Replace characters: …..... Replace strings: …............ Transpose: ….....Numerical summary: ….....
  • 79. 79 of 81 Exercise → Text manipulation exercises
  • 80. 80 of 81 Keywords Environment variable PATH shebang script argument STDIN pipe comment Write in your own words what the terms mean