Bioinformatics: Reading files with Perl, and an introduction to regular expressions (BMR Genomics) - Lesson of 29 August 2014

Bioinformatics Training
Introduction to Perl Programming
Andrea Telatin

Today’s menu
• Homework: Shotgun simulation
• Reading a file (line by line): SAM format
• Pattern matching

Simulating a Shotgun
Homework review

Print a random substring
INPUT:
$sequenza = 'ACGATCGTACGTAGCTGACTGATCGTACGTAACGTACGTAGCTGACTGATCGA' .
            'TCGTAGCTAGCAGCTGATCGACTGACTGACTGATCGATCGTACGTAGCTAGCTG' .
            'ACTGATC';
COMPUTATION:
$lunghezza = length($sequenza);                         # sequence length
$lunghezza_read = 12;                                   # read length
$posizione = int(rand($lunghezza - $lunghezza_read));   # random start position
$seq = substr($sequenza, $posizione, $lunghezza_read);
OUTPUT:
print ">Random\n$seq\n";

Finished example
$seq = 'ACACTAGGTACGTAGCTGATCGTACGTAGCTGACG…';
$readnum = 100;
$readlen = 40;

for ($i = 1; $i <= $readnum; $i++) {
    $pos = int(rand(length($seq) - $readlen));
    $readseq = substr($seq, $pos, $readlen);
    print ">READ_$i pos=$pos\n$readseq\n";
}

Finished example: “hard coded” input
In the script above $seq, $readnum and $readlen are “hard coded”: changing them means editing the program.
http://codepad.org/2OinaEX1

Taking arguments
from the shell

We already met @ARGV
print STDERR "Instructions…\n";

($seq, $readlen, $readnum) = @ARGV;    # user supplied input
if ($readnum == 0) {
    die "Please specify all the params.\n";
}

for ($i = 1; $i <= $readnum; $i++) {
    # same loop body as in the finished example
}
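
The two versions of the homework (hard-coded and user-supplied input) can be combined as in the sketch below. The helper name simulate_reads, the usage message and the demo values are assumptions made for illustration, not part of the slides. The "+ 1" inside rand is also an addition, so that the last valid start position can be drawn.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the slides): returns a list of FASTA-formatted reads
sub simulate_reads {
    my ($seq, $readlen, $readnum) = @_;
    die "Read length must be between 1 and the sequence length.\n"
        if $readlen < 1 || $readlen > length($seq);
    my @reads;
    for my $i (1 .. $readnum) {
        # + 1 so that the last valid start position can be drawn too
        my $pos = int(rand(length($seq) - $readlen + 1));
        push @reads, ">READ_$i pos=$pos\n" . substr($seq, $pos, $readlen) . "\n";
    }
    return @reads;
}

# Use the command line arguments when given, otherwise fall back to demo values
my ($seq, $readlen, $readnum) = @ARGV;
if (!defined $readnum) {
    print STDERR "Usage: shotgun.pl SEQUENCE READ_LENGTH READ_NUMBER (using demo values)\n";
    ($seq, $readlen, $readnum) = ('ACGATCGTACGTAGCTGACTGATCGTACGTAACG', 12, 3);
}
print simulate_reads($seq, $readlen, $readnum);
```

Checking with defined instead of == 0 avoids a warning when the third argument is missing.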

Reading a text file
• The program needs to know the file name, which in
Bash jargon means its path (either relative or
absolute)
• In Perl we assign a nickname to the opened file,
called a file handle
• We use the open(FH, filename) function
• The function returns “true” on success

Reading a File
if (open(FILE, 'path/to/filename') == 0) {
    die "Can't read filename.\n";
}
After opening it, we can read it line by line with a while
loop:
while ($currentline = <FILE>) {
    print "$currentline";
}
Warning: $currentline keeps its newline character; if it is not
needed, remember to remove it with chomp($currentline).
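
The same loop can also be written with the three-argument form of open and a lexical file handle, a style often recommended in current Perl. This is a minimal sketch: the temporary file is created only to make the example self-contained, and its content is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demo file, created only so that the example is self-contained
my ($out, $filename) = tempfile(UNLINK => 1);
print $out "first line\nsecond line\n";
close($out);

# Three-argument open with a lexical file handle; $! holds the system error message
open(my $fh, '<', $filename) or die "Can't read $filename: $!\n";
my @lines;
while (my $currentline = <$fh>) {
    chomp($currentline);            # remove the trailing newline
    push @lines, $currentline;
}
close($fh);

print scalar(@lines), " lines read\n";
```

Here "open(...) or die" plays the role of the "== 0" test: die runs only when open returns false.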

Reading a File: the typical “read file” loop
• In open(FILE, 'path/to/filename'), FILE is the file handle and 'path/to/filename' is the file name
• The while loop reading from <FILE> is the typical “read file” loop

A Perl clone: cat.pl
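($filename) = @ARGV;    # assumption: the file name is the first argument (not shown on the slide)
if (open(I, "$filename") == 0) {
    die "Can't read \"$filename\".\n";
}

while ($line = <I>) {
    chomp($line);
    $c++;
    print "$line\n";
}
print STDERR "I read $c lines.\n";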

A Perl clone: cat.pl
With built-in “wc” too: the $c counter reports on STDERR how many lines were read.

Parsing a file 
SAM as a common example

SAM format review
• Header: lines starting with @
• Alignments: one per line, in tabular (tab-separated) format
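
As a reminder of the tabular layout, the sketch below splits one alignment line into the 11 mandatory SAM columns. The helper name parse_sam_line and the sample alignment are made up for illustration.

```perl
use strict;
use warnings;

# The 11 mandatory SAM columns, in order
my @SAM_COLUMNS = qw(QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL);

# Hypothetical helper: returns a hash reference for an alignment line,
# or undef for a header line (starting with @)
sub parse_sam_line {
    my ($line) = @_;
    chomp($line);
    return undef if substr($line, 0, 1) eq '@';
    my @fields = split(/\t/, $line);
    my %aln;
    @aln{@SAM_COLUMNS} = @fields[0 .. 10];    # hash slice: column name => value
    return \%aln;
}

# Made-up alignment line, for illustration only
my $aln = parse_sam_line("read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n");
print "$aln->{QNAME} maps to $aln->{RNAME} at position $aln->{POS}\n";
```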

Two simple examples
• SAM to FASTQ (extracts reads from SAM file)
• Subsample SAM

SAM to FASTQ
1. Print user manual 
2. Get arguments from command line via @ARGV 
3. Open the SAM file, we’ll use “SAM” as file handle
 
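while ($line = <SAM>) {
    chomp($line);
    next if (substr($line, 0, 1) eq '@');    # skip header lines
    @sam_fields = split(/\t/, $line);
    print "\@$sam_fields[0]\n";              # read name (the @ must be escaped)
    print "$sam_fields[9]\n+\n";             # sequence and separator line
    print "$sam_fields[10]\n";               # base qualities
}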
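
The three numbered steps and the loop can be put together as in the following sketch. The helper name sam_to_fastq, the script name in the usage message and the lexical file handle (instead of the SAM bareword) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reads SAM from a file handle and returns FASTQ text
sub sam_to_fastq {
    my ($sam) = @_;
    my $fastq = '';
    while (my $line = <$sam>) {
        chomp($line);
        next if substr($line, 0, 1) eq '@';    # header lines carry no reads
        my @sam_fields = split(/\t/, $line);
        $fastq .= "\@$sam_fields[0]\n$sam_fields[9]\n+\n$sam_fields[10]\n";
    }
    return $fastq;
}

# 1. user manual, 2. arguments from @ARGV, 3. open the SAM file
my ($sam_file) = @ARGV;
if (defined $sam_file) {
    open(my $sam, '<', $sam_file) or die "Can't read $sam_file: $!\n";
    print sam_to_fastq($sam);
} else {
    print STDERR "Usage: sam2fastq.pl FILE.sam > FILE.fastq\n";
}
```

Note that SAM stores reads aligned to the reverse strand as their reverse complement, so this simple version does not always return the reads exactly as they were sequenced.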

Subsample SAM
In the simplest implementation we just want to print
one line out of every N, e.g. 1 out of 10 (= 10%).

Subsample SAM
 
($sam_file, $denom) = @ARGV;

if (open(SAM, "$sam_file") == 0) {
    die "Unable to read SAM file: \"$sam_file\".\n";
}

while ($line = <SAM>) {
    $c++;
    if ($c % $denom == 0) {
        print $line;
    }
}

Subsample SAM
Is there something to fix in this script?
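
One possible answer, offered as a sketch since the slides leave the question open: header lines are counted and subsampled like alignments, and a missing or zero denominator makes the modulus fail. The helper name subsample_sam and the sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps every header line and one alignment out of $denom
sub subsample_sam {
    my ($sam, $denom) = @_;
    die "The denominator must be a positive integer.\n"
        unless defined $denom && $denom =~ /^[1-9]\d*$/;
    my $c = 0;
    my $kept = '';
    while (my $line = <$sam>) {
        if (substr($line, 0, 1) eq '@') {
            $kept .= $line;              # headers are always kept, never counted
            next;
        }
        $c++;
        $kept .= $line if $c % $denom == 0;
    }
    return $kept;
}

# Made-up SAM content read through an in-memory file handle
my $text = "\@HD\tVN:1.0\n"
         . join('', map { sprintf("read%d\t0\tchr1\t%d\n", $_, $_) } 1 .. 6);
open(my $sam, '<', \$text) or die "Can't open in-memory file: $!\n";
print subsample_sam($sam, 3);
```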

Another example
Adding a little bit of spice

Alignments per chromosome
• We basically want to count how many
alignments were found per reference sequence
• We might be interested, also, in normalising the
number of alignments by chromosome length

COFFEE BREAK
to let you think about the program :)

Let’s do it together
1. We need to store the chromosome lengths.
   That information is in the header.
   How can we store it?
2. We need a counter for each chromosome.
   We don’t know in advance how many there are.
   Any idea for this?

Global structure
while ($line = <SAM>) {
    chomp($line);
    if (substr($line, 0, 1) eq '@') {
        # header parsing
    } else {
        # alignments parsing
    }
}

Header parsing
($field, $content) = split(/\t/, $line, 2);    # limit 2: $content keeps all the remaining tags
if ($field eq '@SQ') {
    ($seqname, $len) = split(/\t/, $content);
    $seqname = substr($seqname, 3);            # drop the "SN:" prefix
    # … (elided on the slide)
    $size{$seqname} = $len;
}
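
Putting the global structure and the header parsing together, a hash answers both questions: one hash stores the lengths, another one the counters. The sketch below is one possible solution, not necessarily the one built in class. The helper name count_alignments, the sample data and the "alignments per kb" normalisation are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\%size, \%count) from a SAM file handle
sub count_alignments {
    my ($sam) = @_;
    my (%size, %count);
    while (my $line = <$sam>) {
        chomp($line);
        next if $line eq '';
        my @fields = split(/\t/, $line);
        if (substr($line, 0, 1) eq '@') {
            next unless $fields[0] eq '@SQ';
            # Tags look like SN:chr1 or LN:1000; look them up by name
            my %tag = map { my ($k, $v) = split(/:/, $_, 2); ($k => $v) }
                      @fields[1 .. $#fields];
            $size{ $tag{SN} } = $tag{LN};
        } else {
            next if $fields[2] eq '*';       # unmapped read, no reference name
            $count{ $fields[2] }++;          # third column: reference name
        }
    }
    return (\%size, \%count);
}

# Made-up SAM content (truncated alignment lines), for illustration only
my $text = join("\n",
    "\@SQ\tSN:chr1\tLN:1000",
    "\@SQ\tSN:chr2\tLN:500",
    "r1\t0\tchr1\t10", "r2\t0\tchr1\t20", "r3\t0\tchr2\t5") . "\n";
open(my $sam, '<', \$text) or die "Can't open in-memory file: $!\n";
my ($size, $count) = count_alignments($sam);

for my $chr (sort keys %$size) {
    my $n = $count->{$chr} || 0;
    printf "%s\t%d alignments\t%.2f per kb\n", $chr, $n, 1000 * $n / $size->{$chr};
}
```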

Pattern matching
Brief introduction

What is it?
• The key feature making Perl so powerful in
processing text (i.e. parsing)
• A mini language inside the language
• Something like wildcards in Bash (but much more
powerful)
• Used to find data and substitute it

The match operator
• Patterns are usually delimited by /
• Returns true if the text contains the pattern
• Example: 
if ($dna =~ /GGATCC/) {
    print "BamHI sensitive sequence\n";
}

The substitution operator
• Syntax is s/PATTERN/SUBSTITUTION/
• Returns true if the text contains the pattern
• The modifier “g” at the end substitutes all the
occurrences
• Example: 
$dna=~s/GGATCC/-cut-/g;
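
A small sketch of the substitution operator at work. The sequence is made up and contains two BamHI sites.

```perl
use strict;
use warnings;

my $dna = 'AAGGATCCTTGGATCCAA';     # made-up sequence with two BamHI sites

# s///g returns the number of substitutions made (false when there were none)
my $cuts = ($dna =~ s/GGATCC/-cut-/g);

print "$cuts sites replaced: $dna\n";
```

Without the "g" modifier only the first site would be replaced.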

Before starting…
• http://regex101.com and many other
online tools can do that too…
• Let’s create our own regex tester :)
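
A minimal sketch of such a tester, assuming the pattern and the text are passed as two strings. The helper name test_regex and its messages are made up, not the version written in class.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether $text matches $pattern, and where
sub test_regex {
    my ($pattern, $text) = @_;
    my $re = eval { qr/$pattern/ };      # an invalid pattern makes qr// die
    return "Invalid pattern: $pattern" unless defined $re;
    if ($text =~ $re) {
        # $-[0] and $+[0] are the start and end offsets of the match
        return sprintf("MATCH at offset %d: '%s'",
                       $-[0], substr($text, $-[0], $+[0] - $-[0]));
    }
    return "NO MATCH";
}

print test_regex('GGATCC', 'AAGGATCCTT'), "\n";
print test_regex('^GG',    'AAGGATCCTT'), "\n";
```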

Metacharacters
• ^ Requires the pattern to be found at the beginning
• $ Requires the pattern to be found at the end
• . Matches any character (except for \n)
• \s Matches a whitespace (space, tab…)
• \S Matches a non-whitespace (everything else)

Metacharacters (2)
• \d Matches any digit (\D any non-digit)
• \w Matches any alphanum (\W …)
• . Matches any character (except for \n)
• \s Matches a whitespace (space, tab…)
• \S Matches a non-whitespace (everything else)
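
The metacharacters above can be tried on a short string, as in this sketch. The test line is made up.

```perl
use strict;
use warnings;

my $line = "chr1\t1500 reads";    # made-up text: name, tab, number, space, word

my $starts = ($line =~ /^chr/)   ? 1 : 0;    # ^ anchors at the beginning
my $ends   = ($line =~ /reads$/) ? 1 : 0;    # $ anchors at the end
my $dot    = ($line =~ /^c.r/)   ? 1 : 0;    # . stands for any character, here "h"
# three alphanumerics, a digit, a whitespace (the tab), four digits,
# a whitespace (the space), then a non-whitespace
my $shape  = ($line =~ /^\w\w\w\d\s\d\d\d\d\s\S/) ? 1 : 0;
my $no_ws  = ($line =~ /^\S+$/)  ? 1 : 0;    # 0: the line does contain whitespace

print "starts=$starts ends=$ends dot=$dot shape=$shape no_ws=$no_ws\n";
```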

Quantifiers
• Added at the end of an entity
• ? zero or one
• * zero or more
• + at least one (one or more)
• {n} exactly n times ({n,} n or more)
• {n,m} between n and m
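
A quick way to see the quantifiers at work is to test a few patterns against whole strings. In this sketch the helper name full_match and the test strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole text matches the pattern, 0 otherwise
sub full_match {
    my ($pattern, $text) = @_;
    return ($text =~ /^(?:$pattern)$/) ? 1 : 0;
}

my @cases = (
    ['CA?T',     'CT'],        # ? : zero or one A
    ['CA*T',     'CAAAT'],     # * : zero or more A
    ['CA+T',     'CT'],        # + : needs at least one A, so no match here
    ['CA{3}T',   'CAAAT'],     # {3} : exactly three A
    ['CA{2,}T',  'CAAAAT'],    # {2,} : two or more A
    ['CA{2,3}T', 'CAAAAT'],    # {2,3} : four A are too many, so no match here
);
for my $case (@cases) {
    my ($pattern, $text) = @$case;
    printf "%-10s %-8s %s\n", $pattern, $text,
           full_match($pattern, $text) ? 'match' : 'no match';
}
```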

Character classes
• A list of characters (and metacharacters) enclosed in [ ]
In this context a leading ^ means “not”
• [aeiou] any vowel
• [A-Z] any uppercase letter
• [^a-z] any character that is not a lowercase letter
• [ACGTacgt] any DNA letter
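
Character classes are handy both for validating a sequence and for counting characters. In this sketch the helper name is_dna and the test strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string is made only of DNA letters
sub is_dna {
    my ($seq) = @_;
    return ($seq =~ /^[ACGTacgt]+$/) ? 1 : 0;
}

my $text = 'Perl Programming';
# Assigning a /g match to an empty list and then to a scalar counts the matches
my $vowels    = () = $text =~ /[aeiou]/g;    # lowercase vowels
my $upper     = () = $text =~ /[A-Z]/g;      # uppercase letters
my $not_lower = () = $text =~ /[^a-z]/g;     # anything but a lowercase letter

print "vowels=$vowels upper=$upper not_lower=$not_lower\n";
print "ACGTacgt: ", is_dna('ACGTacgt'), "  ACGU: ", is_dna('ACGU'), "\n";
```

The space in 'Perl Programming' counts as “not lowercase” too, because [^a-z] matches any character outside the a-z range.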

Alternatives
• In this context (too) the pipe character means “or”
• /TTC|GTA/ matches either TTC or GTA
• Grouped by parentheses: 
/mi chiamo (Andrea|Paolo)/
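
Why the parentheses matter can be seen in this sketch: without them the alternation splits the whole pattern in two. The test strings are made up ("mi chiamo" is Italian for "my name is").

```perl
use strict;
use warnings;

# Either of the two codons is enough for a match
my $either    = ('AAGTACC' =~ /TTC|GTA/) ? 1 : 0;

# Without parentheses the alternatives are "mi chiamo Andrea" and "Paolo"
my $ungrouped = ('Paolo' =~ /mi chiamo Andrea|Paolo/) ? 1 : 0;

# With parentheses "mi chiamo " is required before either name
my $grouped   = ('Paolo' =~ /mi chiamo (Andrea|Paolo)/) ? 1 : 0;
my $sentence  = ('mi chiamo Paolo' =~ /mi chiamo (Andrea|Paolo)/) ? 1 : 0;

print "either=$either ungrouped=$ungrouped grouped=$grouped sentence=$sentence\n";
```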

Capturing portions
• The parentheses are used to capture a portion of
pattern
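
After a successful match, the captured portions are available in $1, $2 and so on. As a sketch, the @SQ header line seen earlier can be parsed with a single pattern. The line is made up, and the pattern assumes that the SN tag comes right before the LN tag.

```perl
use strict;
use warnings;

my $header = "\@SQ\tSN:chr1\tLN:1000";    # made-up SAM header line

my ($name, $len);
# First parentheses: the sequence name; second parentheses: its length
if ($header =~ /^\@SQ\tSN:(\S+)\tLN:(\d+)/) {
    ($name, $len) = ($1, $2);
    print "$name is $len bp long\n";
}
```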
