Bioinformatics Training
Introduction to Perl Programming
Andrea Telatin
Today’s menu
• Homework: Shotgun simulation
• Reading a file (line by line): SAM format
• Pattern matching
Simulating a Shotgun
Homework review
Print a random substring
# INPUT (sequenza = sequence, lunghezza = length, posizione = position)
$sequenza = 'ACGATCGTACGTAGCTGACTGATCGTACGTAACGTACGTAGCTGACTGATCGA' .
            'TCGTAGCTAGCAGCTGATCGACTGACTGACTGATCGATCGTACGTAGCTAGCTG' .
            'ACTGATC';
$lunghezza = length($sequenza);
$lunghezza_read = 12;

# COMPUTATION: pick a random start position, then cut out the read
# (+ 1 so that the last possible start position can also be drawn)
$posizione = int(rand($lunghezza - $lunghezza_read + 1));
$seq = substr($sequenza, $posizione, $lunghezza_read);

# OUTPUT: a FASTA record
print ">Random\n$seq\n";
Finished example
# "HARD CODED" INPUT
$seq = 'ACACTAGGTACGTAGCTGATCGTACGTAGCTGACG…';
$readnum = 100;
$readlen = 40;

for ($i = 1; $i <= $readnum; $i++) {
    # + 1 so that the last possible start position can also be drawn
    $pos = int(rand(length($seq) - $readlen + 1));
    $readseq = substr($seq, $pos, $readlen);
    print ">READ_$i pos=$pos\n$readseq\n";
}
http://codepad.org/2OinaEX1
Taking arguments from the shell
We already met @ARGV
print STDERR "Instructions…\n";

# USER SUPPLIED INPUT
($seq, $readlen, $readnum) = @ARGV;

if ($readnum == 0) {
    die "Please specify all the params.\n";
}

for ($i = 1; $i <= $readnum; $i++) {
    $pos = int(rand(length($seq) - $readlen + 1));
    $readseq = substr($seq, $pos, $readlen);
    print ">READ_$i pos=$pos\n$readseq\n";
}
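A slightly more defensive take on the same script, as a sketch only. The helper name `simulate_reads`, the script name `shotgun.pl` and the specific checks are assumptions for illustration, not part of the original homework. It validates the three arguments before drawing reads and returns the FASTA records as a list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of FASTA records drawn at random from $seq.
sub simulate_reads {
    my ($seq, $readlen, $readnum) = @_;

    # Basic sanity checks on the user-supplied values
    die "Please specify all the params.\n"
        unless defined $seq && defined $readlen && defined $readnum;
    die "Read length and read number must be positive integers.\n"
        unless $readlen =~ /^\d+$/ && $readnum =~ /^\d+$/
            && $readlen > 0 && $readnum > 0;
    die "Read length exceeds sequence length.\n"
        if $readlen > length($seq);

    my @records;
    for my $i (1 .. $readnum) {
        # Valid start positions are 0 .. length - readlen (inclusive)
        my $pos     = int(rand(length($seq) - $readlen + 1));
        my $readseq = substr($seq, $pos, $readlen);
        push @records, ">READ_$i pos=$pos\n$readseq\n";
    }
    return @records;
}

# When run from the shell, e.g.: perl shotgun.pl ACGATCGTACGTAGCTGACT 12 5
print simulate_reads(@ARGV) if @ARGV;
```

Returning the records instead of printing them inside the loop keeps the random sampling separate from the output, which makes the function easy to check on a short test sequence.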
Read a Text File
Reading a text file
• Requires the program to know the file name, which in Bash jargon means its path (either relative or absolute)
• In Perl we assign a nickname to the opened file, called a file handle.
• We use the open(FH, filename) function
• The function returns “true” on success.
Reading a File
# open(FILE HANDLE, FILE NAME)
if (open(FILE, 'path/to/filename') == 0) {
    die "Can't read filename.\n";
}
After opening it, we can read it line by line with a while loop (the typical “read file” loop):
while ($currentline = <FILE>) {
    print "$currentline";
}
Warning: $currentline keeps its trailing newline character; if it is not needed, remember to remove it with chomp($currentline).
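For comparison, a sketch of the same read loop in the three-argument `open` style with a lexical file handle, which is common in current Perl code. The helper name `read_lines` and the in-memory demo text are assumptions for illustration; `$!` holds the operating system's error message when `open` fails.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all lines of a file, without trailing newlines.
sub read_lines {
    my ($path) = @_;

    # '<' opens for reading; $! explains why the open failed, if it did
    open(my $fh, '<', $path) or die "Can't read \"$path\": $!\n";

    my @lines;
    while (my $currentline = <$fh>) {
        chomp($currentline);          # remove the trailing newline
        push @lines, $currentline;
    }
    close($fh);
    return @lines;
}

# Demo on an in-memory "file": a reference to a string can be opened like a path
my $text  = "first line\nsecond line\n";
my @lines = read_lines(\$text);
print scalar(@lines), " lines read\n";
```

In a real script `$path` would be a file name taken from `@ARGV`; the string reference is used here only so the example runs without any file on disk.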
A Perl clone: cat.pl (with built-in “wc” too)
if (open(I, "$filename") == 0) {
    die "Can't read \"$filename\".\n";
}

while ($line = <I>) {
    chomp($line);
    $c++;                 # count the lines, as wc would
    print "$line\n";
}
print STDERR "I read $c lines.\n";
Parsing a file

SAM as a common example
SAM format review
• Header: lines start with @
• Alignments in tabular format (tab-separated fields, one alignment per line)
Two simple examples
• SAM to FASTQ (extracts reads from SAM file)
• Subsample SAM
SAM to FASTQ
1. Print user manual
2. Get arguments from command line via @ARGV
3. Open the SAM file, we’ll use “SAM” as file handle

while ($line = <SAM>) {
    chomp($line);
    @sam_fields = split(/\t/, $line);
    print "\@$sam_fields[0]\n";        # read name; the @ must be escaped inside "..."
    print "$sam_fields[9]\n+\n";       # sequence, then the + separator line
    print "$sam_fields[10]\n";         # base qualities
}
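The three numbered steps and the loop can be put together as follows. This is a sketch under assumptions: the helper name `sam_to_fastq`, the script name `sam2fastq.pl` and the usage text are invented for illustration, and header lines (starting with @) are skipped so that they are not turned into bogus reads.

```perl
use strict;
use warnings;

# Hypothetical helper: converts SAM text read from $in into FASTQ written to $out.
# Returns the number of alignment lines converted.
sub sam_to_fastq {
    my ($in, $out) = @_;
    my $n = 0;
    while (my $line = <$in>) {
        chomp($line);
        next if substr($line, 0, 1) eq '@';      # skip header lines
        my @sam_fields = split(/\t/, $line);
        next if @sam_fields < 11;                # not a complete alignment line
        print {$out} "\@$sam_fields[0]\n";       # read name
        print {$out} "$sam_fields[9]\n+\n";      # sequence and separator
        print {$out} "$sam_fields[10]\n";        # base qualities
        $n++;
    }
    return $n;
}

if (@ARGV) {
    # 2. Get arguments from the command line, 3. open the SAM file
    my ($sam_file) = @ARGV;
    open(my $sam, '<', $sam_file)
        or die "Unable to read SAM file \"$sam_file\": $!\n";
    my $n = sam_to_fastq($sam, \*STDOUT);
    print STDERR "Converted $n alignments.\n";
} else {
    # 1. Print user manual
    print STDERR "Usage: perl sam2fastq.pl alignments.sam > reads.fastq\n";
}
```

Note that this simple version prints the sequence exactly as stored in the SAM file: reads aligned to the reverse strand are stored reverse-complemented, so the output is not guaranteed to match the original FASTQ.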
Subsample SAM
In the simplest implementation we just want to print one line out of every group of lines, e.g. 1 out of 10 (= 10%).
Subsample SAM
1. Print user manual

($sam_file, $denom) = @ARGV;

if (open(SAM, "$sam_file") == 0) {
    die "Unable to read SAM file: \"$sam_file\".\n";
}

while ($line = <SAM>) {
    $c++;
    if ($c % $denom == 0) {
        print $line;
    }
}
Is there something to fix here?
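One possible answer to that question, offered as a sketch rather than the author's intended solution: the loop counts and subsamples header lines as if they were alignments, and a missing or zero denominator makes the modulus operation fail. The helper name `subsample_sam` is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps every header line and 1 alignment out of $denom.
# Returns the number of alignment lines seen.
sub subsample_sam {
    my ($in, $out, $denom) = @_;
    die "The denominator must be a positive integer.\n"
        unless defined $denom && $denom =~ /^\d+$/ && $denom > 0;

    my $c = 0;
    while (my $line = <$in>) {
        if (substr($line, 0, 1) eq '@') {
            print {$out} $line;          # headers are always kept
            next;
        }
        $c++;                            # count alignments only
        print {$out} $line if $c % $denom == 0;
    }
    return $c;
}

if (@ARGV) {
    my ($sam_file, $denom) = @ARGV;
    open(my $sam, '<', $sam_file)
        or die "Unable to read SAM file \"$sam_file\": $!\n";
    subsample_sam($sam, \*STDOUT, $denom);
}
```

Keeping the header intact matters because downstream tools generally expect the reference sequences to be declared before the alignments.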
Another example
Adding a little bit of spice
Alignments per chromosome
• We basically want to count how many
alignments were found per reference sequence
• We might be interested, also, in normalising the
number of alignments by chromosome length
COFFEE BREAK

to let you think about the program :)
Let’s do it together
1. We need to store the chromosome length.
   That information is in the header.
   How can we store it?
2. We need a counter for each chromosome.
   We don’t know in advance how many there are.
   Any idea for this?
Let’s do it together
Global structure
while ($line = <SAM>) {
    chomp($line);
    if (substr($line, 0, 1) eq '@') {
        # header parsing
    } else {
        # alignments parsing
    }
}
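Let’s do it together
Header parsing
# An @SQ line typically looks like: @SQ <tab> SN:name <tab> LN:length
($field, $content) = split(/\t/, $line, 2);   # limit 2: $content keeps the rest of the line
if ($field eq '@SQ') {
    ($seqname, $len) = split(/\t/, $content);
    $seqname = substr($seqname, 3);           # drop the 'SN:' prefix (3 characters)
    …
    $size{$seqname} = $len;
}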
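Both questions (how to store the lengths, and how to keep one counter per chromosome without knowing the names in advance) can be answered with hashes. The following is a sketch, not the solution developed in class: the helper name `count_alignments`, the toy SAM text and the "per kb" normalisation are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two hash references,
# reference name => length and reference name => alignment count.
sub count_alignments {
    my ($in) = @_;
    my (%size, %count);

    while (my $line = <$in>) {
        chomp($line);
        if (substr($line, 0, 1) eq '@') {
            # Header parsing: only @SQ lines describe reference sequences
            my ($field, @tags) = split(/\t/, $line);
            next unless $field eq '@SQ';
            my ($seqname, $len);
            for my $tag (@tags) {
                $seqname = substr($tag, 3) if substr($tag, 0, 3) eq 'SN:';
                $len     = substr($tag, 3) if substr($tag, 0, 3) eq 'LN:';
            }
            $size{$seqname} = $len if defined $seqname && defined $len;
        } else {
            # Alignment parsing: the third field (index 2) is the reference name
            my @sam_fields = split(/\t/, $line);
            next if @sam_fields < 3 || $sam_fields[2] eq '*';   # '*' = unmapped
            $count{ $sam_fields[2] }++;
        }
    }
    return (\%size, \%count);
}

# Demo on a tiny, made-up in-memory SAM file
my $sam = join("\n",
    "\@SQ\tSN:chr1\tLN:1000",
    "\@SQ\tSN:chr2\tLN:500",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t20\t60\t4M\t*\t0\t0\tACGT\tIIII",
) . "\n";
open(my $in, '<', \$sam) or die "Can't open in-memory SAM: $!\n";
my ($size, $count) = count_alignments($in);

for my $chr (sort keys %$size) {
    my $n = $count->{$chr} || 0;
    printf "%s\t%d alignments\t%.2f per kb\n", $chr, $n, 1000 * $n / $size->{$chr};
}
```

A hash creates its keys on first use, so `$count{...}++` works for any number of chromosomes without declaring them beforehand.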
Pattern matching
Brief introduction
What is it?
• The key feature making Perl so powerful in processing text (i.e. parsing)
• A mini language inside the language
• Something like wildcards in Bash, but much more powerful
• Used to find data and substitute it
The match operator
• Patterns are usually delimited by /
• Returns true if the text contains the pattern
• Example:
if ($dna =~ /GGATCC/) {
    print "BamHI sensitive sequence\n";
}
The substitution operator
• Syntax is s/PATTERN/SUBSTITUTION/
• Returns true (the number of substitutions made) if the text contains the pattern
• The modifier “g” at the end substitutes all the
occurrences
• Example:

$dna=~s/GGATCC/-cut-/g;

Before starting…
• http://regex101.com and many other online tools can do that too…
• Let’s create our own regex tester :)
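A minimal sketch of what such a regex tester could look like. The helper name `test_regex`, the script name `regextest.pl` and the report format are assumptions for illustration; the tester built in class may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: reports whether $pattern matches $text and what was captured.
sub test_regex {
    my ($pattern, $text) = @_;

    # qr// compiles the pattern; eval traps syntax errors in a user-supplied regex
    my $re = eval { qr/$pattern/ };
    return "Invalid pattern: $pattern" unless defined $re;

    if (my @captures = $text =~ $re) {
        # @- and @+ hold the start and end offsets of the last successful match
        my $matched = substr($text, $-[0], $+[0] - $-[0]);
        my $report  = "MATCH: '$matched' at position $-[0]";
        # $#+ is the number of capture groups; without groups the match returns (1)
        if ($#+ > 0) {
            $report .= " captures: "
                     . join(', ', map { defined $_ ? $_ : '' } @captures);
        }
        return $report;
    }
    return "NO MATCH";
}

# From the shell, e.g.: perl regextest.pl 'GGATCC' 'ACGGATCCTT'
print test_regex(@ARGV), "\n" if @ARGV == 2;
```

Quoting the pattern with single quotes on the command line keeps the shell from interpreting characters such as $ or * before Perl sees them.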
Metacharacters
• ^ Requires the pattern to be found at the beginning
• $ Requires the pattern to be found at the end
• . Matches any character (except for \n)
• \s Matches a whitespace (space, tab…)
• \S Matches a non-whitespace (everything else)
Metacharacters (2)
• \d Matches any digit (\D any non-digit)
• \w Matches any alphanumeric character or underscore (\W …)
Quantifiers
• Added at the end of an entity
• ? zero or one
• * zero or more
• + at least one (one or more)
• {n} exactly n
• {n,} n or more
• {n,m} between n and m
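To see the quantifiers at work, the sketch below tries one pattern per quantifier (including the open-ended form {n,}) against a short DNA string and prints the first substring each one matches. The helper name `first_match` and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first (leftmost) substring of $text
# matched by $pattern, or undef when nothing matches.
sub first_match {
    my ($pattern, $text) = @_;
    return $text =~ /($pattern)/ ? $1 : undef;
}

my $dna = 'ACGTTTTTTGGATCCAAACG';

# One pattern per quantifier: ?, *, +, {n}, {n,} and {n,m}
for my $pattern ('GT?A', 'CA*C', 'T+', 'T{3}', 'A{2,}', 'T{2,4}') {
    my $hit = first_match($pattern, $dna);
    printf "%-8s %s\n", $pattern, defined $hit ? "matches '$hit'" : 'no match';
}
```

Quantifiers are greedy by default: among the possible matches starting at the leftmost position, the longest one allowed by the quantifier is taken, which is why T{2,4} reports four T's rather than two.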
Character classes
• A list of characters (and metacharacters) enclosed in [ ]
• At the start of the list, ^ means “not”
• [aeiou] any lowercase vowel
• [A-Z] any upper case letter
• [^a-z] any character that is not a lowercase letter
• [ACGTacgt] any DNA letter
Alternatives
• In this context too, the pipe character means “or”
• /TTC|GTA/ matches either TTC or GTA
• Grouped by parentheses:
/mi chiamo (Andrea|Paolo)/   (“mi chiamo” is Italian for “my name is”)
Capturing portions
• Parentheses are also used to capture the portion of the text matched by the enclosed part of the pattern
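As an illustration of capturing, the sketch below pulls the name and length out of a SAM @SQ header line with a single pattern. It assumes the SN: tag is immediately followed by the LN: tag; the helper name `parse_sq_line` and the example line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: extracts name and length from a SAM @SQ header line.
sub parse_sq_line {
    my ($line) = @_;
    # (\S+) captures the reference name into $1, (\d+) its length into $2;
    # the @ is escaped so Perl does not try to interpolate an array
    if ($line =~ /^\@SQ\tSN:(\S+)\tLN:(\d+)/) {
        return ($1, $2);
    }
    return;    # empty list: not an @SQ line in the assumed layout
}

my ($seqname, $len) = parse_sq_line("\@SQ\tSN:chr1\tLN:1000");
print "$seqname is $len bp long\n" if defined $seqname;
```

The captured values are available as $1, $2, … only after a successful match, so they should be read inside the `if` that tests the match.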

Bioinformatics: Reading files with Perl, and an introduction to regular expressions (BMR Genomics) - Lesson of 29 August 2014
