P3 2017 python_regexes
FBW
17-10-2017
Wim Van Criekinge
Recap
if condition:
    statements
[elif condition:
    statements] ...
else:
    statements
while condition:
    statements
for var in sequence:
    statements
break
continue
Strings
Lists
• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
  -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]
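The list operations on this slide can be replayed step by step. This is a minimal sketch; the name after_slice_assignment is introduced here only to hold the intermediate state and is not part of the slides.

```python
# Replay the slide's list operations and keep the intermediate state.
a = [99, "bottles of beer", ["on", "the", "wall"]]

a[0] = 98                              # item assignment
a[1:2] = ["bottles", "of", "beer"]     # slice assignment: one item replaced by three
after_slice_assignment = list(a)       # shallow copy of the list at this point

del a[-1]                              # drop the nested list at the end

print(after_slice_assignment)
print(a)
```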
Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend"}
• d["back"] = "rug" # {"duck": "eend", "back": "rug"}
• d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
Reverse Complement Revisited
REGULAR EXPRESSIONS
Regular Expressions
http://en.wikipedia.org/wiki/Regular_expression
In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.
Really clever "wild card" expressions for matching and parsing strings.
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of "marker characters" - programming with characters
• It is kind of an "old school" language - compact
Regular Expression Quick Guide
^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character (except a newline, by default)
\s        Matches whitespace
\S        Matches any non-whitespace character
*         Repeats a character zero or more times
*?        Repeats a character zero or more times (non-greedy)
+         Repeats a character one or more times
+?        Repeats a character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
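A few entries from the quick guide, tried out on small strings. This is only an illustrative sketch: the sample strings and variable names are made up for the example and do not come from the slides.

```python
import re

text = "From: alice 42 apples"

starts_with_from = re.search(r"^From", text) is not None   # ^ anchors at the start
ends_with_s = re.search(r"s$", text) is not None           # $ anchors at the end
chunks = re.findall(r"\S+", text)                          # runs of non-whitespace
vowels = re.findall(r"[aeiou]", "regex")                   # one character from a set
not_digit_or_space = re.findall(r"[^0-9 ]+", "a1 bc22")    # negated set
lower_or_digit = re.findall(r"[a-z0-9]+", "Ab3 cd_e")      # ranges inside a set

print(starts_with_from, ends_with_s)
print(chunks, vowels, not_digit_or_space, lower_or_digit)
```

The patterns are written as raw strings (r"...") so that backslashes such as the one in \S reach the re module unchanged.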
The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using "import re"
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
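The comparison with find() and slicing can be made concrete. In this sketch the sample string is invented, chosen so that the hit happens to sit at var[5:10].

```python
import re

var = "gene BRCA1 on chromosome 17"

position = var.find("BRCA1")              # plain string search: index or -1
match = re.search(r"BRCA[0-9]", var)      # regex search: match object or None

print(position, match.start())            # both report where the hit begins
print(var[5:10])                          # find() + slicing, offsets worked out by hand
print(re.findall(r"BRCA[0-9]", var))      # findall() locates and extracts in one call
```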
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the preceding character may occur "any number of times" (zero or more)
^X.*:
^   Match the start of the line
.   Match any character
*   Many times
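The pattern ^X.*: can be tried on a handful of lines. The sample lines below are made up for illustration.

```python
import re

lines = [
    "X-Status: ok",                 # starts with X and contains a colon
    "Subject: X marks: the spot",   # has an X and colons, but does not start with X
    "X no colon here",              # starts with X, but there is no colon to end on
]

# Keep only the lines where the pattern finds a match.
matching = [line for line in lines if re.search(r"^X.*:", line)]
print(matching)
```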
Matching and Extracting Data
• re.search() returns a match object if the string matches the regular expression and None otherwise, so its result can be tested like True/False
• If we actually want the matching strings to be extracted, we use re.findall()
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
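The difference between the two calls shows up most clearly when nothing matches. A small sketch, reusing the string from the slide plus one invented string without digits:

```python
import re

x = 'My 2 favorite numbers are 19 and 42'
no_digits = 'no digits here'   # invented counter-example

first = re.search(r'[0-9]+', x)          # match object for the first hit only
nothing = re.search(r'[0-9]+', no_digits)

print(first.group())                     # text of the first match
print(nothing)                           # None when there is no match
print(re.findall(r'[0-9]+', no_digits))  # findall() gives an empty list instead

if re.search(r'[0-9]+', x):              # a match object counts as true in an if test
    print('x contains at least one number')
```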
Warning: Greedy Matching
• The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']
^F.+:
^F   First character in the match is an F
.+   One or more characters
:    Last character in the match is a :
Non-Greedy Matching
• Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
^F.+?:
^F   First character in the match is an F
.+?  One or more characters but not greedily
:    Last character in the match is a :
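Greedy and non-greedy matching are easiest to compare side by side. The sketch below reuses the string from the slides and adds one invented string with quoted words.

```python
import re

x = 'From: Using the : character'
greedy = re.findall(r'^F.+:', x)     # runs on to the LAST colon
lazy = re.findall(r'^F.+?:', x)      # stops at the FIRST colon
print(greedy, lazy)

quoted = 'say "hi" and "bye"'        # invented example
greedy_quotes = re.findall(r'".+"', quoted)    # one long match across both quoted words
lazy_quotes = re.findall(r'".+?"', quoted)     # each quoted word separately
print(greedy_quotes, lazy_quotes)
```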
Fine Tuning String Extraction
• Parentheses are not part of the match, but they tell where to start and stop what string to extract
>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['stephen.marquard@uct.ac.za']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['stephen.marquard@uct.ac.za']
^From (\S+@\S+)
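What findall() returns depends on how many pairs of parentheses the pattern contains. A short sketch on the same line; the third pattern, with two groups, is an addition for illustration.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

whole = re.findall(r'^From \S+@\S+', x)      # no parentheses: the whole match
address = re.findall(r'^From (\S+@\S+)', x)  # one group: only the text inside it
parts = re.findall(r'^From (\S+)@(\S+)', x)  # two groups: a tuple per match

print(whole)
print(address)
print(parts)
```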
The Double Split Version
• Sometimes we split a line one way, then grab one of the pieces of the line and split that piece again
line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
words = line.split()
email = words[1]            # 'stephen.marquard@uct.ac.za'
pieces = email.split('@')   # ['stephen.marquard', 'uct.ac.za']
print(pieces[1])            # uct.ac.za
The Regex Version
import re
lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)   # ['uct.ac.za']
'@([^ ]*)'
@      Look through the string until you find an at-sign
[^ ]   Match a non-blank character
*      Match many of them
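The regex version can be wrapped in a small helper so that lines without an at-sign are handled too. The function name extract_host and the second test line are hypothetical, introduced only for this sketch.

```python
import re

def extract_host(line):
    """Return the text after the first '@' up to the next space, or None."""
    found = re.findall(r'@([^ ]*)', line)
    return found[0] if found else None

print(extract_host('From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'))
print(extract_host('no address on this line'))
```

With the double split version, the second call would fail with an IndexError at pieces[1], because split('@') returns a single piece when there is no at-sign.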
Escape Character
• If you want a special regular expression character to just behave normally (most of the time), you prefix it with '\'
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> print(y)
['$10.00']
\$[0-9.]+
\$      A real dollar sign
[0-9.]  A digit or period
+       At least one or more
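Escaping matters for the dot as well as for the dollar sign. The sketch below also uses re.escape(), a standard-library helper not mentioned on the slides, which adds the backslashes for you; the string codes is invented for the example.

```python
import re

x = 'We just received $10.00 for cookies.'
amounts = re.findall(r'\$[0-9.]+', x)      # \$ is a literal dollar sign
print(amounts)

literal = re.escape('$10.00')              # escapes every special character
found = re.search(literal, x).group()
print(found)

codes = 'codes 10.00 and 10x00'            # invented example
loose = re.findall(r'10.00', codes)        # unescaped dot matches any character
strict = re.findall(r'10\.00', codes)      # escaped dot matches only a period
print(loose, strict)
```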
Real world problems
• Match IP addresses, email addresses, URLs
• Match balanced sets of parentheses
• Substitute words
• Tokenize
• Validate
• Count
• Delete duplicates
• Natural language processing
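Four of the tasks in this list (substitute, tokenize, validate, count) each take about one line with the re module. This is a rough sketch on invented strings, not a complete solution to any of them.

```python
import re

sentence = "the cat sat on the mat"

# Substitute words: \b marks a word boundary, so "cat" inside "category" would be left alone.
replaced = re.sub(r"\bcat\b", "dog", sentence)

# Tokenize: pull out runs of letters or runs of digits.
tokens = re.findall(r"[A-Za-z]+|[0-9]+", "ATG 42 stop!")

# Validate: fullmatch() succeeds only if the WHOLE string fits the pattern.
is_dna = re.fullmatch(r"[ACGT]+", "GATTACA") is not None
is_not_dna = re.fullmatch(r"[ACGT]+", "GATTXCA") is not None

# Count: the number of non-overlapping matches.
the_count = len(re.findall(r"the", sentence))

print(replaced, tokens, is_dna, is_not_dna, the_count)
```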
RE in Python
• Unleash the power - built-in re module
• Functions
  – to compile patterns
    • compile
  – to perform matches
    • match, search, findall, finditer
  – to perform operations on match object
    • group, start, end, span
  – to substitute
    • sub, subn
• Metacharacters
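The functions listed above can be seen together on one small input. A minimal sketch; the string "chr7:117559590" is an invented example.

```python
import re

pattern = re.compile(r"[0-9]+")        # compile once, reuse many times
text = "chr7:117559590"

at_start = pattern.match(text)         # match() only looks at the start of the string
m = pattern.search(text)               # search() scans forward for the first hit

print(at_start)                        # None, because the text starts with a letter
print(m.group(), m.start(), m.end(), m.span())
print(pattern.findall(text))           # all hits as strings
print(pattern.sub("N", text))          # replace every hit
print(pattern.subn("N", text))         # same, plus the number of replacements
```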
Examples 1
import re
pattern = re.compile(r"tes")
print(pattern.findall("test testing"))  # ['tes', 'tes']
Examples 2
import re
dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")
Examples 3
import re
scientific_name = "Homo sapiens"
m = re.search(r"(.+) (.+)", scientific_name)
if m:
    genus = m.group(1)
    species = m.group(2)
    print("genus is " + genus + ", species is " + species)
Examples 4
import re
regex = r"([a-zA-Z]+) \d+"
# finditer() returns an iterator that produces Match instances instead of the strings returned by findall()
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    print(match)
    print("Match at index:", match.group(0), match.group(1), match.start(), match.end())
Examples 5
import re
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))
Exercise 1
1. Which of the following 4 sequences (seq1/2/3/4)
a) contain a "Galactokinase signature"?
b) How many of them?
http://us.expasy.org/prosite/
>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS
LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK
SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVA
M MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI
ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE
VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK
HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM
ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES
SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN
ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK
ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLA
CISVDRY
LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP
QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGS
SSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG
GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG
FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK
ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ
IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR
FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI
GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF
DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC
VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI
NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG
KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE
GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA
SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE
KKGLA
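One possible way to approach this kind of exercise is to translate a PROSITE-style pattern into a Python regular expression and count the hits. The sketch below is not the solution: the function name prosite_to_regex, the toy pattern, and the toy sequence are all invented, and the real signature still has to be looked up on PROSITE. It handles only the basic syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition) and ignores terminus anchors.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                          # patterns end with a period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # 'x' stands for any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {..} lists excluded residues
    regex = regex.replace("(", "{").replace(")", "}")    # (n) or (n,m) is a repeat count
    return regex

toy_pattern = "C-x(2,4)-[HK]-{P}-G."     # made-up pattern, NOT the galactokinase signature
toy_sequence = "AACAA HLGTT"             # made-up sequence with a space, as in SEQ2 and SEQ4

regex = prosite_to_regex(toy_pattern)
cleaned = re.sub(r"\s+", "", toy_sequence)   # remove spaces and line breaks first
hits = re.findall(regex, cleaned)

print(regex)
print(hits, len(hits))
```

Removing whitespace before searching matters here, because several of the sequences above are printed in blocks of ten residues and a signature may straddle a gap.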
Exercise 1
http://www.pythonchallenge.com
