Regular Expressions
3
The Purpose of REX
• Regular expressions are the main way to match patterns
within strings or text. For example, finding pieces of text
within a larger document, or finding a rule within a code
sequence.
• There are 3 main methods that use REX:
1. matching, which returns TRUE if a match is found
and FALSE if no match is found
2. search or substitution, which substitutes one pattern
of characters for another within a string
3. split, which separates a string into a series of sub-strings
• REX are composed of chars, character classes, groups,
meta-characters, quantifiers, and assertions.
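The three operations can be tried out in a few lines. This is only a sketch in Python's re module (the slides themselves use C#, Perl and Pascal); the sample string and variable names are made up for illustration.

```python
import re

text = "gene1,gene2;gene3"  # hypothetical sample input

# 1. Matching: does the pattern occur anywhere in the string?
has_digit = re.search(r"\d", text) is not None

# 2. Substitution: replace every match with another string.
renamed = re.sub(r"gene", "g", text)

# 3. Split: cut the string wherever the pattern matches.
parts = re.split(r"[,;]", text)

print(has_digit)  # True
print(renamed)    # g1,g2;g3
print(parts)      # ['gene1', 'gene2', 'gene3']
```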
4
Imagine Literal Matching
5
Literal Matching 2
6
But don't end like this
7
Match Operator
• If you want to determine whether a string matches a
particular pattern, the basic syntax is:
Regex regex = new Regex(@"\d+");
Match match = regex.Match("Regex 55 Telex");
if (match.Success) {
    // match.Value is "55"
}
• Two important things here: "\d+" is the regex, and it means
“search the string for the pattern between the quotes” (\d+, one or
more digits, here). The example is C#; in the following we use standard Perl notation.
• Simple matching returns TRUE or FALSE.
8
Basic Quantifiers
• Quantifiers are placed after the character you want to match.
• * means 0 or more of the preceding character
• + means 1 or more
• ? means 0 or 1
• For example:
my $str = "AACCGG";
$str =~ /A+/; # matches AA
$str =~ /T+/; # no match
$str =~ /T*/; # matches: 0 T’s is allowed (empty match)
$str =~ /Q*/; # matches: 0 Q’s is allowed (empty match)
• Matching positive or negative 23:
$str =~ /-?23/; # i.e. 0 or 1 minus sign
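The same quantifier checks, sketched in Python's re module rather than Perl so they can be run directly. The variable names are illustrative only.

```python
import re

seq = "AACCGG"

a_plus = re.search(r"A+", seq)  # one or more A's
t_plus = re.search(r"T+", seq)  # needs at least one T
t_star = re.search(r"T*", seq)  # zero T's is acceptable

print(a_plus.group())        # AA
print(t_plus)                # None
print(repr(t_star.group()))  # '' (empty match at position 0)

# Optional minus sign: accepts "23" and "-23", but not "--23".
signed = [bool(re.fullmatch(r"-?23", s)) for s in ("23", "-23", "--23")]
print(signed)                # [True, True, False]
```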
9
More Quantifiers
• You can specify an exact number of repeats of a
character to match using curly braces:
$str = "doggggy";
$str =~ /dog{4}y/; # matches 4 g’s
$str =~ /dog{3}y/; # no match
$str =~ /dog{3}/; # match--no trailing “y” in the pattern.
• You can also specify a range by separating a minimum
and maximum by a comma within curly braces:
$str =~ /dog{1,5}y/; # matches 1, 2, 3, 4, or 5 g’s
• You can also specify a min. number of chars to match by
putting a comma after the minimum number:
$str =~ /dog{3,}/; # matches 3 or more g’s
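A quick way to see all the brace forms side by side, again as a Python sketch; the last pattern, dog{5,}, is an extra case added here for contrast.

```python
import re

word = "doggggy"  # "do", four g's, then "y"
patterns = [r"dog{4}y", r"dog{3}y", r"dog{3}", r"dog{1,5}y", r"dog{3,}", r"dog{5,}"]

# Search each pattern in the same word and record whether it matched.
results = {p: bool(re.search(p, word)) for p in patterns}
for pattern, matched in results.items():
    print(pattern, matched)
```

Only dog{3}y and dog{5,} fail: the first needs a "y" right after exactly three g's, the second needs at least five g's.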
10
Grouping with Parentheses
• If you want to match a certain number of repeats of a
group of characters, you can group the characters within
parentheses. For ex., /(cat){3}/ matches 3 reps of “cat” in
a row: “catcatcat”. However, /cat{3}/ matches “ca”
followed by 3 t’s: “cattt”.
• Parentheses also invoke the pattern-matching memory,
that is, capturing.
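The difference between /(cat){3}/ and /cat{3}/ can be checked with a small Python sketch (full-string matching is used here to make the contrast obvious):

```python
import re

grouped = re.compile(r"(cat){3}")  # the whole group "cat" repeats 3 times
ungrouped = re.compile(r"cat{3}")  # only the "t" repeats 3 times

grouped_hits = [bool(grouped.fullmatch(s)) for s in ("catcatcat", "cattt")]
ungrouped_hits = [bool(ungrouped.fullmatch(s)) for s in ("catcatcat", "cattt")]

print(grouped_hits)    # [True, False]
print(ungrouped_hits)  # [False, True]
```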
11
Basic Meta-characters
• Some characters mean something other than the literal character.
• For example, “+” means “1 or more of the preceding character”.
What if you want to match a literal plus sign? To do this, escape the
+ by putting a backslash in front of it: \+ will match a + sign, but
nothing else.
• To match a literal backslash, use 2 of them: \\
• Another important meta-character: “.” matches any character. Thus,
/ATG...UGA/ would match ATG followed by 3 additional characters
of any type, followed by UGA.
• Note that /.*/ matches any string, even the empty string. It means “0
or more of any character”.
• To match a literal period, escape it: \.
• By default, the “.” doesn’t match a newline.
• Chars that need to be escaped to match them literally:
\ | / ( ) [ ] { } ^ $ * + ? . (the / only because it delimits the pattern in Perl)
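A Python sketch of the escaping rules above. The sample strings are invented; re.escape is Python's standard helper for escaping a whole literal string, shown as one convenient option.

```python
import re

plus_found = bool(re.search(r"\+", "1+1"))            # literal plus sign
backslash_found = bool(re.search(r"\\", r"C:\tmp"))   # literal backslash
codon = re.search(r"ATG...UGA", "CCATGAAAUGACC").group()  # any 3 chars in between
empty_ok = bool(re.fullmatch(r".*", ""))              # .* accepts the empty string
dot_newline = bool(re.search(r"a.b", "a\nb"))         # "." skips newlines by default
literal_dot = [bool(re.search(r"a\.b", s)) for s in ("a.b", "axb")]

# re.escape builds a pattern that matches the given text literally.
raw = "1+1=2?"
escaped_ok = bool(re.fullmatch(re.escape(raw), raw))

print(plus_found, backslash_found, codon)  # True True ATGAAAUGA
print(empty_ok, dot_newline, literal_dot)  # True False [True, False]
print(escaped_ok)                          # True
```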
12
Basic Assertions
• An assertion is a statement about the position of the match pattern within a
string.
• The most common assertions are “^”, which signifies the beginning of a
string, and “$”, which signifies the end of the string.
• Example:
my $str = "The dog";
$str =~ /dog/; # matches
$str =~ /^dog/; # no match: “d” would have to be the first character
$str =~ /dog$/; # matches: “g” is the last character
• Another common assertion: “\b” signifies the beginning or end of a word.
For example:
$str = "There is a dog";
$str =~ /The/ ; # matches
$str =~ /The\b/ ; # doesn’t match because the “e” isn’t at the end of the
# word
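The assertion examples translate almost one to one into Python, which makes them easy to verify. A minimal sketch:

```python
import re

s = "The dog"
anywhere = bool(re.search(r"dog", s))   # found somewhere in the string
at_start = bool(re.search(r"^dog", s))  # "dog" is not at the start
at_end = bool(re.search(r"dog$", s))    # "dog" is at the end

t = "There is a dog"
prefix = bool(re.search(r"The", t))          # "The" is the start of "There"
whole_word = bool(re.search(r"The\b", t))    # no word boundary after "The"
dog_word = bool(re.search(r"\bdog\b", t))    # "dog" stands alone as a word

print(anywhere, at_start, at_end)    # True False True
print(prefix, whole_word, dog_word)  # True False True
```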
13
Character Classes
• A character class is a way of matching 1 character in the
string being searched to any of a number of characters
in the search pattern.
• Character classes are defined using square brackets.
So [135] matches any of 1, 3, or 5.
• A range of characters (based on ASCII order) can be
used in a character class: [0-7] matches any digit
between 0 and 7, and [a-z] matches any lowercase (but not
capital) letter. Modifiers such as (?i) (ignore case) are allowed.
14
More Character Classes
• To negate a char class, that is, to match any character
EXCEPT what is in the class, use the caret ^ as the first
symbol in the class. [^0-9] matches any character that
isn’t a digit. [^-0-9] matches any char that isn’t a hyphen
or a digit.
• Quantifiers can be used with character classes. [135]+
matches 1 or more of 1, 3, or 5 (in any combination).
[246]{8} matches 8 of 2, 4, and 6 in any combination.
Ex. (hex number): ExecRegExpr('^(0x)?[0-9A-F]+$',ast);
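Character classes, negation and the hex check can be sketched in Python as follows. The test strings are invented, and fullmatch stands in for the ^...$ anchors of the Pascal call.

```python
import re

runs = re.findall(r"[135]+", "a135b24c51")   # runs made of 1, 3 or 5
non_digits = re.findall(r"[^0-9]", "a1-b2")  # everything except digits
no_hyphen = re.findall(r"[^-0-9]", "a1-b2")  # except digits and the hyphen

hex_pattern = r"(0x)?[0-9A-F]+"
hex_checks = [bool(re.fullmatch(hex_pattern, s)) for s in ("0x1F", "0x1G", "ff")]
# With the ignore-case flag, lowercase hex digits are accepted too.
hex_ignore_case = bool(re.fullmatch(hex_pattern, "ff", flags=re.IGNORECASE))

print(runs)             # ['135', '51']
print(non_digits)       # ['a', '-', 'b']
print(no_hyphen)        # ['a', 'b']
print(hex_checks)       # [True, False, False]
print(hex_ignore_case)  # True
```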
15
Preset Character Classes
• Several groups of characters are so widely used that
they are given special meanings. These don't need to be
put inside square brackets unless you want to include
other chars in the class.
• \d = any digit = [0-9]
• \s = white-space (spaces, tabs, newlines) = [ \t\n]
• \w = word character = [a-zA-Z0-9_]
• The negation of these classes uses the capital letters: \D
= any non-digit, \S = any non-white-space character, and
\W = any non-word char.
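A short Python sketch of the preset classes on invented sample strings. One caveat: on Python str patterns these classes are Unicode-aware unless the re.ASCII flag is given, so they are slightly broader than the ASCII sets listed above.

```python
import re

numbers = re.findall(r"\d+", "Regex 55 Telex 7")  # runs of digits
fields = re.split(r"\s+", "a \t b\nc")            # split on any white-space run
words = re.findall(r"\w+", "snake_case, x2!")     # letters, digits, underscore
digits_only = re.sub(r"\D", "", "+41 (0)79 123")  # drop every non-digit

print(numbers)      # ['55', '7']
print(fields)       # ['a', 'b', 'c']
print(words)        # ['snake_case', 'x2']
print(digits_only)  # 41079123
```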
16
Alternatives
• Alternative match patterns are separated
by the “|” character. Thus:
$str = "My pet is a dog.";
$str =~ /dog|cat|bird/; # matches “dog” or “cat” or “bird”
• Note: there is no need to group the chars
with parentheses. Use of a | (pipe) implies
all of the chars between the delimiters (or between
enclosing parentheses, if any).
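The scope of the pipe is easiest to see by comparing a pattern with and without parentheses. A Python sketch; the "my dog|cat" patterns are invented for illustration.

```python
import re

pet = re.search(r"dog|cat|bird", "My pet is a dog.").group()

# Without parentheses the alternatives are "my dog" and "cat".
loose = bool(re.fullmatch(r"my dog|cat", "cat"))
# With parentheses only "dog" and "cat" alternate; "my " is always required.
strict_cat = bool(re.fullmatch(r"my (dog|cat)", "cat"))
strict_my_cat = bool(re.fullmatch(r"my (dog|cat)", "my cat"))

print(pet)                               # dog
print(loose, strict_cat, strict_my_cat)  # True False True
```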
17
Memory Capture
• It is possible to save part or all of the string that matches
the pattern. To do this, simply group chars to be saved in
parentheses. The matching string is saved in scalar vars
starting with $1.
$str = "The z number is z576890";
$str =~ /is z(\d+)/;
print $1; # prints “576890”
• Capture variables are numbered from left to right by the
position of the opening parenthesis:
/(the ((cat) (runs)))/ ;
captures: $1 = the cat runs; $2 = cat runs; $3 = cat;
$4 = runs.
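In Python the captured groups are read from the match object instead of $1, $2 and so on, but the numbering rule is the same. A minimal sketch:

```python
import re

m = re.search(r"is z(\d+)", "The z number is z576890")
number = m.group(1)  # group 1 corresponds to Perl's $1

# Groups are numbered by the position of their opening parenthesis.
nested = re.search(r"(the ((cat) (runs)))", "the cat runs").groups()

print(number)  # 576890
print(nested)  # ('the cat runs', 'cat runs', 'cat', 'runs')
```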
18
Greedy vs. Lazy Matching!?
• The regular expression engine does “greedy” matching by default. This
means that it attempts to match the maximum possible number of
characters, if given a choice. For example:
$str = "The dogggg";
$str =~ /The (dog+)/; print $1;
This prints “dogggg” because “g+”, one or more g’s, is interpreted to mean
the maximum possible number of g’s.
• Greedy matching can cause problems with the use of quantifiers. Imagine
that you have a long DNA sequence and you try to match /ATG(.*)TAG/.
The “.*” matches 0 or more of any character. Greedy matching causes this
to take the entire sequence between the first ATG and the last TAG. This
could be a very long matched sequence.
• Lazy matching matches the minimal number of characters. It is turned on
by putting a question mark “?” after a quantifier. Using the ex. above,
$str =~ /The (dog+?)/; print $1; # prints “dog”
and /ATG(.*?)TAG/ captures everything between the first ATG and the first
TAG that follows. This is usually what you want to do with large sequences.
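Greedy and lazy matching side by side, as a Python sketch. The short DNA string is made up so that it contains one ATG and two TAGs.

```python
import re

s = "The dogggg"
greedy_word = re.search(r"The (dog+)", s).group(1)
lazy_word = re.search(r"The (dog+?)", s).group(1)

dna = "CCATGAAATAGCCCTAGGG"  # hypothetical sequence: ATG ... TAG ... TAG
greedy_gene = re.search(r"ATG(.*)TAG", dna).group(1)   # runs up to the last TAG
lazy_gene = re.search(r"ATG(.*?)TAG", dna).group(1)    # stops at the first TAG

print(greedy_word, lazy_word)  # dogggg dog
print(greedy_gene, lazy_gene)  # AAATAGCCC AAA
```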
19
More on Languages
20
Real Examples
• 1. Finding blank lines. They might have a space
or tab on them, so use /^\s*$/
• 2. Extracting sub-patterns by index number with
Text Analysis: captureSubString()
• 3. Code Analysis by SONAR Metrics
• 4. Extract Weather Report with JSON
• 5. Get Exchange Rate from HTML
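Example 1 is small enough to sketch directly. The snippet below filters blank lines out of an invented list of lines using the /^\s*$/ pattern in Python:

```python
import re

lines = ["first", "", "  \t", "second", " x "]  # sample input lines
blank = re.compile(r"^\s*$")  # nothing but optional white-space

kept = [ln for ln in lines if not blank.match(ln)]
print(kept)  # ['first', 'second', ' x ']
```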
21
More Real REX
• 6. POP Song Finder
Also, there are some common numerical/letter
mixed expressions: 1st for first, for ex. So, \w by
itself won’t match everything that we consider a
word in common English.
• 7. Decorate URLs (big REX called TREX)
Part of the hyperlink found must be included in the
visible part of the URL, for ex.
'http://soft.ch/index.htm' will be decorated as
'<a href="http://soft.ch/index.htm">soft.ch</a>'.
22
Last Example (Lost)
const
  URLTemplate =
    '(?i)'
    + '('
    + '(FTP|HTTP)://' // Protocol
    + '|www\.)' // trick to catch links without
    // protocol - by detecting of starting 'www.'
    + '([\w\d\-]+(\.[\w\d\-]+)+)' // TCP addr or domain name
    + '(:\d\d?\d?\d?\d?)?' // port number
    + '(((/[%+\w\d\-.]*)+)*)' // unix path
    + '(\?[^\s=&]+=[^\s=&]+(&[^\s=&]+=[^\s=&]+)*)?'
    // request (GET) params
    + '(#[\w\d\-%+]+)?'; // bookmark
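The URL decoration idea can be sketched with a much smaller pattern than the template above. This is not the deck's implementation: the pattern is deliberately simplified (no port, query string or bookmark), and decorate and example.com are hypothetical names chosen for illustration.

```python
import re

# Simplified assumption: protocol or leading "www.", a dotted host, an optional path.
URL = re.compile(r"(?i)\b((?:ftp|http)://|www\.)([\w-]+(?:\.[\w-]+)+)((?:/[%+\w.-]*)*)")

def decorate(text):
    """Wrap each URL found in text in an <a> tag, showing only the host part."""
    def repl(m):
        prefix, host, _path = m.groups()
        has_protocol = "://" in prefix
        href = m.group(0) if has_protocol else "http://" + m.group(0)
        visible = host if has_protocol else prefix + host
        return '<a href="{}">{}</a>'.format(href, visible)
    return URL.sub(repl, text)

print(decorate("see http://example.com/index.htm now"))
# see <a href="http://example.com/index.htm">example.com</a> now
```

Passing a function to sub lets the replacement be built from the captured groups, which is what makes the visible part differ from the link target.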
23
Be aware of
//Greedy or Lazy Pitfall
Writeln(ReplaceRegExpr('<GTA>(.*?)<TGA>',
  'DNA:Test <GTA>TGAAUUTGA<TGA>GTUUGGGAAACCCA<TGA>-sign','',true));
//Alarm Range Pitfall {0-255}
writeln(botoStr(ExecRegExpr('[0-255]+','555'))); //true: [0-255] is the char class 0,1,2,5, not the range 0..255
writeln(botoStr(ExecRegExpr('[/D]+','123'))); //false, but for the wrong reason: [/D] is the chars / and D, not \D
{stable design is to consider what it should NOT match}
//Optional Pitfall - too many options 0..n {empty numbs}
writeln(botoStr(ExecRegExpr('^\d*$',''))); //true: the empty string matches too
• Regular expressions don’t work very well with nested
delimiters or other tree-like data structures, such as in an
HTML table or an XML document.
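Two of these pitfalls, the range and the optional one, can be reproduced in Python. The BYTE pattern is one common way to spell out 0..255 digit by digit; it is an illustration, not something taken from the slides.

```python
import re

# [0-255] is the character class {0, 1, 2, 5}, so "555" is accepted.
class_hit = bool(re.fullmatch(r"[0-255]+", "555"))

# A numeric range has to be described digit by digit.
BYTE = re.compile(r"25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d")
byte_checks = [bool(BYTE.fullmatch(s)) for s in ("0", "255", "256", "555")]

# \d* also accepts the empty string; \d+ requires at least one digit.
star_empty = bool(re.fullmatch(r"\d*", ""))
plus_empty = bool(re.fullmatch(r"\d+", ""))

print(class_hit)               # True
print(byte_checks)             # [True, True, False, False]
print(star_empty, plus_empty)  # True False
```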
24
Conclusion
* a match the character a
* a|b match either a or b
* a? match a or no a (optionality)
* a* match any number of a or no a (optional with repetition)
* a+ match one or more a (required with repetition)
* . match any one character (tab, space or visible char)
* (abc) match characters a, b and c in that order
* [abc] match any one of a, b, c (no order)
* [a-g] match any letter from a to g
* \d match any digit [0-9]
* [A-Za-z] match any letter
* \w match any letter, digit or underscore [0-9A-Za-z_]
* \t match a tab (#9)
* \n match a newline (#10)
* \b match a word boundary (start or end of a word)
* ^ match start of string
* $ match end of string
25
You choose
function StripTags2(const S: string): string;
var
  Len, i, APos: Integer;
begin
  Len:= Length(S);
  i:= 0;
  Result:= '';
  while (i <= Len) do begin
    Inc(i);
    APos:= ReadUntil(i, Len, '<', S);
    Result:= Result + Copy(S, i, APos-i);
    i:= ReadUntil(APos+1, Len, '>', S);
  end;
end;

Writeln(ReplaceRegExpr('<(.*?)>',
  '<p>This is text.<br/> This is line 2</p>','',true));
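For comparison, the regex variant of tag stripping in Python, with the greedy form next to it to show why the lazy quantifier matters here. A sketch only; as noted earlier in the deck, regexes are a poor fit for nested markup.

```python
import re

html = "<p>This is text.<br/> This is line 2</p>"

stripped = re.sub(r"<.*?>", "", html)  # lazy: removes each tag separately
greedy = re.sub(r"<.*>", "", html)     # greedy: first "<" to last ">" is removed

print(stripped)      # This is text. This is line 2
print(repr(greedy))  # ''
```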
http://www.softwareschule.ch/maxbox.htm
26
Thanks a Lot!
https://github.com/maxkleiner/maXbox3/releases
