Introduction to Regular Expressions Ben Brumfield THATCamp Texas 2011
What are Regular Expressions? Very small language for describing text. Not a programming language. Incredibly powerful tool for search/replace operations. Arcane art. Ubiquitous.
Why Use Regular Expressions? Finding every instance of a string in a file, e.g. every mention of “chickens” in a farm diary. Counting how many times “sing” appears in a text in all tenses and conjugations. Reformatting dirty data. Validating input. Command line work: listing files, grepping log files.
The Basics A regex is a pattern enclosed within delimiters. Most characters match themselves. /THATCamp/ is a regular expression that matches “THATCamp”. Slash is the delimiter enclosing the expression. “THATCamp” is the pattern.
/at/ Matches strings with “a” followed by “t”. Test strings: Athens, aft, atlas, that, hat, at. The pattern matches the “at” in atlas, that, hat, and at. It does not match Athens (capital “A”) or aft.
Some Theory Finite State Machine for the regex /at/
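As a rough sketch of the idea, the state machine for /at/ can be written by hand in Python. This is an illustration only, not how real regex engines are built; the function name matches_at and the state numbering are assumptions made for this example.

```python
def matches_at(text):
    """Return True if text contains "a" immediately followed by "t"."""
    state = 0  # 0 = nothing useful seen yet, 1 = the previous character was "a"
    for ch in text:
        if state == 1 and ch == "t":
            return True  # reached the accepting state
        # An "a" moves to (or stays in) state 1; anything else resets to 0.
        state = 1 if ch == "a" else 0
    return False
```

Calling matches_at("atlas") walks state 0, then state 1 on “a”, then accepts on “t”.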
Characters Matching is case sensitive.  Special characters: ( ) ^ $ { } [ ] \ | . + ? * To match a special character in your text, precede it with \ in your pattern: /ironic [sic]/ does not match “ironic [sic]” /ironic \[sic\]/ matches “ironic [sic]” Regular expressions can support Unicode.
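The escaping rule can be tried out in Python. This is a minimal sketch that assumes Python's standard re module, where patterns are written as raw strings rather than between slashes; the variable names are illustrative.

```python
import re

text = "ironic [sic]"

# Unescaped, [sic] is a character class: exactly one of "s", "i" or "c".
unescaped = re.search(r"ironic [sic]", text)

# Escaped, the brackets are literal characters.
escaped = re.search(r"ironic \[sic\]", text)

# re.escape builds an escaped pattern from a literal string.
auto_escaped = re.search(re.escape("ironic [sic]"), text)

print(unescaped)             # None
print(escaped.group())       # ironic [sic]
print(auto_escaped.group())  # ironic [sic]
```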
Character Classes Characters within [ ] are choices for a single-character match. Think of a set operation, or a type of “or”. Order within the set is unimportant. /x[01]/ matches “x0” and “x1”. /[10][23]/ matches “02”, “03”, “12” and “13”. An initial ^ negates the class: /[^45]/ matches any single character except 4 or 5.
/[ch]at/ Matches strings with “c” or “h”, followed by “a”, followed by “t”. Test strings: phat, fat, cat, chat, at, that. The pattern matches the “hat” in phat, chat, and that, and all of cat. It does not match fat or at.
Ranges Ranges define sets of characters within a class. /[1-9]/ matches any non-zero digit. /[a-zA-Z]/ matches any unaccented Latin letter. /[12][0-9]/ matches the numbers 10 through 29.
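Classes and ranges can be checked with a short Python sketch. It assumes Python's re module; re.fullmatch is used so that the whole string, not just part of it, has to fit the pattern. The sample values are made up for illustration.

```python
import re

# Two-character pattern: first digit 1 or 2, second digit 0 through 9.
two_digit = re.compile(r"[12][0-9]")
candidates = ["9", "10", "17", "29", "30"]
in_range = [n for n in candidates if two_digit.fullmatch(n)]

# A negated class: every character that is not 4 or 5.
not_45 = re.findall(r"[^45]", "14525")

print(in_range)  # ['10', '17', '29']
print(not_45)    # ['1', '2']
```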
Shortcuts
Shortcut   Name        Equivalent Class
\d         digit       [0-9]
\D         not digit   [^0-9]
\w         word        [a-zA-Z0-9_]
\W         not word    [^a-zA-Z0-9_]
\s         space       [\t\n\r\f\v ]
\S         not space   [^\t\n\r\f\v ]
.          everything  [^\n] (depends on mode)
/\d\d\d[- ]\d\d\d\d/ Matches strings with: three digits, a space or dash, four digits. Test strings: 653-6464x256, PE6-5000, 713-342-7452, 652.2648, 234 1252, 501-1234. The pattern matches “653-6464” in 653-6464x256, “342-7452” in 713-342-7452, and all of 234 1252 and 501-1234. It does not match PE6-5000 or 652.2648.
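The same test strings can be run through the pattern in Python. This is a sketch assuming Python's re module; search reports the first place in each string where the pattern fits, and the dictionary name found is illustrative.

```python
import re

phone = re.compile(r"\d\d\d[- ]\d\d\d\d")
samples = ["653-6464x256", "PE6-5000", "713-342-7452",
           "652.2648", "234 1252", "501-1234"]

# Map each sample that matches to the exact text the pattern matched.
found = {}
for sample in samples:
    match = phone.search(sample)
    if match:
        found[sample] = match.group()

for sample, matched in found.items():
    print(sample, "->", matched)
```

Note that 713-342-7452 matches only from “342” onward, because “713-342-” does not have four digits after its first dash.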
Repeaters Symbols indicating that the preceding element of the pattern can repeat. /runs?/ matches runs or run. /1\d*/ matches any number beginning with “1”.
Repeater   Count
?          zero or one
+          one or more
*          zero or more
{n}        exactly n
{n,m}      between n and m times
{,m}       no more than m times
{n,}       at least n times
Repeaters Which patterns match which strings?
Strings: 1: “at” 2: “art” 3: “arrrrt” 4: “aft”
Patterns: A: /ar?t/ B: /a[fr]?t/ C: /ar*t/ D: /ar+t/ E: /a.*t/ F: /a.+t/
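Repeaters /ar?t/ matches “at” and “art” but not “arrrrt”. /a[fr]?t/ matches “at”, “art”, and “aft”. /ar*t/ matches “at”, “art”, and “arrrrt”. /ar+t/ matches “art” and “arrrrt” but not “at”. /a.*t/ matches anything with an ‘a’ eventually followed by a ‘t’. /a.+t/ matches “art”, “arrrrt”, and “aft” but not “at”, because at least one character must come between the ‘a’ and the ‘t’.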
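The full grid of patterns A through F against strings 1 through 4 can be computed in Python. This is a sketch assuming Python's re module; fullmatch is used because each test string is a single word, and the dictionary names are illustrative.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]
patterns = {"A": r"ar?t", "B": r"a[fr]?t", "C": r"ar*t",
            "D": r"ar+t", "E": r"a.*t", "F": r"a.+t"}

# For each labelled pattern, collect the strings it matches in full.
results = {
    label: [s for s in strings if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(label, patterns[label], matched)
```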
Lab Session I http://gskinner.com/RegExr/ https://gist.github.com/922838
Match the titles “Mr.” and “Ms.”.
Find all conjugations and tenses of “sing”.
Find all places where more than one space follows punctuation.
Lab Reference
Repeater   Count
?          zero or one
+          one or more
*          zero or more
{n}        exactly n
{n,m}      between n and m times
{,m}       no more than m times
{n,}       at least n times
Shortcut   Name
\d         digit
\D         not digit
\w         word
\W         not word
\s         space
\S         not space
.          everything
Anchors Anchors match between characters. Used to assert that the characters you’re matching must appear in a certain place. /\bat\b/ matches “at work” but not “batch”.
Anchor   Matches
^        start of line
$        end of line
\b       word boundary
\B       not boundary
\A       start of string
\Z       end of string
\z       raw end of string (rare)
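A short Python sketch shows the anchors at work. It assumes Python's re module, which differs in two details from the Perl-style anchors on the slide: ^ and $ match at every line only when the re.MULTILINE flag is set, and Python's \Z behaves like the slide's \z. The sample strings are made up for illustration.

```python
import re

# \b keeps "at" from matching inside a longer word.
word_at = re.compile(r"\bat\b")
samples = ["at work", "batch", "look at that"]
hits = [s for s in samples if word_at.search(s)]

# With re.MULTILINE, ^ matches at the start of every line;
# \A still matches only at the start of the whole string.
diary = "fed chickens\nchickens escaped"
line_starts = re.findall(r"^chickens", diary, flags=re.MULTILINE)
string_starts = re.findall(r"\Achickens", diary, flags=re.MULTILINE)

print(hits)           # ['at work', 'look at that']
print(line_starts)    # ['chickens']
print(string_starts)  # []
```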
Alternation In Regex, | means “or”. You can put a full expression on the left and another full expression on the right. Either can match. /seeks?|sought/ matches “seek”, “seeks”, or “sought”.
Grouping Everything within ( … ) is grouped into a single element for the purposes of repetition and alternation. The expression /(la)+/ matches “la”, “lala”, “lalalala” but not “all”. /schema(ta)?/ matches “schema” and “schemata”; in “schematic” it matches only the leading “schema”, so add \b anchors to exclude that word.
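Grouping can be tried out in Python. This sketch assumes Python's re module and uses the slide's own examples; note that an unanchored search still finds “schema” inside “schematic”, while a version with \b anchors does not. The variable names are illustrative.

```python
import re

# The group (la) repeats as a unit.
la = re.compile(r"(la)+")
la_hits = [s for s in ["la", "lala", "lalalala", "all"] if la.fullmatch(s)]

# Unanchored: the optional group is skipped and "schema" is found.
loose = re.search(r"schema(ta)?", "schematic")

# Anchored with word boundaries: no match in "schematic".
anchored = re.search(r"\bschema(ta)?\b", "schematic")

print(la_hits)        # ['la', 'lala', 'lalalala']
print(loose.group())  # schema
print(anchored)       # None
```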
Grouping Example What regular expression matches “eat”, “eats”, “ate” and “eaten”? Answer: /eat(s|en)?|ate/ Add word boundary anchors to exclude “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
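Both versions of the “eat” pattern can be compared in Python. This is a sketch assuming Python's re module; the names loose and strict are illustrative labels for the pattern without and with word boundaries.

```python
import re

loose = re.compile(r"eat(s|en)?|ate")
strict = re.compile(r"\b(eat(s|en)?|ate)\b")

words = ["eat", "eats", "ate", "eaten", "sate", "eating"]
loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)   # all six words: "sate" contains "ate", "eating" contains "eat"
print(strict_hits)  # ['eat', 'eats', 'ate', 'eaten']
```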
Replacement Regex is most often used for search/replace. Syntax varies; most scripting languages and CLI tools use s/pattern/replacement/. s/dog/hound/ converts “slobbery dogs” to “slobbery hounds”. s/\bsheeps\b/sheep/ converts “sheepskin is made from sheeps” to “sheepskin is made from sheep”.
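Python does not use the s/pattern/replacement/ syntax; as a sketch, the equivalent call in its re module is re.sub(pattern, replacement, text), which replaces every match in the text. The variable names are illustrative.

```python
import re

hounds = re.sub(r"dog", "hound", "slobbery dogs")

# \b stops the pattern from touching the "sheeps" inside "sheepskin".
sheep = re.sub(r"\bsheeps\b", "sheep", "sheepskin is made from sheeps")

print(hounds)  # slobbery hounds
print(sheep)   # sheepskin is made from sheep
```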
Capture During searches, ( … ) groups capture the text they match for use in replacement. Special variables $1, $2, $3 etc. contain the captures. Applying /(\d\d\d)-(\d\d\d\d)/ to “123-4567”: $1 contains “123” and $2 contains “4567”.
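In Python the captures are read from the match object rather than from $1 and $2. A minimal sketch, assuming Python's re module, with illustrative variable names:

```python
import re

match = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")

# group(1) and group(2) play the role of $1 and $2.
first = match.group(1)
second = match.group(2)

print(first)   # 123
print(second)  # 4567
```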
Capture How do you convert “Smith, James” and “Jones, Sally” to “James Smith” and “Sally Jones”? Answer: s/(\w+), (\w+)/$2 $1/
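The name swap can be sketched in Python, where the replacement string refers to captures as \1 and \2 instead of $1 and $2. This assumes Python's re module; the list names are illustrative.

```python
import re

names = ["Smith, James", "Jones, Sally"]

# \2 \1 writes the second capture, a space, then the first capture.
swapped = [re.sub(r"(\w+), (\w+)", r"\2 \1", name) for name in names]

print(swapped)  # ['James Smith', 'Sally Jones']
```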
Capture Given a file containing URLs, create a script that wgets each URL: http://bit.ly/DHapiTRANSCRIBE becomes wget "http://bit.ly/DHapiTRANSCRIBE". Answer: s/^(.*)$/wget "$1"/
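A Python sketch of the same line-by-line rewrite follows. It assumes Python's re module and two placeholder example.com URLs that are not from the original slides. It also uses (.+) rather than (.*) so that an empty line, including the empty position after a trailing newline, does not produce a bare wget "" command.

```python
import re

# Placeholder input standing in for the contents of the URL file.
urls = "http://example.com/a\nhttp://example.com/b\n"

# re.MULTILINE makes ^ and $ match at the start and end of every line.
script = re.sub(r"^(.+)$", r'wget "\1"', urls, flags=re.MULTILINE)

print(script)
```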
Lab Session II
Convert all Miss and Mrs. to Ms.
Convert infinitives to gerunds: “to sing” -> “singing”.
Extract last name, first name from (title first name last name): Dr. Thelma Dunn, Mr. Clay Shirky, Dana Gray.
Caveats Do not use regular expressions to parse (complicated) XML! Check the language/application-specific documentation: some common shortcuts are not universal.
Acknowledgments James Edward Gray II and Dana Gray. Much of the structure and some of the wording of this presentation comes from http://www.slideshare.net/JamesEdwardGrayII/regular-expressions-7337223
