Secrets of Regexp
      Hiro Asari
     Red Hat, Inc.
Let's Talk About
  Regular Expressions


• There is no regular expression
• A good approximation as a name
Let's Talk About
     Regexp
Some people, when confronted
         with a problem, think, "I know,
          I'll use regular expressions."
        Now they have two problems.

                                                              Jamie Zawinski
                                                                 12 Aug, 1997




http://regex.info/blog/2006-09-15/247
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

The point is not so much the evils of regular expressions as the evils of overusing them.
Formal Language
         Theory

• The Language L
• Over Alphabet Σ
Formal Language
          Theory

• Alphabet Σ={a, b, c, d, e, …, z, λ} (example)
• Words over Σ: "a", "b", "ab", "aequafdhfad"
• Σ*: The set of all words over Σ
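To make Σ* concrete, here is a minimal sketch that enumerates the words over a small alphabet up to a given length. Σ* itself is infinite, so the length cap, the two-letter alphabet, and the helper name `words_up_to` are assumptions made only for illustration.

```ruby
# Enumerate every word over `alphabet` whose length is at most `max_length`.
# Hypothetical helper; Σ* is infinite, so we can only list a finite prefix of it.
def words_up_to(alphabet, max_length)
  (0..max_length).flat_map do |length|
    # repeated_permutation(0) yields one empty array, i.e. the empty word.
    alphabet.repeated_permutation(length).map(&:join)
  end
end

sigma = %w[a b]
p words_up_to(sigma, 2)
```

The number of words of length n is |Σ|ⁿ, so the list grows quickly as the cap is raised.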
Formal Language
         over Σ

• A subset L of Σ* (with various properties)
• L can be finite, in which case its well-formed
  words can simply be enumerated, but it is often infinite
Example

• Language L over Σ = {a,b}
• 'a' is a word
• a word may be obtained by appending 'ab'
  to an existing word
• only words thus formed are legal
Well-formed words
a
aab
aabab
Ill-formed words
b
aaaab
abb
Succinctly…


• a(ab)*
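The well-formed and ill-formed words listed earlier can be checked against this expression. The sketch below anchors the pattern with \A and \z so that the whole word must belong to the language; the constant and method names are assumptions for illustration.

```ruby
# The language from the example: 'a' followed by zero or more 'ab'.
# \A and \z force the entire word to match, not just a substring.
EXAMPLE_LANGUAGE = /\Aa(ab)*\z/

def well_formed?(word)
  EXAMPLE_LANGUAGE.match?(word)
end

%w[a aab aabab b aaaab abb].each do |word|
  puts "#{word}: #{well_formed?(word) ? 'well-formed' : 'ill-formed'}"
end
```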
Expression

• Textual representation of a formal language,
  against which an input is tested to see
  whether it is a well-formed word in that
  language
Regular Languages
• ∅ (empty language) is regular
• For each a ∈ Σ (a belongs to Σ), the
  singleton language {a} is a regular language.
• If A and B are regular languages, then A ∪ B
  (union), A•B (concatenation), and A*
  (Kleene star) are regular languages
• No other languages over Σ are regular.
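The three closure operations map directly onto familiar Regexp syntax. This is a rough sketch, assuming single-character languages {a} and {b} as the building blocks; the helper `in_language?` is hypothetical and simply anchors the pattern so that the whole word is tested.

```ruby
# Building blocks: the singleton languages {a} and {b}.
A = /a/
B = /b/

UNION         = Regexp.union(A, B)  # A ∪ B
CONCATENATION = /#{A}#{B}/          # A•B
KLEENE_STAR   = /(?:#{A})*/         # A*

# Hypothetical helper: is `word` a member of the language described by `pattern`?
def in_language?(pattern, word)
  /\A(?:#{pattern})\z/.match?(word)
end

p in_language?(UNION, 'b')
p in_language?(CONCATENATION, 'ab')
p in_language?(KLEENE_STAR, 'aaa')
```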
Regular Expressions


• Expressions of regular languages
Regular Expressions

• Not expressions of regular languages
Regular? Expressions

• It turns out that some expressions are
  more powerful and express non-regular
  languages
• Language of 'squares': (.*)\1
 • a, aa, aaaa, WikiWiki
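A backreference is what takes the pattern beyond regular languages: \1 must repeat whatever the group captured. The sketch below anchors the pattern so that the entire word has to be some string written twice; the names `SQUARE` and `square?` are assumptions for illustration.

```ruby
# A word is a "square" if it is some string repeated twice.
# \1 refers back to whatever (.*) captured.
SQUARE = /\A(.*)\1\z/

def square?(word)
  SQUARE.match?(word)
end

%w[aa aaaa WikiWiki Wiki abc].each do |word|
  puts "#{word}: #{square?(word)}"
end
```

Because the group may capture the empty string, the empty word also counts as a square under this anchored pattern.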
How does Regexp
        work?

• Build a finite state automaton representing
  a given regular expression
• Feed the string to the automaton
  and see if the match succeeds
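[Figures: state diagrams of the automata for a, ab*, .*, a$, a?, a|b, (ab|c), and (ab+|c). Each edge is labeled with the character it consumes; the diagram for a? also has an ε edge that skips the a.]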
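As a rough illustration of "build an automaton, then feed it the string", here is a hand-built deterministic automaton for ab*. This is only a sketch: the hash layout, state names, and `accepts?` helper are assumptions, and real Regexp engines are implemented quite differently.

```ruby
# Hand-built automaton for ab*: one 'a', then any number of 'b'.
AB_STAR = {
  start: :s0,
  accepting: [:s1],
  transitions: {
    [:s0, 'a'] => :s1, # consume the leading 'a'
    [:s1, 'b'] => :s1  # loop on 'b'
  }
}.freeze

# Walk the automaton one character at a time; a missing transition means rejection.
def accepts?(automaton, input)
  state = automaton[:start]
  input.each_char do |char|
    state = automaton[:transitions][[state, char]]
    return false if state.nil?
  end
  automaton[:accepting].include?(state)
end

p accepts?(AB_STAR, 'abbb')
p accepts?(AB_STAR, 'aba')
```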
Match is attempted at
every character, left to
        right
/a$/
         zyxwvutsrqponmlkjihgfedcba
         ^
         zyxwvutsrqponmlkjihgfedcba
           ^
         zyxwvutsrqponmlkjihgfedcba
             ^
         zyxwvutsrqponmlkjihgfedcba
               ^
         ⋮
         zyxwvutsrqponmlkjihgfedcba
                                  ^




Regexp does not think, "'a$' can match only at the end of the line, so we should fast-forward
to the end of the line."
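The left-to-right behavior can be modeled by trying an anchored match at each starting position in turn. This is a simplified model, not how the engine is implemented (real engines may apply optimizations); the helper name `attempts_until_match` is an assumption, and it relies on StringScanner from the standard library, whose `match?` tests the pattern only at the current position.

```ruby
require 'strscan'

# Count how many starting positions are tried before `pattern` matches `text`.
# Returns nil if no starting position produces a match.
def attempts_until_match(pattern, text)
  scanner = StringScanner.new(text)
  (0..text.bytesize).each do |start|
    scanner.pos = start
    # StringScanner#match? is anchored at the current position.
    return start + 1 if scanner.match?(pattern)
  end
  nil
end

p attempts_until_match(/a$/, 'zyxwvutsrqponmlkjihgfedcba')
```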
^\s*(.*)\s*$
      abc d a dfadg
^
      abc d a dfadg
 ^
      abc d a dfadg
     ^
      abc d a dfadg
      ^

# (.*) is greedy, so it matches 'abc d a dfadg   ', trailing spaces included
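A short sketch of the capture in question, next to a non-greedy variant. The sample text with six leading and three trailing spaces is an assumption based on the slide, and the constant names are illustrative.

```ruby
text = "      abc d a dfadg   "

GREEDY = /^\s*(.*)\s*$/
LAZY   = /^\s*(.*?)\s*$/

# String#[] with a capture index returns that group's text.
greedy_capture = text[GREEDY, 1] # (.*) swallows the trailing spaces
lazy_capture   = text[LAZY, 1]   # (.*?) leaves them for the final \s*

p greedy_capture
p lazy_capture
```

For plain whitespace trimming, String#strip does the same job without a Regexp.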
a?a?a?…a?aaa…a
def pathological(n=5)
  Regexp.new('a?' * n + 'a' * n)
end


1.upto(40) do |n|
  print n, ": "
  print Time.now, "\n" if 'a'*n =~ pathological(n)
end
a?a?a?aaa
aaa
^
Regexp tips
Use /x
UP_TO_256 = /\b(?:25[0-5]   #   250-255
|2[0-4][0-9]                #   200-249
|1[0-9][0-9]                #   100-199
|[1-9][0-9]                 #   2-digit numbers
|[0-9])                     #   single-digit numbers
\b/x

IPV4_ADDRESS = /#{UP_TO_256}(?:\.#{UP_TO_256}){3}/
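A possible way to use such a pattern for whole-string validation is sketched below. The names `OCTET`, `DOTTED_QUAD`, and `ipv4?` are hypothetical, and the \A/\z anchoring is an addition for this sketch rather than part of the slide. Interpolating a Regexp preserves its /x flag, so the whitespace inside `OCTET` stays insignificant.

```ruby
# One number from 0 to 255, written with /x so the alternatives can be spaced out.
OCTET = /(?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9][0-9] | [0-9] )/x

# Four octets separated by literal dots, anchored to the whole string.
DOTTED_QUAD = /\A#{OCTET}(?:\.#{OCTET}){3}\z/

def ipv4?(text)
  DOTTED_QUAD.match?(text)
end

p ipv4?('192.168.0.1')
p ipv4?('256.1.1.1')
```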
\A, \z for strings
       ^, $ for lines
• \A: the beginning of the string
• \z: the end of the string
• ^: after \n   (always in Ruby)
• $: before \n  (always in Ruby)
What's the problem?
         #! /usr/bin/env perl
         $a = "abc\ndef";
         if ($a =~ /^d/) {
           print "yes\n";
         }
         if ($a =~ /^d/m) {
           print "yes now\n";
         }
         # prints 'yes now'




also note the difference in what /m means: in Perl it makes ^ and $ match at line boundaries, while in Ruby it makes . match newlines
What's the problem?
         #! /usr/bin/env ruby

         a = "abc\ndef"
         if a =~ /^d/
           p "yes"
         end
         # prints "yes"




http://guides.rubyonrails.org/security.html#regular-expressions
Security Implications
         class File < ActiveRecord::Base
           validates :name, :format => /^[\w\.\-\+]+$/
         end




http://guides.rubyonrails.org/security.html#regular-expressions
file.txt%0A<script>alert('hello')</script>
file.txt\n<script>alert('hello')</script>


             /^[\w\.\-\+]+$/

            Match succeeds
    ActiveRecord validation succeeds
file.txt\n<script>alert('hello')</script>


            /\A[\w\.\-\+]+\z/

               Match fails
       ActiveRecord validation fails
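The difference can be reproduced in plain Ruby without Rails. In this sketch the constant names are assumptions, and the input is the already-decoded file name from the slides.

```ruby
malicious = "file.txt\n<script>alert('hello')</script>"

LINE_ANCHORED   = /^[\w\.\-\+]+$/   # ^ and $ match around each line
STRING_ANCHORED = /\A[\w\.\-\+]+\z/ # \A and \z match only at the string's ends

# The first line alone satisfies the line-anchored pattern,
# so the script payload after \n slips through.
p malicious.match?(LINE_ANCHORED)
p malicious.match?(STRING_ANCHORED)
```

A harmless name such as "file.txt" passes both patterns, so switching to \A and \z rejects only the multi-line input.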
Prefer Character Class
     to Alternations
require 'benchmark'

# simple benchmark for alternations and character class

n = 5_000

str = 'cafebabedeadbeef' * n

Benchmark.bmbm do |x|
  x.report('alternation') do
    str =~ /^(a|b|c|d|e|f)+$/
  end
  x.report('character class') do
    str =~ /^[a-f]+$/
  end
end
Benchmarks
Ruby 1.8.7
                      user     system      total         real
alternation       0.030000   0.010000   0.040000 (   0.036702)
character class   0.000000   0.000000   0.000000 (   0.004704)

Ruby 2.0.0
                      user     system      total         real
alternation       0.020000   0.010000   0.030000 (   0.023139)
character class   0.000000   0.000000   0.000000 (   0.009641)

JRuby 1.7.4.dev
                      user     system      total       real
alternation       0.030000   0.000000   0.030000 ( 0.021000)
character class   0.010000   0.000000   0.010000 ( 0.007000)
Beware of Character
                 Classes
         # case-insensitively match any non-word character…

         # one is unlike the others
         'r' =~ /(?i:[\W])/
         's' =~ /(?i:[\W])/     # matches, even though 's' is a word character
         't' =~ /(?i:[\W])/




https://bugs.ruby-lang.org/issues/4044
/^1?$|^(11+?)\1+$/
• ^1?$ matches '1' or ''
• (11+?) non-greedily matches 2 or more 1's
• \1+ matches that same group 1 or more additional times
• ^(11+?)\1+$ therefore matches a composite number of 1's
Matches a string of 1's if and only
if the number of 1's is not prime
Integer#prime?
          class Integer
            def prime?
              "1" * self !~ /^1?$|^(11+?)\1+$/
            end
          end




                         No performance guarantee




Attributed to the Perl hacker Abigail
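The same test can be tried without reopening Integer. In this sketch the method name `prime_by_regexp?` is an assumption; as the slide warns, there is no performance guarantee, so it is only suitable for small numbers.

```ruby
NON_PRIME_UNARY = /^1?$|^(11+?)\1+$/

# Write n in unary (n copies of "1"); n is prime when the pattern does not match.
def prime_by_regexp?(n)
  ('1' * n) !~ NON_PRIME_UNARY
end

p (0..20).select { |n| prime_by_regexp?(n) }
```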
• @hiro_asari
• Github: BanzaiMan
