Theory and Practice
of Data Cleaning
Regular Expressions: From Theory to Practice
Theory of Regular Expressions
• Base elements:
• ∅ empty set, ε empty string, and Σ alphabet of characters
• For regular expressions R, S, the following are regular expressions:
• R | S alternation
• R S concatenation
• R* Kleene star
• (R) parentheses (can be omitted with precedence rules)
• Regular languages …
• generated by regular (Type-3) grammars
• recognized (accepted) by a finite automaton
• expressed by regular expressions
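The equivalence between "recognized by a finite automaton" and "expressed by a regular expression" can be checked on a toy language. The sketch below is only an illustration: the language (binary strings with an even number of 1s), the regex, and the names DFA_TRANSITIONS, dfa_accepts and regex_accepts are assumptions, not part of the slides.

```python
import re
from itertools import product

# Hypothetical example language: binary strings with an even number of 1s.
EVEN_ONES = re.compile(r"0*(10*10*)*")

# The same language as a two-state deterministic finite automaton.
DFA_TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def dfa_accepts(word):
    """Run the automaton from the start state; accept in state 'even'."""
    state = "even"
    for symbol in word:
        state = DFA_TRANSITIONS[(state, symbol)]
    return state == "even"


def regex_accepts(word):
    """Accept only if the whole word matches the regular expression."""
    return EVEN_ONES.fullmatch(word) is not None


# Compare both recognizers on every binary string up to length 8.
all_words = ["".join(bits) for n in range(9) for bits in product("01", repeat=n)]
agree = all(dfa_accepts(w) == regex_accepts(w) for w in all_words)
print(agree)
```

Both recognizers are compared exhaustively on short strings only; this is a sanity check, not a proof of equivalence.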
Regular Grammars
• Not very handy in practice …
• Regular expressions to the rescue!
[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
Example: floating point numbers such as -0.314159265e+1
… can be generated by a right regular grammar G with
N = {S, A,B,C,D,E,F}, Σ = {0,1,2,3,4,5,6,7,8,9,+,-,.,e},
Production rules P =
Introduction to Regular Expressions (Regex)
Theory & Practice
• Theory of regular expressions:
• Brief introduction where regular expressions come from …
• Practice of regular expressions:
• What you need to know to get started with regex in practice!
• Demonstration of regular expressions
Practice of Regular Expressions
• Use case: Extract (then transform) data from text
• pi = -0.314159265e+1
• e = 0.2718281828E+1
• This regex will do the trick: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
• Character set [ … ] matches any single character in the set
• Optional element ... ? matches 0 or 1 occurrence
• Range [0-9] matches any single character in this range
• (Kleene) Star ... * matches 0 or more occurrences
• Dot . matches any character (except line breaks)
• Escape character \ ... take next character literally (no special meaning)
• Capturing group (...) group multiple tokens; capture group for backreference
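As a minimal sketch of the extract-then-transform use case, the following Python snippet applies the floating point pattern (with the decimal point written as \.) to the two example lines and converts the matches to numbers. The variable names are illustrative assumptions.

```python
import re

# The floating point pattern, with the decimal point escaped as \.
FLOAT_PATTERN = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")

text = "pi = -0.314159265e+1\ne = 0.2718281828E+1"

# finditer + group(0) yields the whole match; findall would return only
# the contents of the capturing group (the exponent part).
matches = [m.group(0) for m in FLOAT_PATTERN.finditer(text)]

# Transform step: turn the extracted strings into Python floats.
values = [float(s) for s in matches]

print(matches)  # ['-0.314159265e+1', '0.2718281828E+1']
print(values)
```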
Beware of False Negatives and False Positives
• False Negative
• your pattern does not match … although it should!
• you will notice this problem first (missing match results)
• Remedy: you need to “relax” the regex, so it matches the desired strings
• False Positive
• your pattern does match … although it shouldn’t!
• you might not notice this at first (false matches may occur sporadically)
• Remedy: you need to “tighten” the regex, so it matches fewer strings
(avoiding the false matches)
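Both failure modes can be reproduced with small variations of a number pattern. The three patterns and the helper matches_fully below are assumptions made up for illustration; they are one possible way to show "relaxing" and "tightening", not the only one.

```python
import re


def matches_fully(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Too strict: insists on a decimal point, so plain integers are missed.
too_strict = r"[0-9]+\.[0-9]+"
# Too loose: the unescaped dot matches any character, not just ".".
too_loose = r"[0-9]*.?[0-9]+"
# Balanced: optional integer part, optional literal decimal point.
balanced = r"[0-9]*\.?[0-9]+"

# False negative: "42" should match but does not.
print(matches_fully(too_strict, "42"))   # False
# Relaxing the pattern fixes it.
print(matches_fully(balanced, "42"))     # True

# False positive: "1x5" should not match but does.
print(matches_fully(too_loose, "1x5"))   # True
# Tightening the pattern (escaping the dot) fixes it.
print(matches_fully(balanced, "1x5"))    # False
```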
RegEx Matching as a Sport: RegEx Golf
https://xkcd.com/1313/
Division of Labor:
RegEx for Syntax; Code for Semantics
• Getting “the right” regex can be quite a balancing act
• … making RegEx Golf a real sport
• Even if there is a (near) exact regex solution, it might be
really difficult to get right, debug, maintain, etc.
• Better: allow some false positives, then use code to check the semantics
→ keep regexes for what they do best: syntactic patterns
→ use some code to check the semantics of the match
• Usually much better in practice
• and sometimes the only option, even in theory
• Example: 02/29/2000 . Is that a valid (even if non-standard) date?
• if (year is not divisible by 4) then (it is a common year)
else if (year is not divisible by 100) then (it is a leap year)
else if (year is not divisible by 400) then (it is a common year)
else (it is a leap year)
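The division of labor can be sketched in Python as follows: a regex checks only the MM/DD/YYYY syntax, and ordinary code checks whether the date exists, using the leap year rule above. The names DATE_SYNTAX, is_leap_year and is_valid_date are hypothetical.

```python
import re

# Syntax only: two digits, two digits, four digits, separated by slashes.
DATE_SYNTAX = re.compile(r"(\d{2})/(\d{2})/(\d{4})")


def is_leap_year(year):
    """Leap year rule, written in the same order as the pseudocode."""
    if year % 4 != 0:
        return False
    elif year % 100 != 0:
        return True
    elif year % 400 != 0:
        return False
    else:
        return True


def is_valid_date(text):
    """Regex for the syntax, code for the semantics."""
    m = DATE_SYNTAX.fullmatch(text)
    if m is None:
        return False  # not even syntactically a date
    month, day, year = (int(g) for g in m.groups())
    if not 1 <= month <= 12:
        return False
    february = 29 if is_leap_year(year) else 28
    days_in_month = [31, february, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return 1 <= day <= days_in_month[month - 1]


print(is_valid_date("02/29/2000"))  # True: 2000 is divisible by 400
print(is_valid_date("02/29/1900"))  # False: 1900 is a common year
```

The regex accepts strings such as 13/45/2000 (a false positive on purpose); the range checks in code reject them.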
Character Classes
• . match any character except newline
• \w \d \s match a word, digit, whitespace character, respectively
• \W \D \S match a non-word, non-digit, non-whitespace character
• [abc] any of a, b, or c
• [^abc] match a character other than a, b, or c
• [a-g] match a character in the range a through g
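A short Python sketch of these character classes, using a made-up sample string (an assumption for illustration):

```python
import re

text = "Room 42, bag #7"

digits = re.findall(r"\d", text)          # every digit character
words = re.findall(r"\w+", text)          # runs of word characters
spaces = re.findall(r"\s", text)          # every whitespace character
non_word = re.findall(r"\W", text)        # everything that is not a word character
abc = re.findall(r"[abc]", text)          # any of a, b, or c
not_abc = re.findall(r"[^abc]", "abcde")  # anything other than a, b, or c
a_to_g = re.findall(r"[a-g]+", text)      # runs of characters in the range a-g

print(digits)   # ['4', '2', '7']
print(words)    # ['Room', '42', 'bag', '7']
print(not_abc)  # ['d', 'e']
print(a_to_g)   # ['bag']
```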
Anchors
• ^abc match abc at the start of the string
• abc$ match abc at the end of the string
• xyz\b match xyz at a word boundary
• xyz\B match xyz if not at a word boundary
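The anchors can be tried out in Python like this; the helper found and the sample strings are assumptions for illustration.

```python
import re


def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(found(r"^abc", "abcdef"))     # True: abc is at the start
print(found(r"^abc", "xabc"))       # False: abc is not at the start
print(found(r"abc$", "xyzabc"))     # True: abc is at the end
print(found(r"xyz\b", "xyz abc"))   # True: xyz is followed by a word boundary
print(found(r"xyz\b", "xyzabc"))    # False: xyz runs straight into more letters
print(found(r"xyz\B", "xyzabc"))    # True: no word boundary after xyz
print(found(r"xyz\B", "xyz abc"))   # False: there is a boundary after xyz
```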
Escaped Characters
• \. \* \\ escaped special characters
• \t \n \r match a tab, linefeed, carriage return
• \u00A9 unicode escaped ©
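A brief Python sketch of escaping, with made-up sample strings. Raw strings (r"...") keep the backslashes intact for the regex engine; re.escape is a standard library helper that adds the escapes for you.

```python
import re

any_char = re.findall(r".", "a.b")        # unescaped dot: any character
literal_dot = re.findall(r"\.", "a.b")    # escaped dot: only a literal "."
stars = re.findall(r"\*", "2*3*4")        # escaped star: a literal "*"
fields = re.split(r"\t", "x\ty\tz")       # split on tab characters
has_copyright = re.search(r"\u00A9", "© 2024") is not None

# re.escape escapes the special characters in arbitrary text.
escaped = re.escape("1.5*2")

print(any_char)       # ['a', '.', 'b']
print(literal_dot)    # ['.']
print(fields)         # ['x', 'y', 'z']
print(has_copyright)  # True
print(escaped)        # 1\.5\*2
```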
Groups
• ([0-9]+)\s*([a-z]+) two capture groups
• \1 backreference to group #1
• \2 \1 first group #2, then #1 (simple palindrome)
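Groups and backreferences in Python might look like the sketch below. The sample strings, the repeated-word pattern, and the four-letter palindrome pattern (\w)(\w)\2\1 are assumptions chosen to illustrate the bullets above.

```python
import re

# Two capture groups: a number, optional whitespace, a lowercase word.
m = re.search(r"([0-9]+)\s*([a-z]+)", "order: 42 apples")
number, unit = m.group(1), m.group(2)
print(number, unit)  # 42 apples

# \1 refers back to whatever group #1 matched: here, a repeated word.
repeated = re.search(r"([a-z]+) \1", "it is is fine").group(0)
print(repeated)  # is is

# Group #2 then group #1 in reverse order: a simple 4-letter palindrome.
palindrome = re.compile(r"(\w)(\w)\2\1")
print(palindrome.search("abba") is not None)  # True
print(palindrome.search("abab") is not None)  # False
```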
Using Groups for Transformations
• Groups and backreferences are often used in transformations
• (\d{2})/(\d{2})/(\d{4}) three capture groups for MM/DD/YYYY
• $3-$1-$2 insert captured results as: YYYY-MM-DD
• Use for example in Python, OpenRefine, …
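In Python, the same transformation can be sketched with re.sub. Note that Python's replacement syntax writes the group references as \3, \1, \2 rather than $3, $1, $2; the names MDY and to_iso are hypothetical.

```python
import re

# Three capture groups: month, day, year.
MDY = re.compile(r"(\d{2})/(\d{2})/(\d{4})")


def to_iso(text):
    """Rewrite every MM/DD/YYYY date in the text as YYYY-MM-DD."""
    return MDY.sub(r"\3-\1-\2", text)


print(to_iso("Due 02/29/2000, shipped 12/31/1999"))
# Due 2000-02-29, shipped 1999-12-31
```

This rewrites the syntax only; it does not check that the date is valid.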
Summary Regular Expressions
• Powerful language for pattern matching, extraction, transformation
• Roots in computer science theory (formal languages)
• Widely used in practice and may “save the day”
• Data extraction, Data transformation → Data quality assessment & cleaning
• … acquired taste... addictive ... special powers
https://xkcd.com/208/
