Unicode
Regular Expressions

  s/�/�/g
       Nick Patch
    23 January 2013
Unicode Refresher

Unicode attempts to support the characters of the world: a massive task!

It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes. We need to modernize our habits!

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
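
The bytes / code points distinction can be made concrete with a small JavaScript sketch. It is not part of the original deck; it assumes a runtime that provides `TextEncoder` (current browsers, Node.js 11+), and the helper name `measure` is made up for illustration.

```javascript
// Hypothetical helper: count one string as UTF-8 bytes, UTF-16 code units,
// and Unicode code points.
function measure(str) {
  return {
    bytes: new TextEncoder().encode(str).length, // length once encoded as UTF-8
    codeUnits: str.length,                       // UTF-16 code units
    codePoints: [...str].length,                 // the string iterator walks code points
  };
}

const single = '\u1F82';                    // ᾂ as one precomposed code point
const sequence = '\u03B1\u0313\u0300\u0345'; // α followed by three combining marks

console.log(measure(single));   // { bytes: 3, codeUnits: 1, codePoints: 1 }
console.log(measure(sequence)); // { bytes: 8, codeUnits: 4, codePoints: 4 }
```

Both strings display as the same single character, yet none of the three counts agree between them.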
Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ
Normalization

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис
Normalization

  ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡
 α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
             ≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡
 α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
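
A rough way to check these two equivalence classes is the built-in `String.prototype.normalize` (ES2015). This is a sketch, not the deck's own code: it assumes the comma-above mark on the slide is U+0313, and the names `groupA`, `groupB` and `nfcForms` are illustrative.

```javascript
// First class: psili (U+0313) comes before grave (U+0300).
const groupA = [
  '\u1F82',                   // ᾂ
  '\u1F02\u0345',             // ἂ + ypogegrammeni
  '\u1F80\u0300',             // ᾀ + grave
  '\u1FB3\u0313\u0300',       // ᾳ + psili + grave
  '\u03B1\u0313\u0300\u0345', // α + psili + grave + ypogegrammeni
  '\u03B1\u0313\u0345\u0300',
  '\u03B1\u0345\u0313\u0300',
];

// Second class: grave comes before psili.
const groupB = [
  '\u1FB2\u0313',             // ᾲ + psili
  '\u1F70\u0313\u0345',       // ὰ + psili + ypogegrammeni
  '\u03B1\u0300\u0313\u0345',
  '\u03B1\u0345\u0300\u0313',
];

// Collect the distinct NFC forms of a list of strings.
const nfcForms = (strings) => new Set(strings.map((s) => s.normalize('NFC')));

console.log(nfcForms(groupA).size); // 1
console.log(nfcForms(groupB).size); // 1
console.log(groupA[0].normalize('NFC') === groupB[0].normalize('NFC')); // false
```

The ypogegrammeni can move freely because its combining class differs from the other two marks; psili and grave share a class, so their relative order is significant.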
Perl Normalization

use Unicode::Normalize;

say $str;          # ᾀ◌̀
say NFD($str);     # α◌̓◌̀◌ͅ
say NFC($str);     # ᾂ
JavaScript Normalization

var unorm = require('unorm');

console.log($str);              // ᾀ◌̀
console.log(unorm.nfd($str));   // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str));   // ᾂ
PHP Normalization

echo $str;            # ᾀ◌̀

echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ
Grapheme Clusters

regex:         /^.$/

string 1:      ᾂ
              ⇧⇧

string 2:      α◌̓◌̀◌ͅ

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure (string 2 has three more code points after α): mixed results
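
The same mixed result can be reproduced in JavaScript. This is a sketch rather than code from the deck; it assumes an engine with the ES2015 `u` flag, and the variable names are illustrative.

```javascript
// With the u flag, . matches exactly one code point (line terminators excluded).
const oneCodePoint = /^.$/u;

const precomposed = '\u1F82';                  // ᾂ, one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // α plus three combining marks

console.log(oneCodePoint.test(precomposed)); // true
console.log(oneCodePoint.test(decomposed));  // false
console.log(/^.{4}$/u.test(decomposed));     // true: it is four code points long
```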
Grapheme Clusters

regex:         /^\X$/

string 1:      ᾂ
              ⇧⇧

string 2:      α◌̓◌̀◌ͅ
              ⇧      ⇧

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!
Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;
PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);
JavaScript
[This slide intentionally left blank.]
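
JavaScript regexes have no `\X`. On runtimes that ship `Intl.Segmenter` (an API that postdates this talk; Node.js 16+ and current browsers have it), a check similar to `/^\X$/` can be sketched outside the regex engine. The helper name `isSingleGrapheme` is hypothetical.

```javascript
// Hypothetical helper: true when str consists of exactly one extended
// grapheme cluster, which is roughly what /^\X$/ tests in Perl or PCRE.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const clusters = [...segmenter.segment(str)];
  return clusters.length === 1;
}

console.log(isSingleGrapheme('\u1F82'));                   // true  (ᾂ)
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true  (α + 3 marks)
console.log(isSingleGrapheme('\u03B1\u03B2'));             // false (αβ)
```

This only covers the whole-string test; it does not give a way to embed a grapheme-cluster match in the middle of a larger pattern.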
Match Any Character

two bytes (if byte mode):      е..и
code point (excl. \n):         е.и
code point (incl. \n):         е\p{Any}и
grapheme cluster (incl. \n):   е\Xи
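
For comparison, a JavaScript sketch of the two code-point rows, assuming an ES2018 engine (property escapes and the `s` flag). The sample string is made up, and the Cyrillic letters are written as escapes to keep them distinct from Latin look-alikes.

```javascript
// е (U+0435), a newline, и (U+0438)
const sample = '\u0435\n\u0438';

console.log(/\u0435.\u0438/u.test(sample));       // false: . skips line terminators
console.log(/\u0435\p{Any}\u0438/u.test(sample)); // true: \p{Any} matches any code point
console.log(/\u0435.\u0438/su.test(sample));      // true: the s (dotAll) flag lifts the restriction
```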
Match Any Letter

letter code point:       е\p{General_Category=Letter}и
letter code point:       е\pLи
Cyrillic code point:     е\p{Script=Cyrillic}и
Cyrillic code point:     е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи
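
A JavaScript sketch of the code-point rows, assuming an ES2018 engine. JavaScript is stricter than Perl here: the braces are mandatory and script values need an explicit `Script=` (or `sc=`), so `\pL` and `\p{Cyrillic}` are syntax errors under the `u` flag. The sample words are invented for illustration.

```javascript
const cyrillicMiddle = '\u0435\u043B\u0438'; // ели: е, Cyrillic л, и
const latinMiddle = '\u0435x\u0438';         // е, Latin x, и

const anyLetter = /\u0435\p{General_Category=Letter}\u0438/u;
const anyLetterShort = /\u0435\p{L}\u0438/u;
const cyrillicLetter = /\u0435\p{Script=Cyrillic}\u0438/u;

console.log(anyLetter.test(cyrillicMiddle));      // true
console.log(anyLetter.test(latinMiddle));         // true
console.log(anyLetterShort.test(latinMiddle));    // true
console.log(cyrillicLetter.test(cyrillicMiddle)); // true
console.log(cyrillicLetter.test(latinMiddle));    // false
```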
regex:         / о \p{Cyrillic} т /x

string 1:      който

string 2:      кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure (й is one code point, и◌̆ is two): mixed results
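
The failure is easy to reproduce in JavaScript. A sketch assuming an ES2018 engine; JavaScript has no `/x` flag, so the pattern is written without spaces, and the letters are spelled as escapes so the two forms of the word are unambiguous.

```javascript
// о, one Cyrillic code point, т
const pattern = /\u043E\p{Script=Cyrillic}\u0442/u;

const composedWord = '\u043A\u043E\u0439\u0442\u043E';         // който, й as U+0439
const decomposedWord = '\u043A\u043E\u0438\u0306\u0442\u043E'; // same word, и + combining breve

console.log(pattern.test(composedWord));   // true
console.log(pattern.test(decomposedWord)); // false: the breve sits between и and т
console.log(decomposedWord.normalize('NFC') === composedWord); // true
```

Normalizing input to NFC before matching hides the problem for this word, but only because й happens to have a precomposed form.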
regex:          / о (?= \p{Cyrillic} ) \X т /x

string 1:       който
                 ⇧

string 2:       кои◌̆то
                 ⇧

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
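
JavaScript cannot express this pattern directly because it lacks `\X`. A common stand-in, shown here as a sketch and explicitly a different technique from the lookahead above, is a base code point followed by any number of combining marks (`\p{M}*`). That approximation handles this example but not every grapheme cluster (Hangul jamo sequences and emoji joined with ZWJ, for instance, are not covered).

```javascript
// о, a Cyrillic code point plus any trailing combining marks, then т
const clusterPattern = /\u043E\p{Script=Cyrillic}\p{M}*\u0442/u;

const composed = '\u043A\u043E\u0439\u0442\u043E';         // който, й as U+0439
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // и + combining breve

console.log(clusterPattern.test(composed));   // true
console.log(clusterPattern.test(decomposed)); // true
console.log(decomposed.match(clusterPattern)[0].length); // 4: о, и, breve, т
```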
Character Literals

[يی]

(?:ي|ی)

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
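
The escaped forms carry over to JavaScript with a different spelling: `\uXXXX` always works, and `\u{...}` works under the `u` flag. As far as I know there is no JavaScript counterpart to `\N{...}`, so this sketch covers only the numeric escapes; the variable names are illustrative.

```javascript
const yehClass = /^[\u{064A}\u{06CC}]$/u;     // \u{...} requires the u flag
const yehAlternation = /^(?:\u064A|\u06CC)$/; // \uXXXX works with or without it

console.log(yehClass.test('\u064A'));       // true:  ARABIC LETTER YEH
console.log(yehClass.test('\u06CC'));       // true:  ARABIC LETTER FARSI YEH
console.log(yehAlternation.test('\u06CC')); // true
console.log(yehClass.test('\u0649'));       // false: ARABIC LETTER ALEF MAKSURA
```

Escapes keep the pattern readable in editors that reorder right-to-left text or render the two letters identically.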
Properties

         \p{Script=Latin}

           Name: Script
           Value: Latin


   Match any code point with the
value “Latin” for the Script property.
Properties

         \P{Script=Latin}

           Name: Script
          Value: not Latin

           Negated form:
 Match any code point without the
value “Latin” for the Script property.
Properties

           \p{Latin}

     Name: Script (implicit)
        Value: Latin


The Script and General Category
properties don't require the name
because they're so common and
    their values don't conflict.
Properties

     \p{General_Category=Letter}

        Name: General Category
            Value: Letter


   Match any code point with the value
“Letter” for the General Category property.
Properties

          \p{gc=Letter}

   Name: General Category (gc)
          Value: Letter


Property names may be abbreviated.
Properties

            \p{gc=L}

 Name: General Category (gc)
      Value: Letter (L)


The General Category property is
so commonly used that its values
 all have standard abbreviations.
Properties

                   \p{L}

    Name: General Category (implicit)
           Value: Letter (L)


And the General Category values may even
be used on their own, like the Script values.
 These two properties have distinct values.
Properties

               \pL

Name: General Category (implicit)
       Value: Letter (L)


Single-character General Category
 values don't require curly braces.
Properties

               \PL

Name: General Category (implicit)
      Value: not Letter (L)


      Don't forget negation!
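
Most of these spellings can be tried in JavaScript as well. A sketch assuming an ES2018 engine: the long and abbreviated names (`Script`/`sc`, `General_Category`/`gc`) and the bare General Category value are accepted, while the brace-less `\pL` and the bare script value `\p{Latin}` are not. The `forms` object is only an illustrative container.

```javascript
// Each pattern tests a single code point against one property spelling.
const forms = {
  script: /^\p{Script=Latin}$/u,
  scriptShort: /^\p{sc=Latin}$/u,
  notScript: /^\P{Script=Latin}$/u,
  category: /^\p{General_Category=Letter}$/u,
  categoryShort: /^\p{gc=L}$/u,
  categoryBare: /^\p{L}$/u,
  notCategory: /^\P{L}$/u,
};

const cyrillicYa = '\u044F'; // я, a Cyrillic letter

console.log(forms.script.test('a'));               // true
console.log(forms.scriptShort.test(cyrillicYa));   // false
console.log(forms.notScript.test(cyrillicYa));     // true
console.log(forms.categoryShort.test(cyrillicYa)); // true
console.log(forms.categoryBare.test('1'));         // false: a digit is not a letter
console.log(forms.notCategory.test('1'));          // true
```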
s/�/�/g
