Unicode Regular Expressions

Nick Patch
23 January 2013

Unicode Refresher

Unicode attempts to support the
characters of the world — a massive task!

Unicode Refresher

It's hard to attach a single meaning to the
word “character” but most folks think of
characters as the smallest stand-alone
components of a writing system.

Unicode Refresher

In Unicode, this sense of characters is
represented by one or more code points,
which are each stored in one or more bytes.

Unicode Refresher

However, programmers and
programming languages tend to think of
characters as individual code points,
or worse, individual bytes.

We need to modernize our habits!
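
A minimal JavaScript sketch of the gap between bytes and code points, using the decomposed Greek letter from the Normalization slides as sample data. The variable names are illustrative, and `TextEncoder` is assumed to be available as a global (as it is in Node.js and browsers).

```javascript
// One user-perceived character: alpha + psili + varia + ypogegrammeni.
const decomposed = '\u03B1\u0313\u0300\u0345';

// Bytes: TextEncoder always produces UTF-8.
const byteCount = new TextEncoder().encode(decomposed).length;

// Code points: string iteration walks code points, not UTF-16 code units.
const codePointCount = [...decomposed].length;

console.log(byteCount);      // 8
console.log(codePointCount); // 4
```

A reader still sees one character, which is the count neither number gives.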

Unicode Refresher

Unicode is not just a big set of characters.
It also defines standard properties for
each character and standard algorithms
for operations such as collation,
normalization, and segmentation.

Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ

Normalization

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис

Normalization

ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡
α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡
α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
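
The two groups above can be checked with the built-in `String.prototype.normalize`. This is only a sketch: the arrays hold a few of the spellings from the slide, written as escapes, and `nfcForms` is a hypothetical helper name.

```javascript
// First group: psili (U+0313) comes before varia (U+0300).
const psiliFirst = [
  '\u1F82',                   // precomposed
  '\u1F02\u0345',             // alpha with psili and varia + ypogegrammeni
  '\u1F80\u0300',             // alpha with psili and ypogegrammeni + varia
  '\u03B1\u0313\u0300\u0345', // fully decomposed
  '\u03B1\u0345\u0313\u0300', // ypogegrammeni typed first
];

// Second group: varia comes before psili, so it is a different character.
const variaFirst = [
  '\u1FB2\u0313',             // alpha with varia and ypogegrammeni + psili
  '\u1F70\u0313\u0345',       // alpha with varia + psili + ypogegrammeni
  '\u03B1\u0300\u0313\u0345', // fully decomposed
];

// Hypothetical helper: the set of distinct NFC forms among the inputs.
const nfcForms = (strings) => new Set(strings.map((s) => s.normalize('NFC')));

const first = nfcForms(psiliFirst);
const second = nfcForms(variaFirst);

console.log(first.size, second.size);       // 1 1
console.log([...first][0] === [...second][0]); // false
```

Each group collapses to a single NFC form, and the two forms differ, because marks with the same combining class are never reordered.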

Perl Normalization

use Unicode::Normalize;

say $str; # ᾀ◌̀
say NFD($str); # α◌̓◌̀◌ͅ
say NFC($str); # ᾂ

JavaScript Normalization

var unorm = require('unorm');

console.log($str); // ᾀ◌̀
console.log(unorm.nfd($str)); // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str)); // ᾂ

PHP Normalization

echo $str; # ᾀ◌̀

echo Normalizer::normalize($str,
    Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str,
    Normalizer::FORM_C); # ᾂ

Grapheme Clusters

regex: /^.$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

1. anchor at beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. string 1 succeeds but string 2 fails: mixed results
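
The same mixed result can be reproduced in JavaScript. A small sketch, assuming the `u` flag so that `.` matches a whole code point; the variable names are illustrative.

```javascript
// /^.$/u accepts exactly one code point (other than a line terminator).
const anyCodePoint = /^.$/u;

const precomposed = '\u1F82';                  // string 1: one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // string 2: four code points

const results = [precomposed, decomposed].map((s) => anyCodePoint.test(s));

console.log(results); // [ true, false ]

// The two strings are canonically equivalent all the same.
console.log(precomposed === decomposed.normalize('NFC')); // true
```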

Grapheme Clusters

regex: /^\X$/

string 1: ᾂ

string 2: α◌̓◌̀◌ͅ

1. anchor at beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!

Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;

PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);

JavaScript
[This slide intentionally left blank.]
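
JavaScript regular expressions have no `\X` escape. One possible workaround, sketched here under the assumption that the runtime ships `Intl.Segmenter` (an API that postdates this talk), is to segment the string into grapheme clusters outside the regex. `isSingleGrapheme` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: true when the string is exactly one grapheme cluster,
// which is roughly what /^\X$/ checks in Perl or PCRE.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const clusters = Array.from(segmenter.segment(str), (s) => s.segment);
  return clusters.length === 1;
}

console.log(isSingleGrapheme('\u1F82'));                   // true
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGrapheme('ab'));                       // false
```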

Match Any Character

two bytes (if byte mode): е..и
code point (excl. \n): е.и
code point (incl. \n): е\p{Any}и
grapheme cluster (incl. \n): е\Xи

Match Any Letter

letter code point: е\p{General_Category=Letter}и
letter code point: е\pLи
Cyrillic code point: е\p{Script=Cyrillic}и
Cyrillic code point: е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи

regex: / о \p{Cyrillic} т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. string 1 succeeds but string 2 fails: mixed results

regex: / о (?= \p{Cyrillic} ) \X т /x

string 1: който

string 2: кои◌̆то

1. match letter о
2. positive lookahead for Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
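
A JavaScript sketch of the same two attempts. Since JavaScript lacks `\X`, the second regex swaps in a Cyrillic code point followed by `\p{M}*` (any combining marks). That is a common approximation of a grapheme cluster, not the full `\X` definition. The strings are the two from the slide, written as escapes.

```javascript
// о + one Cyrillic code point + т
const codePointOnly = /\u043E\p{Script=Cyrillic}\u0442/u;

// о + one Cyrillic code point plus any trailing combining marks + т
// (stand-in for the lookahead-plus-\X pattern; an approximation only)
const withMarks = /\u043E\p{Script=Cyrillic}\p{M}*\u0442/u;

const precomposed = '\u043A\u043E\u0439\u0442\u043E';      // string 1
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // string 2

console.log(codePointOnly.test(precomposed), codePointOnly.test(decomposed)); // true false
console.log(withMarks.test(precomposed), withMarks.test(decomposed));         // true true
```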

Character Literals

[يی]

(?:ي|ی)

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
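
For comparison, a sketch of the escaped form in JavaScript, which spells code point escapes as `\uXXXX` or, with the `u` flag, `\u{…}`. To my knowledge standard JavaScript regular expressions have no `\N{…}` named escape, so the names appear only as comments.

```javascript
// U+064A ARABIC LETTER YEH, U+06CC ARABIC LETTER FARSI YEH
const yehFourDigit = /^[\u064A\u06CC]$/u;
const yehBraced = /^[\u{64A}\u{6CC}]$/u;

const samples = ['\u064A', '\u06CC', '\u0649']; // last one: ALEF MAKSURA

console.log(samples.map((s) => yehFourDigit.test(s))); // [ true, true, false ]
console.log(samples.map((s) => yehBraced.test(s)));    // [ true, true, false ]
```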

Properties

\p{Script=Latin}

Name: Script
Value: Latin

Match any code point with the
value “Latin” for the Script property.

Properties

\P{Script=Latin}

Name: Script
Value: not Latin

Negated form:
Match any code point without the
value “Latin” for the Script property.

Properties

\p{Latin}

Name: Script (implicit)
Value: Latin

The Script and General Category
properties don't require the name
because they're so common and
their values don't conflict.

Properties

\p{General_Category=Letter}

Name: General Category
Value: Letter

Match any code point with the value
“Letter” for the General Category property.

Properties

\p{gc=Letter}

Name: General Category (gc)
Value: Letter

Property names may be abbreviated.

Properties

\p{gc=L}

Name: General Category (gc)
Value: Letter (L)

The General Category property is
so commonly used that its values
all have standard abbreviations.

Properties

\p{L}

Name: General Category (implicit)
Value: Letter (L)

And the General Category values may even
be used on their own, like the Script values.
These two properties have distinct values.

Properties

\pL

Value: Letter (L)

Single-character General Category
values don't require curly braces.

Properties

\PL

Value: not Letter (L)

Don't forget negation!
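
A sketch of how these forms carry over to JavaScript, which accepts only part of the syntax. The assumptions illustrated: property escapes need the `u` flag, Script values need the `Script=` prefix, and the curly braces are always required. `compiles` is a hypothetical helper.

```javascript
const latin = /^\p{Script=Latin}$/u;
const notLatin = /^\P{Script=Latin}$/u;
const letterLong = /^\p{General_Category=Letter}$/u;
const letterAbbrev = /^\p{gc=L}$/u;
const letterBare = /^\p{L}$/u;
const notLetter = /^\P{L}$/u;

// Hypothetical helper: does this pattern compile with the u flag?
function compiles(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(latin.test('a'), latin.test('\u0434'));       // true false
console.log(notLatin.test('\u0434'));                     // true
console.log(letterLong.test('\u0434'), letterAbbrev.test('\u0434'),
  letterBare.test('\u0434'));                             // true true true
console.log(notLetter.test('5'));                         // true
console.log(compiles('\\p{Latin}'), compiles('\\pL'));    // false false
```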
