Fundamental Unicode in Perl

Fundamental
Unicode
Nick Patch

“The smallest component of
written language that has semantic value;
refers to the abstract meaning and/or shape,
rather than a specific shape.”
—The Unicode Consortium
What Is a Character?

Glyphs are visual
representations of characters.
What Is a Glyph?

Fonts are collections of glyphs.
There may be many different glyphs
for the same character.
This talk is not about fonts or glyphs.
What Is a Glyph?

. / ?
「 « » 」
Punctuation

CARRIAGE RETURN
NO-BREAK SPACE
COMBINING GRAPHEME JOINER
RIGHT-TO-LEFT MARK
Control Characters

Many people use “character set”
to mean one or more of these:
Character Code
Character Encoding
Character Repertoire
Which makes for a confusing situation.
Character Set

A defined mapping of
characters to numbers.
A ⇒ 41
B ⇒ 42
C ⇒ 43
Each value in a character code
is called a code point.
Character Code

An algorithm to convert
code points to a digital form for ease
of transmitting or storing data.
41 (A) ⇒ 1000001
42 (B) ⇒ 1000010
43 (C) ⇒ 1000011
Character Encoding
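
The two mappings above can be reproduced with Perl's built-in ord and chr. This is a minimal sketch; the variable names (@rows, $round_trip) are only for illustration and are not from the talk.

```perl
use strict;
use warnings;

# Character code: map each character to its code point (shown in hex).
# Character encoding: the bit pattern used to store that code point
# (seven bits here, as in ASCII).
my @rows;
for my $char (qw( A B C )) {
    my $code_point = ord $char;
    push @rows, sprintf '%s => %X => %07b', $char, $code_point, $code_point;
}
print "$_\n" for @rows;

# chr is the inverse of ord: code point back to character
my $round_trip = chr 0x41;
print "$round_trip\n";
```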

A character repertoire is a
collection of distinct characters.
Character codes, keyboards, and
written languages all have
well-defined character repertoires.
Character Repertoire

ASCII
character code: 128 code points
character encoding: 7 bits each
Character Codes & Encodings

Latin 1 (ISO-8859-1)
character encoding: 8 bits (1 byte) each

Unicode (character code)
1,114,112 code points (1,112,064 excluding surrogates; 110,000+ defined)
character encodings:
UTF-8 — 1 to 4 bytes each
UTF-16 — 2 or 4 bytes each
UTF-32 — 4 bytes each
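
To make the byte counts concrete, the sketch below encodes three sample code points with the core Encode module and records how many bytes each encoding needs. The big-endian variants (UTF-16BE, UTF-32BE) are an assumption made here so that no byte order mark is added to the count; the %bytes hash is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Sample code points: one ASCII letter, one from the Basic Multilingual
# Plane, and one from a supplementary plane
my @code_points = (0x0041, 0x0ED3, 0x1F4A9);

my %bytes;   # "ENCODING/U+XXXX" => number of bytes
for my $cp (@code_points) {
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        my $octets = encode($encoding, chr($cp));
        $bytes{ sprintf '%s/U+%04X', $encoding, $cp } = length $octets;
        printf "U+%04X in %-8s: %d bytes\n", $cp, $encoding, length $octets;
    }
}
```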

A
U+0041
LATIN CAPITAL LETTER A
໓
U+0ED3
LAO DIGIT THREE
💩
U+1F4A9
PILE OF POO
Code Points

Some code points have
precomposed diacritics.
ȫ
U+022B
LATIN SMALL LETTER O
WITH DIAERESIS AND MACRON
Code Points

Other characters must be composed
from multiple code points
using “combining characters.”
n̈
U+006E
LATIN SMALL LETTER N
U+0308
COMBINING DIAERESIS
Code Points
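
The difference between precomposed and composed characters can be inspected with the core Unicode::Normalize and charnames modules. A minimal sketch follows: NFD splits a precomposed code point into its parts, while NFC cannot merge n and COMBINING DIAERESIS because no precomposed code point exists for that pair. Variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw( NFC NFD );

# U+022B has a canonical decomposition into three code points
my $precomposed = "\x{022B}";
my @parts = map { sprintf 'U+%04X', ord($_) } split //, NFD($precomposed);
print "@parts\n";

# n + COMBINING DIAERESIS stays as two code points under NFC
my $composed = NFC("n\x{0308}");
print length($composed), "\n";

# Look up the official name of a code point
print charnames::viacode(0x0308), "\n";
```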

Any series of code points that are composed
into a single user-perceived character.
Informally known as “graphemes.”
A (U+0041)
n̥̈ (U+006E U+0308 U+0325)
CRLF (U+000D U+000A)
Grapheme Clusters
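
The three examples on this slide can be checked with the \X regex escape, which matches one extended grapheme cluster at a time. This sketch assumes Perl 5.12 or later, where \X follows the extended grapheme cluster rules; @cluster_counts is a hypothetical name.

```perl
use strict;
use warnings;

# One, three, and two code points respectively
my @examples = ("A", "n\x{0308}\x{0325}", "\r\n");

my @cluster_counts;
for my $str (@examples) {
    my @clusters = $str =~ /(\X)/g;   # one list element per grapheme cluster
    push @cluster_counts, scalar @clusters;
    printf "%d code point(s), %d grapheme cluster(s)\n",
        length($str), scalar @clusters;
}
```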

🐪
U+1F42A
DROMEDARY CAMEL
Time for some…

# ¡jalapeño!
say "\x{A1}jalape\x{F1}o!";
String constants ... TIMTOWTDI

use v5.12;
say "\N{U+00A1}jalape\N{U+00F1}o!";

use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape"
  . "\N{LATIN SMALL LETTER N WITH TILDE}o!";

use utf8;
say '¡jalapeño!';
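
As a quick check that these spellings are interchangeable, the sketch below builds the same string four ways and compares them. It assumes the source file is saved as UTF-8 (required by use utf8); the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                    # string literals in this file are UTF-8
use charnames qw( :full );

my $hex_escapes = "\x{A1}jalape\x{F1}o!";
my $code_points = "\N{U+00A1}jalape\N{U+00F1}o!";
my $by_name     = "\N{INVERTED EXCLAMATION MARK}jalape"
                . "\N{LATIN SMALL LETTER N WITH TILDE}o!";
my $literal     = '¡jalapeño!';

# All four are the same ten-character string
my $all_equal = ( $hex_escapes eq $code_points
               && $code_points eq $by_name
               && $by_name     eq $literal ) ? 1 : 0;
print "all equal: $all_equal, length: ", length($literal), "\n";
```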

=encoding UTF-8
=head1 ¡jalapeño!
String constants ... POD

UTF-8 encoded input
⇩
decode
⇩
Perl Unicode string
⇩
encode
⇩
UTF-8 encoded output
I/O

open my $fh, '<:encoding(UTF-8)', $filename;
open my $fh, '>:encoding(UTF-8)', $filename;
I/O

open my $fh, '>:encoding(UTF-8)', $filename;
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O

use open qw( :encoding(UTF-8) );
open my $fh, '<', $filename;
I/O

# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
I/O

# CPAN module to enable everything UTF-8
use utf8::all;
I/O
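
A small round trip through a file shows the layers at work: characters are encoded on the way out and decoded on the way in. This is only a sketch; the temporary file from File::Temp and the names $out, $in, and $read_back are assumptions for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

my $text = "\x{A1}jalape\x{F1}o!";   # 10 characters

# Hypothetical scratch file, removed when the program exits
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Encode on the way out
open my $out, '>:encoding(UTF-8)', $filename
    or die "Cannot write $filename: $!";
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Decode on the way in
open my $in, '<:encoding(UTF-8)', $filename
    or die "Cannot read $filename: $!";
my $read_back = do { local $/; <$in> };
close $in;

my $size = -s $filename;
printf "%d bytes on disk, %d characters read back\n", $size, length $read_back;
```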

use Encode;
my $internal = decode('UTF-8', $input);
my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding
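
A runnable version of the round trip, as a sketch: the sample bytes are an assumption chosen to match the earlier string constant. The second half shows one way to make malformed input fail loudly, using Encode's FB_CROAK check value (with LEAVE_SRC so the input is not modified).

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 bytes for the string used in the earlier slides: 12 bytes
my $input    = "\xC2\xA1jalape\xC3\xB1o!";
my $internal = decode('UTF-8', $input);      # 10 characters
my $output   = encode('UTF-8', $internal);   # 12 bytes again

printf "%d bytes in, %d characters, %d bytes out\n",
    length($input), length($internal), length($output);

# With a CHECK argument, malformed input dies instead of being
# silently replaced with U+FFFD
my $malformed = "jalape\xF1o";   # a Latin-1 byte, not valid UTF-8
my $check     = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $ok        = eval { decode('UTF-8', $malformed, $check); 1 } ? 1 : 0;
print $ok ? "decoded\n" : "malformed input rejected\n";
```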

Let’s use this grapheme cluster as the
string in our next example:
ю́
U+044E
CYRILLIC SMALL LETTER YU
U+0301
COMBINING ACUTE ACCENT
String Length

# UTF-8 encoded: D1 8E CC 81
say length $encoded_grapheme; # 4
String Length

use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded);
say length $grapheme; # 2
String Length

# grapheme clusters: \X matches one at a time
my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length

# sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
String Length

# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
String Length

# a little better
my $length = 0;
$length++ while $str =~ /\X/g;
say $length;
String Length

# an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
String Length

# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length

Standard ordering of strings
for comparison and sorting.
sort @names
$a cmp $b
$x gt $y
$foo eq $bar
Collation

Perl provides a collation algorithm
based on code points.
Collation

@words = qw( Äpfel durian Xerxes )
sort @words
# Xerxes durian Äpfel
Collation

sort { lc $a cmp lc $b } @words
# durian Xerxes Äpfel
Collation

Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
Collation

use Unicode::Collate;
my $collator = Unicode::Collate->new;
$collator->sort(@words);
# Äpfel durian Xerxes
Collation

$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
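
The two orderings can be compared side by side. This is a minimal sketch using the core Unicode::Collate module with its default settings; the escape \x{C4} stands in for Ä so the example does not depend on the source file's encoding, and the array names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("\x{C4}pfel", 'durian', 'Xerxes');   # \x{C4} is A with diaeresis

my @by_code_point = sort @words;                  # Perl's built-in ordering
my $collator      = Unicode::Collate->new;
my @by_uca        = $collator->sort(@words);      # Unicode Collation Algorithm

binmode STDOUT, ':encoding(UTF-8)';
print "code points: @by_code_point\n";
print "UCA:         @by_uca\n";

# The comparison methods mirror cmp, lt, gt, and eq
my $order = $collator->cmp('durian', 'Xerxes');   # negative: durian sorts first
print "cmp: $order\n";
```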

UCA also provides locale-specific collations
for different languages.
Collation

use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
    locale => 'pl', # Polish
);
Collation
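
To see what a locale changes, the sketch below sorts two made-up strings with the default collator and with the Polish one. In Polish, ą is a separate letter that follows a, so the order flips. The sample strings are an assumption chosen for illustration; \x{105} is ą.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

my $default = Unicode::Collate->new;
my $polish  = Unicode::Collate::Locale->new(locale => 'pl');

# "az" versus a-ogonek followed by "b"
my @strings = ('az', "\x{105}b");

my @default_order = $default->sort(@strings);
my @polish_order  = $polish->sort(@strings);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default_order\n";
print "Polish:  @polish_order\n";
```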

Unicode has 4 normalization forms.
The most important are:
NFD: Normalization Form
Canonical Decomposition
NFC: Normalization Form
Canonical Composition
Normalization

use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization

UTF-8 encoded input
⇩
decode
⇩
NFD
⇩
Perl Unicode string
⇩
NFC
⇩
encode
⇩
UTF-8 encoded output
Normalization
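
The whole pipeline fits in a few lines. A minimal sketch, assuming a small sample input with a precomposed é; the step in the middle (counting base letters) is only a stand-in for whatever processing a real program would do on the decomposed string.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Unicode::Normalize qw( NFC NFD );

my $input = "caf\xC3\xA9";                   # UTF-8 bytes, precomposed U+00E9

# decode, then decompose: "cafe" followed by U+0301
my $str = NFD( decode('UTF-8', $input) );

# Work on the decomposed string: combining marks are not letters,
# so only the four base letters match
my $base_letters = () = $str =~ /\p{Letter}/g;

# recompose, then encode
my $output = encode( 'UTF-8', NFC($str) );

printf "%d code points after NFD, %d base letters, %d bytes out\n",
    length($str), $base_letters, length($output);
```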

By default, unfortunately, strings and regexes are
not guaranteed to use Unicode semantics.
This is known as “The Unicode Bug.”
There are a few ways to fix this:
Unicode Semantics

utf8::upgrade($str);
Unicode Semantics

use v5.12; # implicitly enables the unicode_strings feature
Unicode Semantics

# or enable only that feature
use feature 'unicode_strings';
Unicode Semantics
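
The bug is easiest to see with a character in the range U+0080 to U+00FF. In this sketch, which deliberately enables neither use v5.12 nor the unicode_strings feature, the same character matches \w only after utf8::upgrade changes its internal storage. Variable names are illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" or "use feature 'unicode_strings'" here

my $native   = "\xE9";       # U+00E9, stored as a single native byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, stored as UTF-8 internally

my $same       = $native   eq $upgraded ? 1 : 0;
my $native_w   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_w = $upgraded =~ /\w/ ? 1 : 0;

print "equal: $same\n";
print "native matches \\w: $native_w\n";
print "upgraded matches \\w: $upgraded_w\n";
```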

You’ll see the “utf8” encoding
used frequently in Perl.
“utf8” follows the UTF-8 standard very
loosely and allows many errors
in your data without warnings.
By default, use “UTF-8” instead.
UTF-8 vs. utf8 vs. :utf8

# utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);

# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
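
One way to see that the two spellings really are different encodings is to ask Encode which encoding object each name resolves to. The second half of this sketch tries to encode a code point just past the Unicode range with each one; the choice of U+110000 as the test value is an assumption made here, and the result is printed rather than asserted because details may vary between Encode versions.

```perl
use strict;
use warnings;
use Encode qw( encode find_encoding );

# The two spellings resolve to different encoding objects
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";

# A code point just past the Unicode range (the maximum is U+10FFFF)
my $beyond = "\x{110000}";
my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();

for my $encoding (qw( utf8 UTF-8 )) {
    no warnings;   # keep the demonstration quiet
    my $ok = eval { encode($encoding, $beyond, $check); 1 };
    printf "%-5s %s\n", $encoding, $ok ? 'accepted it' : 'rejected it';
}
```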

Slides will be posted to:
@nickpatch
Questions?
