Fundamental Unicode in Perl
Nick Patch
“The smallest component of
written language that has semantic value;
refers to the abstract meaning and/or shape,
rather than a specific shape.”
—The Unicode Consortium
What Is a Character?
Glyphs are visual
representations of characters.
Fonts are collections of glyphs.
There may be many different glyphs
for the same character.
This talk is not about fonts or glyphs.
What Is a Glyph?
a b c
π ‫ث‬ й
Letters
1 2 3
໓
๓ ३
Numbers
. / ?
「 « » 」
Punctuation
™ © ≠
☺ ☠
Symbols
CARRIAGE RETURN
NO-BREAK SPACE
COMBINING GRAPHEME JOINER
RIGHT-TO-LEFT MARK
Control Characters
Many people use “character set”
to mean one or more of these:
Character Code
Character Encoding
Character Repertoire
Which makes for a confusing situation.
Character Set
A defined mapping of characters
to numbers (shown here in hexadecimal).
A ⇒ 41
B ⇒ 42
C ⇒ 43
Each value in a character code
is called a code point.
Character Code
An algorithm to convert
code points to a digital form for ease
of transmitting or storing data.
41 (A) ⇒ 1000001
42 (B) ⇒ 1000010
43 (C) ⇒ 1000011
Character Encoding
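The two steps above can be tried directly in Perl. This is a minimal sketch using only the built-ins ord and sprintf; the @rows variable is just an illustrative name, and the 7-bit output assumes ASCII as the encoding.

```perl
use strict;
use warnings;

# Character code: map each character to its code point (a number).
# Character encoding: turn that number into bits, here 7-bit ASCII.
my @rows;
for my $char (qw( A B C )) {
    my $code_point = ord $char;    # 'A' is 65, which is 41 in hexadecimal
    push @rows, sprintf '%s => %02X => %07b', $char, $code_point, $code_point;
}

print "$_\n" for @rows;
# A => 41 => 1000001
# B => 42 => 1000010
# C => 43 => 1000011
```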
A character repertoire is a
collection of distinct characters.
Character codes, keyboards, and
written languages all have
well-defined character repertoires.
Character Repertoire
ASCII
character code: 128 code points
character encoding: 7 bits each
Latin 1 (ISO-8859-1)
character code: 256 code points
character encoding: 8 bits (1 byte) each
Character Codes & Encodings
Unicode (character code)
1,114,112 code points (1,112,064 excluding surrogates; 110,000+ assigned)
character encodings:
UTF-8: 1 to 4 bytes each
UTF-16: 2 or 4 bytes each
UTF-32: 4 bytes each
Character Codes & Encodings
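To see those byte counts for concrete code points, here is a small sketch using the core Encode module. It assumes the big-endian variants (UTF-16BE, UTF-32BE) so that no byte order mark is added to the count; the %bytes hash is only an illustrative name.

```perl
use strict;
use warnings;
use Encode qw( encode );

my @encodings = qw( UTF-8 UTF-16BE UTF-32BE );
my %bytes;

# U+0041 LATIN CAPITAL LETTER A, U+0ED3 LAO DIGIT THREE, U+1F4A9 PILE OF POO
for my $cp (0x41, 0x0ED3, 0x1F4A9) {
    my $char = chr $cp;
    $bytes{$cp}{$_} = length encode($_, $char) for @encodings;
    printf "U+%04X: UTF-8 %d, UTF-16 %d, UTF-32 %d bytes\n",
        $cp, @{ $bytes{$cp} }{@encodings};
}
# U+0041: UTF-8 1, UTF-16 2, UTF-32 4 bytes
# U+0ED3: UTF-8 3, UTF-16 2, UTF-32 4 bytes
# U+1F4A9: UTF-8 4, UTF-16 4, UTF-32 4 bytes
```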
A
U+0041
LATIN CAPITAL LETTER A
໓
U+0ED3
LAO DIGIT THREE
💩
U+1F4A9
PILE OF POO
Code Points
Some code points have
precomposed diacritics.
ȫ
U+022B
LATIN SMALL LETTER O
WITH DIAERESIS AND MACRON
Code Points
Other characters must be composed
from multiple code points
using “combining characters.”
n̈
U+006E
LATIN SMALL LETTER N
U+0308
COMBINING DIAERESIS
Code Points
Any series of code points that are composed
into a single user-perceived character.
Informally known as “graphemes.”
A (U+0041)
n̥̈ (U+006E U+0308 U+0325)
CRLF (U+000D U+000A)
Grapheme Clusters
🐪
U+1F42A
DROMEDARY CAMEL
Time for some…
# ¡jalapeño!
say "\x{A1}jalape\x{F1}o!";
use v5.12;
say "\N{U+00A1}jalape\N{U+00F1}o!";
String constants ... TIMTOWTDI
use charnames qw( :full );
say "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!";
use utf8;
say '¡jalapeño!';
String constants ... TIMTOWTDI
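As a quick sanity check, the four spellings can be compared against each other. This sketch assumes the source file itself is saved as UTF-8 (required by use utf8) and that ñ is typed as the single precomposed code point U+00F1; @spellings and $all_equal are illustrative names.

```perl
use v5.12;
use warnings;
use utf8;                     # this source file is assumed to be UTF-8
use charnames qw( :full );    # needed for \N{NAME} before Perl 5.16

my @spellings = (
    "\x{A1}jalape\x{F1}o!",
    "\N{U+00A1}jalape\N{U+00F1}o!",
    "\N{INVERTED EXCLAMATION MARK}jalape\N{LATIN SMALL LETTER N WITH TILDE}o!",
    '¡jalapeño!',
);

# Every spelling should produce the same 10-character string.
my $all_equal = !grep { $_ ne $spellings[0] } @spellings;
say $all_equal ? 'all four spellings match' : 'mismatch';
say length $spellings[0];    # 10
```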
=encoding UTF-8
=head1 ¡jalapeño!
String constants ... POD
UTF-8 encoded input
⇩
decode
⇩
Perl Unicode string
⇩
encode
⇩
UTF-8 encoded output
I/O
open my $fh, '<:encoding(UTF-8)', $filename;
open my $fh, '>:encoding(UTF-8)', $filename;
binmode $fh, ':encoding(UTF-8)';
binmode STDIN, ':encoding(UTF-8)';
I/O
use open qw( :encoding(UTF-8) );
open my $fh, '<', $filename;
# :std for STDIN, STDOUT, STDERR
use open qw( :encoding(UTF-8) :std );
# CPAN module to enable everything UTF-8
use utf8::all;
I/O
use Encode;
my $internal = decode('UTF-8', $input);
my $output = encode('UTF-8', $internal);
Explicit Encoding & Decoding
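A round trip through the decode and encode steps might look like the following sketch. The input bytes are hard-coded here as a stand-in for data read from a raw filehandle, and uc is just an example of work done on the decoded string.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 bytes for "¡jalapeño!", standing in for raw input
my $input = "\xC2\xA1jalape\xC3\xB1o!";

my $internal = decode('UTF-8', $input);       # bytes -> characters
my $shouted  = uc $internal;                  # operate on characters
my $output   = encode('UTF-8', $shouted);     # characters -> bytes

printf "input:    %d bytes\n",      length $input;       # 12
printf "internal: %d characters\n", length $internal;    # 10
printf "output:   %d bytes\n",      length $output;      # 12
```

Because the uppercase step runs on the decoded string, ñ (U+00F1) becomes Ñ (U+00D1) before it is written back out as bytes.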
Let’s use this grapheme cluster as the
string in our next example:
ю́
U+044E
CYRILLIC SMALL LETTER YU
U+0301
COMBINING ACUTE ACCENT
String Length
# UTF-8 encoded: D1 8E CC 81
say length $encoded_grapheme; # 4
use Encode;
# Unicode string: 044E 0301
my $grapheme = decode('UTF-8', $encoded_grapheme);
say length $grapheme; # 2
my $length = () = $grapheme =~ /\X/g;
say $length; # 1
String Length
# sort of complex for a simple length, eh?
my $length = () = $str =~ /\X/g;
say $length;
# and tricky depending on the context
say scalar( () = $str =~ /\X/g );
# a little better
my $count = 0;
$count++ while $str =~ /\X/g;
say $count;
String Length
# an alternative approach
use Unicode::GCString;
say Unicode::GCString->new($str)->length;
# and yet another (Warning: I wrote it!)
use Unicode::Util qw( grapheme_length );
say grapheme_length($str);
String Length
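For readers without either CPAN module installed, the \X idiom can be wrapped in a small helper. This is only a sketch: count_graphemes is a hypothetical name, not part of any module, and the test strings are the grapheme clusters shown earlier in the deck.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: counts extended grapheme clusters with \X.
sub count_graphemes {
    my ($string) = @_;
    my $count = () = $string =~ /\X/g;
    return $count;
}

my $yu = decode('UTF-8', "\xD1\x8E\xCC\x81");    # U+044E U+0301

my %clusters = (
    'A'              => 'A',
    'n + 2 marks'    => "n\x{308}\x{325}",
    'CRLF'           => "\x{0D}\x{0A}",
    'yu with accent' => $yu,
);

for my $label (sort keys %clusters) {
    printf "%-15s %d code point(s), %d grapheme cluster(s)\n",
        $label, length $clusters{$label}, count_graphemes($clusters{$label});
}
```

Each of the four strings counts as a single grapheme cluster, even though they hold one, three, two, and two code points respectively.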
Standard ordering of strings
for comparison and sorting.
sort @names
$a cmp $b
$x gt $y
$foo eq $bar
Collation
Perl provides a collation algorithm
based on code points.
@words = qw( Äpfel durian Xerxes )
sort @words
# Xerxes durian Äpfel
sort { lc $a cmp lc $b } @words
# durian Xerxes Äpfel
Collation
Unicode Collation Algorithm (UCA) provides
collation based on natural language usage.
use Unicode::Collate;
my $collator = Unicode::Collate->new;
$collator->sort(@words);
# Äpfel durian Xerxes
Collation
$collator->sort(@names)
$collator->cmp($a, $b)
$collator->gt($x, $y)
$collator->eq($foo, $bar)
Collation
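Put together as a runnable script, the comparison between code point order and UCA order could look like this sketch. It writes Ä as \x{C4} so the result does not depend on the source file's encoding; the level => 1 collator is an extra illustration of comparing base letters only, not something the slides cover.

```perl
use strict;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my @words = ("\x{C4}pfel", 'durian', 'Xerxes');    # \x{C4} is Ä

my $collator      = Unicode::Collate->new;
my @by_code_point = sort @words;                   # Xerxes durian Äpfel
my @by_uca        = $collator->sort(@words);       # Äpfel durian Xerxes

print "code points: @by_code_point\n";
print "UCA:         @by_uca\n";

# At level 1 only base letters count, so accents and case are ignored.
my $base_letters = Unicode::Collate->new(level => 1);
my $same_base    = $base_letters->eq("\x{C4}pfel", 'apfel');
print 'equal at level 1: ', ($same_base ? 'yes' : 'no'), "\n";
```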
UCA also provides locale-specific collations
for different languages.
use Unicode::Collate::Locale;
my $kolator = Unicode::Collate::Locale->new(
    locale => 'pl'  # Polish
);
Collation
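The effect of a locale is easiest to see on a letter that the language treats as distinct. In Polish, ć is a separate letter that sorts after c, while the default UCA order treats it as an accented c. The word list below is an illustrative assumption, with ć written as \x{107}.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my @words = ('dom', "\x{107}ma", 'cytryna');    # \x{107} is ć

my @default = Unicode::Collate->new->sort(@words);
my @polish  = Unicode::Collate::Locale->new(locale => 'pl')->sort(@words);

print "default: @default\n";    # ćma cytryna dom
print "Polish:  @polish\n";     # cytryna ćma dom
```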
Unicode has 4 normalization forms.
The most important are:
NFD: Normalization Form
Canonical Decomposition
NFC: Normalization Form
Canonical Composition
Normalization
use Unicode::Normalize;
# NFD can be helpful on input
$str = NFD($input);
# NFC is recommended on output
$output = NFC($str);
Normalization
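A short sketch of why this matters: the same visible text can arrive precomposed or decomposed, and the two forms only compare equal once both are normalized the same way. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "jalape\x{F1}o";      # ñ as one code point
my $decomposed = "jalapen\x{303}o";    # n + COMBINING TILDE

my $raw_equal        = $composed eq $decomposed;              # false
my $normalized_equal = NFC($composed) eq NFC($decomposed);    # true

print 'raw strings equal:        ', ($raw_equal        ? 'yes' : 'no'), "\n";
print 'normalized strings equal: ', ($normalized_equal ? 'yes' : 'no'), "\n";
printf "NFD length: %d, NFC length: %d\n",
    length NFD($composed), length NFC($decomposed);    # 9 and 8
```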
UTF-8 encoded input
⇩
decode
⇩
NFD
⇩
Perl Unicode string
⇩
NFC
⇩
encode
⇩
UTF-8 encoded output
Normalization
By default, unfortunately, strings and regexes are
not guaranteed to use Unicode semantics.
This is known as “The Unicode Bug.”
There are a few ways to fix this:
utf8::upgrade($str);
use v5.12;
use feature 'unicode_strings';
Unicode Semantics
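The bug is easiest to see with a character in the Latin-1 range. The sketch below deliberately avoids use v5.12 at the top so the default behavior stays visible, and it assumes no locale pragma is in effect.

```perl
use strict;
use warnings;

my $str = "\xE9";                  # é stored as a native 8-bit string

my $before = uc $str;              # byte semantics: stays \xE9
utf8::upgrade($str);               # same characters, different internal storage
my $after  = uc $str;              # Unicode semantics: \xC9 (É)

my $with_feature = do {
    use feature 'unicode_strings'; # lexically scoped to this block
    my $raw = "\xE9";
    uc $raw;
};

printf "before upgrade: %02X\n", ord $before;          # E9
printf "after upgrade:  %02X\n", ord $after;           # C9
printf "with feature:   %02X\n", ord $with_feature;    # C9
```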
You’ll see the “utf8” encoding
used frequently in Perl.
“utf8” follows the UTF-8 standard very
loosely and allows many errors
in your data without warnings.
By default, use “UTF-8” instead.
UTF-8 vs. utf8 vs. :utf8
# utf8 is Perl's internal encoding form
my $internal = decode('utf8', $input);
# UTF-8 is the official UTF-8 encoding
my $internal = decode('UTF-8', $input);
# insecure! no encoding validation at all
open my $fh, '<:utf8', $filename;
# proper UTF-8 validation
open my $fh, '<:encoding(UTF-8)', $filename;
UTF-8 vs. utf8 vs. :utf8
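To make the validation visible, decode can be asked to die on malformed input instead of silently repairing it. This is a sketch under the assumption that croaking is the desired failure mode; try_decode is a hypothetical helper, while Encode::FB_CROAK is the CHECK constant from the Encode module.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns the decoded string, or undef when $bytes
# is not valid in $encoding. It works on a copy because decode() with a
# CHECK argument may modify the buffer it is given.
sub try_decode {
    my ($encoding, $bytes) = @_;
    return eval { decode($encoding, $bytes, Encode::FB_CROAK) };
}

my $valid   = "jalape\xC3\xB1o";    # well-formed UTF-8
my $invalid = "jalape\xFFo";        # the byte FF never occurs in UTF-8

my $strict_valid   = try_decode('UTF-8', $valid);      # 8 characters
my $strict_invalid = try_decode('UTF-8', $invalid);    # undef: decode croaked

# Without a CHECK argument, bad bytes become U+FFFD instead of an error.
my $replaced = decode('UTF-8', $invalid);

print 'valid input:   ', (defined $strict_valid   ? 'accepted' : 'rejected'), "\n";
print 'invalid input: ', (defined $strict_invalid ? 'accepted' : 'rejected'), "\n";
printf "default decode substitutes U+%04X\n", ord substr($replaced, 6, 1);
```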
Slides will be posted to:
@nickpatch
Questions?
