Demystifying Unicode
@colinodell
Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell
Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers
Assumptions
● Some familiarity with PHP
● Basic understanding of binary and hexadecimal
● Focus on high-level concepts!
Encoding Systems
L 1001100
L
A (Brief) History of
Encoding Systems
1837: Morse Code (Internationalized in 1844)
“Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
1930s: Teleprinters
1960s: Teletypes (TTYs) For Computing
1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Character Hex Binary Character Hex Binary
LF (line feed) 0x0A 0001010 E 0x45 1000101
3 0x33 0110011 e 0x65 1100101
00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols
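The four prefixes above can be checked by looking at the top two bits of a 7-bit code. A minimal sketch follows; the helper name asciiGroup() is made up for illustration and is not part of the talk.

```php
<?php
// Hypothetical helper (not from the talk): names the group of a 7-bit ASCII
// character from its top two bits, i.e. the 00/01/10/11 prefixes listed above.
function asciiGroup(string $char): string
{
    $code = ord($char); // numeric value of the first byte
    if ($char === '' || $code > 0x7F) {
        throw new InvalidArgumentException('Expected a 7-bit ASCII character');
    }

    $groups = [
        0b00 => 'control codes',
        0b01 => 'numbers & symbols',
        0b10 => 'uppercase letters and some extra symbols',
        0b11 => 'lowercase letters and some extra symbols',
    ];

    return $groups[$code >> 5]; // discard the five low "x" bits
}

foreach (["\n", '3', 'E', 'e'] as $char) {
    printf("0x%02X  %07b  %s\n", ord($char), ord($char), asciiGroup($char));
}
```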
A = 0x41 = 1000001
B = 0x42 = 1000010
...
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
...
z = 0x7A = 1111010
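In the values above, each uppercase letter and its lowercase partner differ in exactly one bit (0x20). A small sketch of that property; toggleAsciiCase() is a hypothetical name used only for this example.

```php
<?php
// Hypothetical helper: flips the case of one ASCII letter by toggling
// bit 0x20, the only bit that differs between 'A' (0x41) and 'a' (0x61).
function toggleAsciiCase(string $letter): string
{
    if (preg_match('/\A[A-Za-z]\z/', $letter) !== 1) {
        throw new InvalidArgumentException('Expected a single ASCII letter');
    }

    return chr(ord($letter) ^ 0x20);
}

echo toggleAsciiCase('A'), toggleAsciiCase('z'), PHP_EOL; // aZ
```

This only holds for the 52 ASCII letters; it says nothing about accented or non-Latin characters.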
But computers use 8-bit bytes...
       ASCII (7 Bits)  ???
Start  00000000        10000000
End    01111111        11111111
Count  128             128
Rows 0-7: 7-bit ASCII (same table as above)
Rows 8-F: ??? (8-bit “Extended ASCII”)
ISO 8859-1
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Windows-1252
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ISO 8859-2
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
Code Page 437 (IBM PC)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
Demystifying Unicode - Longhorn PHP 2021
8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used characters; adds the euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94
But then along came the Internet...
https://xkcd.com/927/
“The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”
Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.
ASCII vs. Unicode
Character  Encoded Bits
H          01001000 (0x48)
P          01010000 (0x50)
Glyph  Code Point  Name                    Encoded Bits
P      U+0050      LATIN CAPITAL LETTER P  ????
H      U+0048      LATIN CAPITAL LETTER H  ????
Glyph  Code Point  Name                                 Encoded Bits
P      U+0050      LATIN CAPITAL LETTER P               ????
h      U+0068      LATIN SMALL LETTER H                 ????
Σ      U+03A3      GREEK CAPITAL LETTER SIGMA           ????
ش      U+0634      ARABIC LETTER SHEEN                  ????
𝋭      U+1D2ED     MAYAN NUMERAL THIRTEEN               ????
😸     U+1F638     GRINNING CAT FACE WITH SMILING EYES  ????
H      U+0048      LATIN CAPITAL LETTER H               ????
D   U+0044   LATIN CAPITAL LETTER D
😄  U+1F604  SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Code Planes
Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Contiguous groups of 65,536 (2^16) code points
○ 17 planes, numbered 0-16, corresponding to the possible values 0x00-0x10 of the first
two positions in the six-position hexadecimal format (U+hhhhhh)
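Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. A short sketch under that assumption; both function names are hypothetical.

```php
<?php
// Hypothetical helpers illustrating the recap above.

// Returns the plane (0-16) that a code point belongs to.
function codePointPlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    return $codePoint >> 16; // each plane spans 0x10000 code points
}

// Formats a code point in the usual U+hhhh notation (at least 4 hex digits).
function formatCodePoint(int $codePoint): string
{
    return sprintf('U+%04X', $codePoint);
}

foreach ([0x50, 0x1F638, 0x10FFFF] as $cp) {
    echo formatCodePoint($cp), ' is in plane ', codePointPlane($cp), PHP_EOL;
}
```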
Glyphs and Graphemes
Character / Code Point: a U+0061 LATIN SMALL LETTER A
Glyphs: a a a a a a a a
Glyphs and Graphemes
Glyph / Grapheme  Unicode Character  Code Point  Name
c                 c                  U+0063      LATIN SMALL LETTER C
a                 a                  U+0061      LATIN SMALL LETTER A
f                 f                  U+0066      LATIN SMALL LETTER F
e                 e                  U+0065      LATIN SMALL LETTER E
Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme  Unicode Character  Code Point  Name
c                 c                  U+0063      LATIN SMALL LETTER C
a                 a                  U+0061      LATIN SMALL LETTER A
f                 f                  U+0066      LATIN SMALL LETTER F
é                 e                  U+0065      LATIN SMALL LETTER E
                  ◌́                  U+0301      COMBINING ACUTE ACCENT
e + ◌́ = é
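The table above can be reproduced by listing the code points of a string. A minimal sketch, assuming ext/mbstring and PHP 7.4+ (for mb_str_split()); the helper name listCodePoints() is invented for this example.

```php
<?php
// Hypothetical helper: returns the code points of a UTF-8 string
// in U+hhhh notation.
function listCodePoints(string $utf8): array
{
    $result = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $result[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }

    return $result;
}

$decomposed  = "cafe\u{0301}"; // e followed by COMBINING ACUTE ACCENT
$precomposed = "caf\u{00E9}";  // single code point LATIN SMALL LETTER E WITH ACUTE

echo implode(' ', listCodePoints($decomposed)), PHP_EOL;
echo implode(' ', listCodePoints($precomposed)), PHP_EOL;
var_dump($decomposed === $precomposed); // the two strings differ byte-for-byte
```

Both strings should render as the same word, yet a plain comparison treats them as different.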
Glyphs and Graphemes: Combining Diacritical Marks
ZALGO TEXT (shown on the slide with many combining marks stacked on each letter)
Glyphs and Graphemes: Variation Selectors
Glyph / Grapheme  Unicode Character  Code Point  Name
✈ (text style)    ✈                  U+2708      AIRPLANE
                  VS15               U+FE0E      VARIATION SELECTOR 15 (TEXT STYLE)
✈ (emoji style)   ✈                  U+2708      AIRPLANE
                  VS16               U+FE0F      VARIATION SELECTOR 16 (EMOJI STYLE)
Glyphs and Graphemes: Regional Indicator Symbols
Glyph / Grapheme  Unicode Character  Code Point  Name
🇺🇸                🇺                  U+1F1FA     REGIONAL INDICATOR SYMBOL LETTER U
                  🇸                  U+1F1F8     REGIONAL INDICATOR SYMBOL LETTER S
🇨🇦                🇨                  U+1F1E8     REGIONAL INDICATOR SYMBOL LETTER C
                  🇦                  U+1F1E6     REGIONAL INDICATOR SYMBOL LETTER A
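The regional indicator letters sit in alphabetical order starting at U+1F1E6 (letter A), so a two-letter country code maps to a flag sequence by simple arithmetic. A sketch assuming ext/mbstring; countryCodeToFlag() is a hypothetical name, and whether the pair renders as a flag depends on the font and platform.

```php
<?php
// Hypothetical helper: converts a two-letter country code into the matching
// pair of REGIONAL INDICATOR SYMBOL code points.
function countryCodeToFlag(string $countryCode): string
{
    $countryCode = strtoupper($countryCode);
    if (preg_match('/\A[A-Z]{2}\z/', $countryCode) !== 1) {
        throw new InvalidArgumentException('Expected two ASCII letters');
    }

    $flag = '';
    foreach (str_split($countryCode) as $letter) {
        // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A
        $flag .= mb_chr(0x1F1E6 + ord($letter) - ord('A'), 'UTF-8');
    }

    return $flag;
}

echo countryCodeToFlag('US'), ' ', countryCodeToFlag('ca'), PHP_EOL;
```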
Glyphs and Graphemes: Modifiers
Glyph / Grapheme  Unicode Character  Code Point  Name
👋🏼                👋                 U+1F44B     WAVING HAND SIGN
                  🏼                 U+1F3FC     EMOJI MODIFIER FITZPATRICK TYPE-3
👋🏾                👋                 U+1F44B     WAVING HAND SIGN
                  🏾                 U+1F3FE     EMOJI MODIFIER FITZPATRICK TYPE-5
Glyphs and Graphemes: ZWJ Sequences
Without joiners:
Glyph / Grapheme  Unicode Character  Code Point  Name
👨                👨                 U+1F468     MAN
👩                👩                 U+1F469     WOMAN
👶                👶                 U+1F476     BABY
👧                👧                 U+1F467     GIRL
With a ZERO WIDTH JOINER (ZWJ) between each character:
Unicode Character  Code Point  Name
👨                 U+1F468     MAN
ZWJ                U+200D      ZERO WIDTH JOINER
👩                 U+1F469     WOMAN
ZWJ                U+200D      ZERO WIDTH JOINER
👶                 U+1F476     BABY
ZWJ                U+200D      ZERO WIDTH JOINER
👧                 U+1F467     GIRL
Glyphs and Graphemes: ZWJ Sequences
Unicode Character  Code Point  Name
👷                 U+1F477     CONSTRUCTION WORKER
ZWJ                U+200D      ZERO WIDTH JOINER
♂                  U+2642      MALE SIGN

Unicode Character  Code Point  Name
👷                 U+1F477     CONSTRUCTION WORKER
ZWJ                U+200D      ZERO WIDTH JOINER
♀                  U+2640      FEMALE SIGN

Unicode Character  Code Point  Name
👷                 U+1F477     CONSTRUCTION WORKER
🏾                 U+1F3FE     EMOJI MODIFIER FITZPATRICK TYPE-5
ZWJ                U+200D      ZERO WIDTH JOINER
♀                  U+2640      FEMALE SIGN
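The sequences above are ordinary strings: code points glued together with U+200D. A small sketch, assuming ext/mbstring, that builds two of them and counts bytes and code points. Whether a renderer or grapheme_strlen() treats each as one grapheme depends on the font and ICU version, so that is not asserted here.

```php
<?php
// Build the ZWJ sequences from the tables above using \u{...} escapes.
$zwj = "\u{200D}"; // ZERO WIDTH JOINER

// MAN + ZWJ + WOMAN + ZWJ + BABY + ZWJ + GIRL
$family = implode($zwj, ["\u{1F468}", "\u{1F469}", "\u{1F476}", "\u{1F467}"]);

// CONSTRUCTION WORKER + FITZPATRICK TYPE-5 + ZWJ + FEMALE SIGN
$worker = "\u{1F477}\u{1F3FE}" . $zwj . "\u{2640}";

foreach (['family' => $family, 'worker' => $worker] as $label => $sequence) {
    printf(
        "%s: %d bytes, %d code points\n",
        $label,
        strlen($sequence),
        mb_strlen($sequence, 'UTF-8')
    );
}
```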
Enough about code points...
Encoding Schemes
● Most popular:
○ UTF-8
○ UTF-16
○ UTF-32
UTF-32
Fixed-width encoding; 4 bytes per code point
Codepoint range                   Unicode scalar value (binary)  Encoded bytes
U+0000..U+D7FF, U+E000..U+10FFFF  xxxxxxxxxxxxxxxxxxxxx          00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A   U+0041   LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 00000000 00000000 01000001
😸  U+1F638  GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
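Since UTF-32 is just the code point padded to four bytes, pack() is enough to produce it. A minimal sketch, assuming big-endian byte order (most significant byte first, as in the examples above); utf32beEncode() is a hypothetical name.

```php
<?php
// Hypothetical helper: encodes one Unicode scalar value as big-endian UTF-32.
function utf32beEncode(int $codePoint): string
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }

    return pack('N', $codePoint); // 'N' = unsigned 32-bit, big-endian
}

echo bin2hex(utf32beEncode(0x41)), PHP_EOL;    // 00000041
echo bin2hex(utf32beEncode(0x1F638)), PHP_EOL; // 0001f638
```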
UTF-16
Variable-length encoding; 2 or 4 bytes per code point
Codepoint range                 Unicode scalar value (binary)  Encoded bytes
U+0000..U+D7FF, U+E000..U+FFFF  00000 xxxxxxxx xxxxxxxx        xxxxxxxx xxxxxxxx
(Basic Multilingual Plane)
U+010000..U+10FFFF              Uuuuu uuuuuuuu uuuuuuuu        110110xx xxxxxxxx 110111yy yyyyyyyy
(Supplementary Planes)
U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx (high surrogate, 0xD800-0xDBFF)
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy (low surrogate, 0xDC00-0xDFFF)
Example:
A   U+0041   LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 01000001
Example:
😸  U+1F638  GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 1 11110110 00111000 => 11011000 00111101 11011110 00111000
U' = 11110110 00111000 // 0x1F638 - 0x10000 = 0xF638
W1 = 11011000 00111101 // 0xD800 + 0000111101
W2 = 11011110 00111000 // 0xDC00 + 1000111000
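The U', W1 and W2 steps translate directly into a few bit operations. A sketch only; utf16CodeUnits() is a hypothetical name and returns the 16-bit code units as integers rather than bytes, so byte order is left out of the picture.

```php
<?php
// Hypothetical helper: returns the UTF-16 code units (as integers)
// for one Unicode scalar value.
function utf16CodeUnits(int $codePoint): array
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }

    if ($codePoint < 0x10000) {
        return [$codePoint]; // BMP: one code unit, the value itself
    }

    $u = $codePoint - 0x10000;        // U' (20 bits)
    $w1 = 0xD800 + ($u >> 10);        // top 10 bits go into the high surrogate
    $w2 = 0xDC00 + ($u & 0x3FF);      // low 10 bits go into the low surrogate

    return [$w1, $w2];
}

foreach ([0x41, 0x1F638] as $cp) {
    $units = array_map(fn (int $unit) => sprintf('%04X', $unit), utf16CodeUnits($cp));
    printf("U+%04X => %s\n", $cp, implode(' ', $units));
}
```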
UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range    Byte 1    Byte 2    Byte 3    Byte 4    Notes
U+0000..U+007F     0xxxxxxx                                Covers first 128 codepoints; ASCII
U+0080..U+07FF     110xxxxx  10xxxxxx                      (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF     1110xxxx  10xxxxxx  10xxxxxx            Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  All other planes, including historic scripts, mathematical symbols, and emoji
Trick 1: ASCII === UTF-8
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Trick 3: First byte tells you the length
Trick 4: Self-synchronization
Trick 5: No 0x00 bytes, except for NUL
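The byte patterns in the table, plus tricks 3 and 4, can be sketched in plain PHP without any extension. All three function names below are hypothetical, and the code does no validation beyond what is shown.

```php
<?php
// Hypothetical helper: encodes one Unicode scalar value as UTF-8 bytes,
// following the four rows of the table.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Trick 3: the lead byte gives the sequence length
// (0 = continuation byte or not a valid lead byte).
function utf8SequenceLength(int $leadByte): int
{
    if (($leadByte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($leadByte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;
}

// Trick 4: from any byte offset, step back over 10xxxxxx continuation bytes
// to find where the current character starts.
function utf8CharStart(string $bytes, int $offset): int
{
    while ($offset > 0 && (ord($bytes[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }

    return $offset;
}

echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```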
UTF Encoding Summary
                            UTF-32                  UTF-16                       UTF-8
Encoding length             Fixed                   Variable                     Variable
                            4 bytes per code point  2 or 4 bytes per code point  1-4 bytes per code point
Memory-efficient            No                      Somewhat                     Yes
CPU-efficient               Yes                     Somewhat                     Somewhat
Self-synchronizing          No                      Yes                          Yes
Contains null (0x00) bytes  Yes                     Yes                          No
ASCII-compatible            No                      No                           Yes
https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg
Unicode in PHP
Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
PHP String Functions
Function       What It Actually Does
strlen()       Counts the length in bytes
str_replace()  Replaces bytes
substr()       Returns a subset of bytes
strtoupper()   Converts alphabetic ASCII bytes to uppercase based on globally-set locale
Works for ASCII; not entirely safe* for Unicode!
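A short illustration of what "not entirely safe" can mean in practice, assuming ext/mbstring is available. The word and variable names are chosen for this example only.

```php
<?php
// "café" with a precomposed é (U+00E9), which takes two bytes in UTF-8.
$word = "caf\u{00E9}";

echo strlen($word), PHP_EOL;             // 5 (bytes)
echo mb_strlen($word, 'UTF-8'), PHP_EOL; // 4 (code points)

// substr() counts bytes, so asking for 4 "characters" cuts é in half.
$cutByBytes = substr($word, 0, 4);
var_dump(mb_check_encoding($cutByBytes, 'UTF-8')); // bool(false): invalid UTF-8

// mb_substr() counts code points and keeps é intact.
$cutByCodePoints = mb_substr($word, 0, 4, 'UTF-8');
var_dump($cutByCodePoints === $word); // bool(true)
```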
ext/mbstring
Provides multibyte-safe string functions
Standard Function mbstring Alternative
strlen() mb_strlen()
str_replace() (none)
substr() mb_substr()
strtoupper() mb_strtoupper()
Tip: All functions accept an
optional parameter to specify
the encoding; if omitted, the
internal encoding is used
(mb_internal_encoding(),
UTF-8 by default).
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string: Mary had a little lamb
$mode          Output
MB_CASE_UPPER  MARY HAD A LITTLE LAMB
MB_CASE_LOWER  mary had a little lamb
MB_CASE_TITLE  Mary Had A Little Lamb
MB_CASE_FOLD   mary had a little lamb
Input $string: Ich grüße den Mann (I greet the man)
$mode          Output
MB_CASE_UPPER  ICH GRÜSSE DEN MANN
MB_CASE_LOWER  ich grüße den mann
MB_CASE_TITLE  Ich Grüße Den Mann
MB_CASE_FOLD   ich grüsse den mann
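The table can be reproduced with a loop over the four modes. This sketch assumes ext/mbstring on PHP 7.3 or later, where full case mapping (ß to SS) was introduced; older versions may leave ß unchanged.

```php
<?php
// Run the German example through each mb_convert_case() mode.
$input = 'Ich grüße den Mann';

$modes = [
    'MB_CASE_UPPER' => MB_CASE_UPPER,
    'MB_CASE_LOWER' => MB_CASE_LOWER,
    'MB_CASE_TITLE' => MB_CASE_TITLE,
    'MB_CASE_FOLD'  => MB_CASE_FOLD,
];

$results = [];
foreach ($modes as $name => $mode) {
    // Passing the encoding explicitly avoids relying on the internal default.
    $results[$name] = mb_convert_case($input, $mode, 'UTF-8');
    printf("%-14s %s\n", $name, $results[$name]);
}
```

Case folding (MB_CASE_FOLD) is meant for caseless comparison rather than display, which is why it differs from plain lowercasing here.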
ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code  Matches              Example
L              Any letter           \p{L}
Ll             Lower case letter    \p{Ll}
Lu             Upper case letter    \p{Lu}
Lm             Modifier letter      \p{Lm}
Lt             Title case letter    \p{Lt}
Lo             Other letter         \p{Lo}
S              Any symbol           \p{S}
Sc             Currency symbol      \p{Sc}
Sk             Modifier symbol      \p{Sk}
Sm             Mathematical symbol  \p{Sm}
So             Other symbol         \p{So}
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple characters
that combine into a single glyph
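A few hedged examples of these escapes in use. The sample strings are invented for illustration; the patterns rely on the u modifier and on PCRE being built with Unicode support, which is the norm for PHP builds.

```php
<?php
// \p{Lu}: does the string start with an uppercase letter (any script)?
$startsUpper = preg_match('/^\p{Lu}/u', 'Émile');

// \p{Sc}: collect every currency symbol.
preg_match_all('/\p{Sc}/u', 'Price: $5, €4 or £3', $matches);
$currencySymbols = $matches[0];

// \p{Greek}: is the whole string written in Greek script?
$isGreek = preg_match('/^\p{Greek}+$/u', 'Σοφία');

// \P{L}: strip everything that is NOT a letter.
$lettersOnly = preg_replace('/\P{L}/u', '', 'a1-b2_ç3');

// . matches one code point, \X matches one extended grapheme cluster.
$decomposed = "cafe\u{0301}"; // e + COMBINING ACUTE ACCENT
$codePointCount = preg_match_all('/./u', $decomposed);
$graphemeCount = preg_match_all('/\X/u', $decomposed);

var_dump($startsUpper, $currencySymbols, $isGreek, $lettersOnly);
printf("%d code points, %d graphemes\n", $codePointCount, $graphemeCount);
```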
ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)
ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ
$variations = [
    "\xC7\xBA",                 // 1. U+01FA
    "A" . "\xCC\x8A\xCC\x81",   // 2. A + U+030A + U+0301
    "\xC3\x85\xCC\x81",         // 3. U+00C5 + U+0301
    "\xE2\x84\xAB\xCC\x81",     // 4. U+212B + U+0301
];
foreach ($variations as $str) {
    echo urlencode(Normalizer::normalize($str));
    echo "\n";
}
// %C7%BA
// %C7%BA
// %C7%BA
// %C7%BA
ext/intl - Grapheme Functions
grapheme_extract()
grapheme_stripos()
grapheme_stristr()
grapheme_strlen()
grapheme_strpos()
grapheme_strripos()
grapheme_strrpos()
grapheme_strstr()
grapheme_substr()
$str = "⛄ Cafe\u{0301}"; // "é" written as e + U+0301 COMBINING ACUTE ACCENT
echo strlen($str); // 10 (bytes)
echo mb_strlen($str); // 7 (code points)
echo grapheme_strlen($str); // 6 (graphemes)
ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.
PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes
Fun Tricks & Micro-Optimizations
Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.
Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!
Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
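One way to combine the two approaches is a small wrapper that prefers mb_str_split() and falls back to PCRE. This is a sketch; splitCodePoints() is a hypothetical name, and the fallback assumes the input is valid UTF-8.

```php
<?php
// Hypothetical wrapper: splits a UTF-8 string into an array of code points.
function splitCodePoints(string $str): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($str, 1, 'UTF-8');
    }

    // The empty pattern matches at every position, including the very start
    // and end; PREG_SPLIT_NO_EMPTY drops the empty strings that would cause.
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $parts;
}

$parts = splitCodePoints("cafe\u{0301}");
echo count($parts), PHP_EOL; // 5: the combining accent is its own element
```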
ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
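A third option, shown here only as a sketch, checks the bytes directly and needs no mbstring at all. isAscii() is a hypothetical name. Unlike the length comparison, it does not assume the input is valid UTF-8, since it only asks whether any byte has its high bit set.

```php
<?php
// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
function isAscii(string $str): bool
{
    // No u modifier on purpose: the pattern inspects raw bytes.
    return preg_match('/[\x80-\xFF]/', $str) === 0;
}

var_dump(isAscii('Hello, world!')); // bool(true)
var_dump(isAscii("caf\u{00E9}"));   // bool(false)
```

No benchmark is claimed here; measure before preferring one variant over another.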
Writing Silly Code (Seriously, Don’t Do This)
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT
PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';
Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions
Questions?
Thank You!
Slides & feedback: https://joind.in/talk/9bdc2
Questions? @colinodell or colinodell@gmail.com

More Related Content

PDF
ŰłÙŠŰ±Ű© ÙŰ§Ű·Ù…Ű© Ű§Ù„ŰČÙ‡Ű±Ű§ŰĄ ŰčÙ„ÙŠÙ‡Ű§ Ű§Ù„ŰłÙ„Ű§Ù…
PDF
Bhart rahari 01-06
PDF
Diplom_Momchil_EN.PDF
PDF
Radio 2.0 paris conference
KEY
NoSQL Now! Introduction to Graph Databases
PDF
毆ç‘Șć­žćŽŸç†èˆ‡æŠ€èĄ“
PDF
ç¶Čè·Żæ”»æ“ŠæŠ€èĄ“ćˆ†æž
PDF
Brit- slang
ŰłÙŠŰ±Ű© ÙŰ§Ű·Ù…Ű© Ű§Ù„ŰČÙ‡Ű±Ű§ŰĄ ŰčÙ„ÙŠÙ‡Ű§ Ű§Ù„ŰłÙ„Ű§Ù…
Bhart rahari 01-06
Diplom_Momchil_EN.PDF
Radio 2.0 paris conference
NoSQL Now! Introduction to Graph Databases
毆ç‘Șć­žćŽŸç†èˆ‡æŠ€èĄ“
ç¶Čè·Żæ”»æ“ŠæŠ€èĄ“ćˆ†æž
Brit- slang

What's hot (19)

PDF
Đ”ĐŸĐŽĐ°Ń‚ĐŸĐș 22
PDF
Cassandra introduction at FinishJUG
PPT
wreewrer
PDF
PyLadies Talk: Learn to love the command line!
PDF
UGC NET COMPUTER SCIENCE JUNE 2009 PAPER-II
PDF
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...
PDF
Social Network Analysis With R
PDF
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
PDF
UGC NET COMPUTER SCIENCE JUNE 2010 PAPER-II
PDF
ОтчДт
PDF
No Flex Zone: Empathy Driven Development
PDF
she'ir-ehmetjan
PDF
Profiling Web Archives IIPC GA 2015
PDF
Writing (Meteor) Code With Style
PDF
ă‚ČăƒŒăƒ ç†è«–BASIC 珏27曞 - äș€æž‰ă‚ČăƒŒăƒ  : äș€æž‰ć•éĄŒăšăƒŠăƒƒă‚·ăƒ„äș€æž‰è§Ł-
PDF
PDF
Kaggle Google Quest Q&A Labeling ćçœäŒš LTèł‡æ–™ 47th place solution
PDF
Meteor - not just for rockstars
PDF
PostgreSQL Day italy 2016 Unit Test
Đ”ĐŸĐŽĐ°Ń‚ĐŸĐș 22
Cassandra introduction at FinishJUG
wreewrer
PyLadies Talk: Learn to love the command line!
UGC NET COMPUTER SCIENCE JUNE 2009 PAPER-II
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...
Social Network Analysis With R
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
UGC NET COMPUTER SCIENCE JUNE 2010 PAPER-II
ОтчДт
No Flex Zone: Empathy Driven Development
she'ir-ehmetjan
Profiling Web Archives IIPC GA 2015
Writing (Meteor) Code With Style
ă‚ČăƒŒăƒ ç†è«–BASIC 珏27曞 - äș€æž‰ă‚ČăƒŒăƒ  : äș€æž‰ć•éĄŒăšăƒŠăƒƒă‚·ăƒ„äș€æž‰è§Ł-
Kaggle Google Quest Q&A Labeling ćçœäŒš LTèł‡æ–™ 47th place solution
Meteor - not just for rockstars
PostgreSQL Day italy 2016 Unit Test
Ad

Similar to Demystifying Unicode - Longhorn PHP 2021 (20)

PDF
32 weight loss tips for men – how men can lose weight
PDF
Sahih boukhary 1
PDF
Evolution towards the Internet of Everything
PDF
Secretaria de SaĂșde de Pernambuco negou antecipação de vacina aos rodoviĂĄrios
PDF
Math Workbook Grade 1 Module.pdf
PDF
Kubernetes ăƒăƒƒăƒˆăƒŻăƒŒă‚­ăƒłă‚°ăźă™ăčお
PDF
Đ”ĐŸĐŽĐ°Ń‚ĐŸĐș 3
PDF
RabatBangla.pdf
PDF
CĂłmo abrir archivos .HLP WinHelp en Windows 10.pdf
PDF
Phap luat giao dich dien tu
PDF
Ű§Ù„Ű„ŰłŰȘۧŰȘÙŠÙƒŰ§
PDF
1998 ACURA INTEGRA Service Repair Manual
PDF
ă€äżźćŁ«è«–æ–‡çŽčä»‹ă€‘ă‚œăƒŒă‚·ăƒŁăƒ«ăƒĄăƒ‡ă‚Łă‚ąă‹ă‚‰ăźäœœç”šă‚’è€ƒæ…źă—ăŸé‡‘èžćž‚ć ŽăźèŠłæžŹăƒ»äșˆæžŹăƒąăƒ‡ăƒ«ăźææĄˆ
PDF
Mat fin
PDF
(Guia para elaborar,_estrutura
PDF
Ebook4385(www.takbook.com)
PDF
08_0648_EA.pdf
PDF
katagaitai CTF workshop #10 AESă«ćŻŸă™ă‚‹ç›žé–ąé›»ćŠ›è§Łæž
 
PDF
Ugly Duck Clothing UK eBay Store
32 weight loss tips for men – how men can lose weight
Sahih boukhary 1
Evolution towards the Internet of Everything
Secretaria de SaĂșde de Pernambuco negou antecipação de vacina aos rodoviĂĄrios
Math Workbook Grade 1 Module.pdf
Kubernetes ăƒăƒƒăƒˆăƒŻăƒŒă‚­ăƒłă‚°ăźă™ăčお
Đ”ĐŸĐŽĐ°Ń‚ĐŸĐș 3
RabatBangla.pdf
CĂłmo abrir archivos .HLP WinHelp en Windows 10.pdf
Phap luat giao dich dien tu
Ű§Ù„Ű„ŰłŰȘۧŰȘÙŠÙƒŰ§
1998 ACURA INTEGRA Service Repair Manual
ă€äżźćŁ«è«–æ–‡çŽčä»‹ă€‘ă‚œăƒŒă‚·ăƒŁăƒ«ăƒĄăƒ‡ă‚Łă‚ąă‹ă‚‰ăźäœœç”šă‚’è€ƒæ…źă—ăŸé‡‘èžćž‚ć ŽăźèŠłæžŹăƒ»äșˆæžŹăƒąăƒ‡ăƒ«ăźææĄˆ
Mat fin
(Guia para elaborar,_estrutura
Ebook4385(www.takbook.com)
08_0648_EA.pdf
katagaitai CTF workshop #10 AESă«ćŻŸă™ă‚‹ç›žé–ąé›»ćŠ›è§Łæž
 
Ugly Duck Clothing UK eBay Store
Ad

Demystifying Unicode - Longhorn PHP 2021

  • 2. Colin O’Dell ● Principal Engineer at Unleashed Technologies ● PHP for ~20 years; 13 years professionally ● Creator & maintainer of league/commonmark library ● PHP League leadership team ● Owner of moderngeekware.com ● @colinodell
  • 3. Agenda ● A History of Encoding Systems ● Unicode Standard ● Unicode Encodings ● Using Unicode in PHP ● Tips & Tricks ● Questions & Answers
  • 4. Assumptions ● Some familiarity with PHP ● Basic understanding of binary and hexadecimal ● Focus on high-level concepts!
  • 7. A (Brief) History of Encoding Systems
  • 8. 1837: Morse Code (Internationalized in 1844) “Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
  • 10. 1960s: Teletypes (TTYs) For Computing
  • 11. 1960s: ASCII ● American Standard Code for Information Interchange ● 7-bit binary encoding ○ 0000000 = 0 ○ ... ○ 1111111 = 127
  • 12. ASCII code chart (row = high hex digit, column = low hex digit)
         0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
      0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
      1  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
      2  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
      3  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
      4  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
      5  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
      6  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
      7  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
  • 13. Examples read from the ASCII code chart
      Character       Hex   Binary
      LF (line feed)  0x0A  0001010
      3               0x33  0110011
      E               0x45  1000101
      e               0x65  1100101
  • 14. ASCII code chart grouped by the first two bits
      00xxxxx = 32 control codes
      01xxxxx = 32 numbers & symbols
      10xxxxx = 32 uppercase letters and some extra symbols
      11xxxxx = 32 lowercase letters and some extra symbols
  • 15. A = 0x41 = 1000001, B = 0x42 = 1000010, …, Z = 0x5A = 1011010 (ASCII code chart, group 10xxxxx)
  • 16. A = 0x41 = 1000001, B = 0x42 = 1000010, …, Z = 0x5A = 1011010; a = 0x61 = 1100001, b = 0x62 = 1100010, …, z = 0x7A = 1111010 (ASCII code chart, groups 10xxxxx and 11xxxxx)
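The upper/lowercase pairs on slide 16 differ only in one bit (0x20). A minimal sketch of that observation follows; the helper name `toggleAsciiCase` is hypothetical and the trick applies to ASCII letters only, not to Unicode text in general.

```php
<?php
// Hypothetical helper: flip bit 5 (0x20) to switch an ASCII letter's case.
// Non-letters are returned unchanged.
function toggleAsciiCase(string $char): string
{
    $byte = ord($char); // value of the first byte
    $isLetter = ($byte >= 0x41 && $byte <= 0x5A)   // A-Z
             || ($byte >= 0x61 && $byte <= 0x7A);  // a-z

    return $isLetter ? chr($byte ^ 0x20) : $char;
}

echo toggleAsciiCase('A'), PHP_EOL; // a
echo toggleAsciiCase('z'), PHP_EOL; // Z
```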
  • 17. But computers use 8-bit bytes...
               ASCII (7 bits)   ???
      Start    00000000         10000000
      End      01111111         11111111
      Count    128              128
  • 18. 7-bit ASCII: rows 0–7 of the code chart (0x00–0x7F)
  • 19. 8-bit “Extended ASCII”: rows 0–7 are unchanged 7-bit ASCII; rows 8–F (0x80–0xFF) = ???
  • 20. ISO 8859-1 (rows 2–7 as in ASCII, without DEL; rows 0–1 and 8–9 empty)
         0     1  2  3  4  5  6  7  8  9  A  B  C  D    E  F
      A  NBSP  ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬  SHY  ®  ¯
      B  °     ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½    ¾  ¿
      C  À     Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í    Î  Ï
      D  Ð     Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý    Þ  ß
      E  à     á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í    î  ï
      F  ð     ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý    þ  ÿ
  • 21. Windows-1252 (rows 0–7 as in ASCII; rows A–F as in ISO 8859-1; unassigned positions shown as ··)
         0   1   2  3  4  5  6  7  8  9  A  B  C  D   E  F
      8  €   ··  ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ  ··  Ž  ··
      9  ··  ‘   ’  “  ”  •  –  —  ˜  ™  š  ›  œ  ··  ž  Ÿ
  • 22. ISO 8859-2 (rows 2–7 as in ASCII, without DEL; rows 0–1 and 8–9 empty)
         0     1  2  3  4  5  6  7  8  9  A  B  C  D    E  F
      A  NBSP  Ą  ˘  Ł  ¤  Ľ  Ś  §  ¨  Š  Ş  Ť  Ź  SHY  Ž  Ż
      B  °     ą  ˛  ł  ´  ľ  ś  ˇ  ¸  š  ş  ť  ź  ˝    ž  ż
      C  Ŕ     Á  Â  Ă  Ä  Ĺ  Ć  Ç  Č  É  Ę  Ë  Ě  Í    Î  Ď
      D  Đ     Ń  Ň  Ó  Ô  Ő  Ö  ×  Ř  Ů  Ú  Ű  Ü  Ý    Ţ  ß
      E  ŕ     á  â  ă  ä  ĺ  ć  ç  č  é  ę  ë  ě  í    î  ď
      F  đ     ń  ň  ó  ô  ő  ö  ÷  ř  ů  ú  ű  ü  ý    ţ  ˙
  • 23. Code Page 437 (IBM PC) (rows 2–7 as in ASCII, except 0x7F = ⌂)
         0    1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
      0  NUL  ☺  ☻  ♥  ♦  ♣  ♠  •  ◘  ○  ◙  ♂  ♀  ♪  ♫  ☼
      1  ►    ◄  ↕  ‼  ¶  §  ▬  ↨  ↑  ↓  →  ←  ∟  ↔  ▲  ▼
      8  Ç    ü  é  â  ä  à  å  ç  ê  ë  è  ï  î  ì  Ä  Å
      9  É    æ  Æ  ô  ö  ò  û  ù  ÿ  Ö  Ü  ¢  £  ¥  ₧  ƒ
      A  á    í  ó  ú  ñ  Ñ  ª  º  ¿  ⌐  ¬  ½  ¼  ¡  «  »
      B  ░    ▒  ▓  │  ┤  ╡  ╢  ╖  ╕  ╣  ║  ╗  ╝  ╜  ╛  ┐
      C  └    ┴  ┬  ├  ─  ┼  ╞  ╟  ╚  ╔  ╩  ╦  ╠  ═  ╬  ╧
      D  ╨    ╤  ╥  ╙  ╘  ╒  ╓  ╫  ╪  ┘  ┌  █  ▄  ▌  ▐  ▀
      E  α    ß  Γ  π  Σ  σ  µ  τ  Φ  Θ  Ω  δ  ∞  φ  ε  ∩
      F  ≡    ±  ≥  ≤  ⌠  ⌡  ÷  ≈  °  ∙  ·  √  ⁿ  ²  ■  NBSP
  • 25. 8-bit “Extended ASCII” ● ISO 8859 - 16 variations: ○ ISO 8859-1 (“Latin 1”, Western European) ○ ISO 8859-2 (“Latin 2”, Central European) ○ ISO 8859-3 (“Latin 3”, South European) ○ ISO 8859-4 (“Latin 4”, North European) ○ ISO 8859-5 (Latin/Cyrillic) ○ ISO 8859-6 (Latin/Arabic) ○ ISO 8859-7 (Latin/Greek) ○ ISO 8859-8 (Latin/Hebrew) ○ ISO 8859-9 (“Latin 5”, Turkish) ○ ISO 8859-10 (“Latin 6”, Nordic) ○ ISO 8859-11 (Latin/Thai) ○ ISO 8859-12 (Latin/Devanagari) - abandoned ○ ISO 8859-13 (“Latin 7”, Baltic Rim) ○ ISO 8859-14 (“Latin 8”, Celtic) ○ ISO 8859-15 (“Latin 9”) ■ Revision of 8859-1 that swaps out less-used chars; adds euro currency symbol ○ ISO 8859-16 (“Latin 10”, South-Eastern European) ● Windows-1252 ● CP 437 - Original IBM PC ● Mac OS Roman character set ● TRS-80 character set ● Atari’s ATASCII ● Commodore’s PETSCII ● HP Roman-8 and Roman-9 ● DEC’s Multinational Character Set ● Lotus International Character Set ● ECMA-94
  • 28. But then along came the Internet...
  • 31. “The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software”
  • 32. Code Points Problem: How to accommodate larger character sets without wasting memory? Solution: Break the one-to-one correspondence between characters and bits/encoding! Offer different ways to encode based on different needs.
  • 33. ASCII vs. Unicode
      ASCII:
        Character  Encoded Bits
        H          01001000 (0x48)
        P          01010000 (0x50)
      Unicode:
        Glyph  Code Point                      Encoded Bits
        H      U+0048 LATIN CAPITAL LETTER H   ????
        P      U+0050 LATIN CAPITAL LETTER P   ????
  • 34. Glyph  Code Point                                    Encoded Bits
      H      U+0048 LATIN CAPITAL LETTER H                 ????
      P      U+0050 LATIN CAPITAL LETTER P                 ????
      h      U+0068 LATIN SMALL LETTER H                   ????
      Σ      U+03A3 GREEK CAPITAL LETTER SIGMA             ????
      ش      U+0634 ARABIC LETTER SHEEN                    ????
      𝋭      U+1D2ED MAYAN NUMERAL THIRTEEN                ????
      😸     U+1F638 GRINNING CAT FACE WITH SMILING EYES   ????
  • 38. U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
  • 45. Recap ● Code Point: a number representing a single character* ○ 143,859 defined as of Unicode 13.0 ○ Format: U+hhhhhh ● Codespace: the range of numerical values available for encoding characters ○ Support for 1,114,112 code points (0x000000 - 0x10FFFF) ● Code Planes: contiguous groups of 65,536 (2^16) code points ○ 17 planes, numbered 0 - 16, corresponding to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh)
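Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch of that arithmetic, with a hypothetical helper name (`codePointPlane`) that is not part of the talk:

```php
<?php
// Hypothetical helper: which of the 17 planes contains a given code point?
function codePointPlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    return $codePoint >> 16; // each plane spans 0x10000 (65,536) code points
}

printf("U+%04X is in plane %d\n", 0x0041, codePointPlane(0x0041));   // plane 0 (BMP)
printf("U+%04X is in plane %d\n", 0x1F638, codePointPlane(0x1F638)); // plane 1
```

The codespace size follows from the same numbers: 17 planes × 65,536 = 1,114,112 code points.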
  • 47. Character / Code Point: a U+0061 LATIN SMALL LETTER A
  • 48. Character / Code Point: a U+0061 LATIN SMALL LETTER A Glyphs: a a a a a a a a
  • 49. Glyphs and Graphemes
      Glyph / Grapheme  Unicode Character  Code Point
      c                 c                  U+0063 LATIN SMALL LETTER C
      a                 a                  U+0061 LATIN SMALL LETTER A
      f                 f                  U+0066 LATIN SMALL LETTER F
      e                 e                  U+0065 LATIN SMALL LETTER E
  • 50. Glyphs and Graphemes: Combining Diacritical Marks
      Glyph / Grapheme  Unicode Character  Code Point
      c                 c                  U+0063 LATIN SMALL LETTER C
      a                 a                  U+0061 LATIN SMALL LETTER A
      f                 f                  U+0066 LATIN SMALL LETTER F
      é                 e ◌́                U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT
  • 51. Glyphs and Graphemes: Combining Diacritical Marks: e + ◌́ = é (U+0065 followed by U+0301 renders as a single grapheme)
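To make the difference between code points and graphemes concrete, here is a short sketch comparing the two ways of writing "café". It uses only core PHP (the `\u{...}` escape needs PHP 7.0+); the variable names are illustrative.

```php
<?php
$precomposed = "caf\u{00E9}";  // é as the single code point U+00E9
$decomposed  = "cafe\u{0301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

// Both render the same, but the underlying bytes differ.
var_dump($precomposed === $decomposed); // bool(false)
echo bin2hex($precomposed), PHP_EOL;    // 636166c3a9
echo bin2hex($decomposed), PHP_EOL;     // 63616665cc81
```

A byte-wise comparison treats these as different strings, which is why normalization matters before comparing user input.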
  • 52. Glyphs and Graphemes: Combining Diacritical Marks: “ZALGO TEXT”, with each letter buried under a stack of combining marks
  • 53. Glyphs and Graphemes: Variation Selectors
      Glyph / Grapheme  Code Points
      ✈︎ (text style)    U+2708 AIRPLANE, U+FE0E VARIATION SELECTOR 15 (TEXT STYLE)
  • 54. Glyphs and Graphemes: Variation Selectors
      Glyph / Grapheme  Code Points
      ✈︎ (text style)    U+2708 AIRPLANE, U+FE0E VARIATION SELECTOR 15 (TEXT STYLE)
      ✈️ (emoji style)   U+2708 AIRPLANE, U+FE0F VARIATION SELECTOR 16 (EMOJI STYLE)
  • 55. Glyphs and Graphemes: Regional Indicator Symbols
      Glyph / Grapheme  Code Points
      🇺🇸                U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U, U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
      🇨🇦                U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C, U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
  • 56. Glyphs and Graphemes: Modifiers
      Glyph / Grapheme  Code Points
      👋🏼                U+1F44B WAVING HAND SIGN, U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
      👋🏾                U+1F44B WAVING HAND SIGN, U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
  • 57. Glyphs and Graphemes: ZWJ Sequences
      Glyph / Grapheme  Code Point
      👨                U+1F468 MAN
      👩                U+1F469 WOMAN
      👶                U+1F476 BABY
      👧                U+1F467 GIRL
  • 58. Glyphs and Graphemes: ZWJ Sequences
      Glyph / Grapheme  Code Points
      👨‍👩‍👶‍👧                U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F469 WOMAN, U+200D ZERO WIDTH JOINER, U+1F476 BABY, U+200D ZERO WIDTH JOINER, U+1F467 GIRL
  • 59. Glyphs and Graphemes: ZWJ Sequences
  • 60. Glyphs and Graphemes: ZWJ Sequences
  • 61. Glyphs and Graphemes: ZWJ Sequences
      Glyph / Grapheme  Code Points
      👷‍♂                U+1F477 CONSTRUCTION WORKER, U+200D ZERO WIDTH JOINER, U+2642 MALE SIGN
      👷‍♀                U+1F477 CONSTRUCTION WORKER, U+200D ZERO WIDTH JOINER, U+2640 FEMALE SIGN
  • 62. Glyphs and Graphemes: ZWJ Sequences
      Glyph / Grapheme  Code Points
      👷🏾‍♀                U+1F477 CONSTRUCTION WORKER, U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5, U+200D ZERO WIDTH JOINER, U+2640 FEMALE SIGN
  • 63. Enough about code points...
  • 65. Glyph  Code Point                                    Encoded Bits
      H      U+0048 LATIN CAPITAL LETTER H                 ????
      P      U+0050 LATIN CAPITAL LETTER P                 ????
      h      U+0068 LATIN SMALL LETTER H                   ????
      Σ      U+03A3 GREEK CAPITAL LETTER SIGMA             ????
      ش      U+0634 ARABIC LETTER SHEEN                    ????
      𝋭      U+1D2ED MAYAN NUMERAL THIRTEEN                ????
      😸     U+1F638 GRINNING CAT FACE WITH SMILING EYES   ????
  • 66. Encoding Schemes ● Most popular: ○ UTF-8 ○ UTF-16 ○ UTF-32
  • 67. UTF-32: fixed-width encoding; 4 bytes per code point
      Codepoint range                    Unicode scalar value (binary)  Encoded bytes
      U+0000..U+D7FF, U+E000..U+10FFFF   xxxxxxxxxxxxxxxxxxxxx          00000000 000xxxxx xxxxxxxx xxxxxxxx
  • 68. UTF-32 examples
      A   U+0041 LATIN CAPITAL LETTER A
          0x0041 => 1000001 => 00000000 00000000 00000000 01000001
      😸  U+1F638 GRINNING CAT FACE WITH SMILING EYES
          0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
  • 69. UTF-16: variable-length encoding; 2 or 4 bytes per code point
      Codepoint range                                            Unicode scalar value (binary)  Encoded bytes
      U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)  00000 xxxxxxxx xxxxxxxx        xxxxxxxx xxxxxxxx
  • 70. UTF-16 example: A U+0041 LATIN CAPITAL LETTER A
      0x0041 => 1000001 => 00000000 01000001
  • 71. UTF-16: variable-length encoding; 2 or 4 bytes per code point
      Codepoint range                                            Unicode scalar value (binary)  Encoded bytes
      U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)  00000 xxxxxxxx xxxxxxxx        xxxxxxxx xxxxxxxx
      U+010000..U+10FFFF (Supplementary Planes)                  uuuuu uuuuuuuu uuuuuuuu        110110xx xxxxxxxx 110111yy yyyyyyyy
  • 72. UTF-16 surrogate pairs
      U' = xxxxxxxxxxyyyyyyyyyy  // U - 0x10000
      W1 = 110110xxxxxxxxxx      // 0xD800 + xxxxxxxxxx
      W2 = 110111yyyyyyyyyy      // 0xDC00 + yyyyyyyyyy
  • 73. UTF-16 surrogate ranges: W1 falls in 0xD800-0xDBFF, W2 falls in 0xDC00-0xDFFF
  • 74. UTF-16 example: 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES
      0x1F638 => 1 11110110 00111000
      U' = 11110110 00111000   // 0x1F638 - 0x10000 = 0xF638
      W1 = 11011000 00111101   // 0xD800 + 0000111101
      W2 = 11011110 00111000   // 0xDC00 + 1000111000
      Encoded bytes: 11011000 00111101 11011110 00111000
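The surrogate-pair arithmetic from slides 72 to 74 can be sketched in a few lines of PHP. The function name `utf16SurrogatePair` is a hypothetical helper for illustration; in real code, `mb_convert_encoding()` or `iconv()` would normally do this for you.

```php
<?php
// Hypothetical helper: split a supplementary-plane code point into
// its UTF-16 high (W1) and low (W2) surrogates.
function utf16SurrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only U+10000..U+10FFFF use surrogate pairs');
    }

    $u    = $codePoint - 0x10000;  // U': a 20-bit value
    $high = 0xD800 + ($u >> 10);   // W1: top 10 bits
    $low  = 0xDC00 + ($u & 0x3FF); // W2: bottom 10 bits

    return [$high, $low];
}

[$w1, $w2] = utf16SurrogatePair(0x1F638);
printf("%04X %04X\n", $w1, $w2); // D83D DE38
```

The result matches the encoded bytes on slide 74 (11011000 00111101 11011110 00111000).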
  • 75. UTF-8: variable-length encoding; 1-4 bytes per code point
      Codepoint range     Byte 1    Byte 2    Byte 3    Byte 4    Notes
      U+0000..U+007F      0xxxxxxx                                Covers first 128 codepoints; ASCII
      U+0080..U+07FF      110xxxxx  10xxxxxx                      (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
      U+0800..U+FFFF      1110xxxx  10xxxxxx  10xxxxxx            Rest of the BMP, including most Chinese, Japanese, and Korean characters
      U+10000..U+10FFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  All other planes, including historic scripts, mathematical symbols, and emoji
  • 76. UTF-8 Trick 1: ASCII === UTF-8
  • 77. UTF-8 Trick 2: Virtually all languages only need 1, 2, or 3 bytes
  • 78. UTF-8 Trick 3: First byte tells you the length
  • 79. UTF-8 Trick 4: Self-synchronization
  • 80. UTF-8 Trick 5: No 0x00 bytes, except for NUL
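The UTF-8 byte layout shown on slide 75 can be followed by hand. The sketch below is an illustrative encoder for a single code point; `utf8Encode` is a hypothetical name, and in practice `mb_chr()` or `IntlChar::chr()` would be the usual choice.

```php
<?php
// Hypothetical helper: encode one Unicode scalar value as UTF-8,
// following the 1/2/3/4-byte layouts from the table.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {   // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x41)), PHP_EOL;    // 41
echo bin2hex(utf8Encode(0x3A3)), PHP_EOL;   // cea3
echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```

Every continuation byte starts with the bits 10, which is what makes tricks 3 and 4 work: a decoder can tell lead bytes from continuation bytes without any context.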
  • 81. UTF Encoding Summary
                                   UTF-32                  UTF-16                      UTF-8
      Encoding length              Fixed                   Variable                    Variable
                                   4 bytes per code point  2 or 4 bytes per code point 1-4 bytes per code point
      Memory-efficient             No                      Somewhat                    Yes
      CPU-efficient                Yes                     Somewhat                    Somewhat
      Self-synchronizing           No                      Yes                         Yes
      Contains null (0x00) bytes   Yes                     Yes                         No
      ASCII-compatible             No                      No                          Yes
  • 84. Handling Text In Programming Languages
      1. Treat text as a sequence of bytes (PHP, C)
           $smile = "\xF0\x9F\x98\x80";
           echo $smile;          // => '😀'
           echo strlen($smile);  // => 4
      2. Treat text as a sequence of Unicode code points (Python 3)
      3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
           const smile = '\uD83D\uDE00';
           console.log(smile);         // => '😀'
           console.log(smile.length);  // => 2
  • 85. PHP Strings Be careful! ● Strings are simply byte sequences ● Encoding-agnostic ● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
  • 86. PHP String Functions
      Function       What It Actually Does
      strlen()       Counts the length in bytes
      str_replace()  Replaces bytes
      substr()       Returns a subset of bytes
      strtoupper()   Converts alphabetic ASCII bytes to uppercase based on globally-set locale
      Works for ASCII; not entirely safe* for Unicode!
  • 87. ext/mbstring: provides multibyte-safe string functions
      Standard Function  mbstring Alternative
      strlen()           mb_strlen()
      str_replace()      (none)
      substr()           mb_substr()
      strtoupper()       mb_strtoupper()
      Tip: All functions accept an optional parameter to specify the encoding, if known; otherwise the internal encoding (mb_internal_encoding(), UTF-8 by default) is used.
  • 88. ext/mbstring: provides multibyte-safe string functions
      mb_convert_case(string $string, int $mode, ?string $encoding = null): string
      Input $string: Mary had a little lamb
      $mode          Output
      MB_CASE_UPPER  MARY HAD A LITTLE LAMB
      MB_CASE_LOWER  mary had a little lamb
      MB_CASE_TITLE  Mary Had A Little Lamb
      MB_CASE_FOLD   mary had a little lamb
  • 89. ext/mbstring: provides multibyte-safe string functions
      mb_convert_case(string $string, int $mode, ?string $encoding = null): string
      Input $string: Ich grüße den Mann (I greet the man)
      $mode          Output
      MB_CASE_UPPER  ICH GRÜSSE DEN MANN
      MB_CASE_LOWER  ich grüße den mann
      MB_CASE_TITLE  Ich Grüße Den Mann
      MB_CASE_FOLD   ich grüsse den mann
  • 90. ext/pcre
      Enable UTF-8 support with the u modifier: preg_match('/foo/u')
      Match a character with a Unicode property: \p{xx} (37 different codes)
      Property Code  Matches              Pattern
      L              Any letter           \p{L}
      Ll             Lower case letter    \p{Ll}
      Lu             Upper case letter    \p{Lu}
      Lm             Modifier letter      \p{Lm}
      Lt             Title case letter    \p{Lt}
      Lo             Other letter         \p{Lo}
      S              Any symbol           \p{S}
      Sc             Currency symbol      \p{Sc}
      Sk             Modifier symbol      \p{Sk}
      Sm             Mathematical symbol  \p{Sm}
      So             Other symbol         \p{So}
  • 91. ext/pcre: Match a character with a Unicode script: \p{xxxx} (102 different scripts). Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
  • 92. ext/pcre: Match a character without a Unicode property: \P{xx}
  • 93. ext/pcre: Match a Unicode extended grapheme cluster: \X. Think of it like a . but for multiple characters that combine into a single glyph
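A brief sketch of the `\p{..}` and `\X` escapes in use. The sample strings and variable names are illustrative, and the example assumes PHP's PCRE build has Unicode property support (the default for the bundled library).

```php
<?php
$text = "Caf\u{00E9} \u{03A3} 42"; // "Café Σ 42"

// \p{Lu}: uppercase letters in any script (needs the /u modifier)
preg_match_all('/\p{Lu}/u', $text, $upper);
print_r($upper[0]); // C and Σ

// \X: one extended grapheme cluster per match
preg_match_all('/\X/u', "cafe\u{0301}", $clusters);
echo count($clusters[0]), PHP_EOL; // 4 (c, a, f, é), although the string has 5 code points
```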
  • 94. ext/intl - IntlChar class
      var_dump(IntlChar::charName('⛄'));        // string(20) "SNOWMAN WITHOUT SNOW"
      $name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
      var_dump(IntlChar::charFromName($name));  // int(9843)
      var_dump(IntlChar::isupper("A"));         // bool(true)
  • 95. ext/intl - Normalizer class: four ways to write Ǻ
      1. U+01FA - “Precomposed” character (LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE)
      2. A + U+030A + U+0301 - A base letter A followed by two combining marks (U+030A COMBINING RING ABOVE and U+0301 COMBINING ACUTE ACCENT)
      3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT)
      4. U+212B + U+0301 - A compatibility character (U+212B ANGSTROM SIGN) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT)
  • 96. $variations = [
          "\xC7\xBA",
          "A" . "\xCC\x8A\xCC\x81",
          "\xC3\x85\xCC\x81",
          "\xE2\x84\xAB\xCC\x81",
      ];
  • 97. foreach ($variations as $str) {
          echo urlencode(Normalizer::normalize($str));
          echo "\n";
      }
  • 98. Output of the loop over $variations:
      // %C7%BA
      // %C7%BA
      // %C7%BA
      // %C7%BA
  • 99. ext/intl - Grapheme Functions: grapheme_extract(), grapheme_stripos(), grapheme_stristr(), grapheme_strlen(), grapheme_strpos(), grapheme_strripos(), grapheme_strrpos(), grapheme_strstr(), grapheme_substr()
      $str = "⛄ Cafe\u{0301}";      // '⛄ Café' with é written as e + U+0301
      echo strlen($str);           // 10
      echo mb_strlen($str);        // 7
      echo grapheme_strlen($str);  // 6
  • 100. ext/iconv - iconv() function to convert encodings
      $text = "This is the Euro symbol '€'."; // UTF-8 string
  • 101. ext/iconv - iconv() function to convert encodings
      echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
      // Notice: iconv(): Detected an illegal character in input string
  • 102. ext/iconv - iconv() function to convert encodings
      echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
      // This is the Euro symbol 'EUR'.
      echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
      // This is the Euro symbol ''.
  • 103. PHP Extension Summary ext/iconv: Convert between encodings ext/mbstring: Work with multi-byte string encodings like UTF-8 ext/pcre: Special UTF-compatible matching when /u modifier enabled ext/intl: Work with individual codepoints and graphemes
  • 104. Fun Tricks & Micro-Optimizations
  • 105. Disclaimer Clever hacks and micro-optimizations are usually unnecessary and can be detrimental to long-term maintenance! Don’t use these unless you absolutely need them.
  • 106. Taking Advantage of UTF-Encoded Bytes
      PHP string functions can still be used in some cases:
      if (str_contains($utf8, '&')) { … }
      $trimmed = trim($utf8);
      $firstChar = substr($utf32, 0, 4);
      Requires solid understanding of UTF encodings and what the functions do
      Don’t be clever unless there’s a clear advantage!
  • 107. Splitting Strings Into Codepoints mb_str_split($str) - returns array of individual codepoints (PHP 7.4+) UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) (Works for codepoints, not graphemes)
  • 108. ASCII-Only UTF-8 Strings Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions: $isAscii = mb_detect_encoding($str, 'ASCII', true); Micro-optimization (2x faster): $isASCII = strlen($str) === mb_strlen($str); Speed is fractions of milliseconds; micro-optimization only important for parsing-heavy applications
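For environments without mbstring, the same ASCII-only check can be sketched with the always-available PCRE functions. This is an alternative not shown in the slides; the helper name `isAscii` is hypothetical and no claim is made about its speed relative to the approaches above.

```php
<?php
// Hypothetical helper: a string is pure ASCII if no byte falls outside 0x00-0x7F.
// No /u modifier is used, so the pattern is matched byte by byte.
function isAscii(string $str): bool
{
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

var_dump(isAscii('Hello, world!')); // bool(true)
var_dump(isAscii("caf\u{00E9}"));   // bool(false)
```

This works because every byte of a multi-byte UTF-8 sequence has its high bit set, so any non-ASCII character puts at least one byte above 0x7F into the string.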
  • 109. Writing Silly Code
      PHP supports Unicode in variable and function names:
      class （╯°□°）╯︵┻━┻ extends Exception {}
      throw new （╯°□°）╯︵┻━┻;
  • 110. Writing Silly Code
      The class name （╯°□°）╯︵┻━┻ uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
  • 111. Writing Silly Code (Don’t Do This)
      $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
  • 112. Writing Silly Code (Seriously, Don’t Do This)
      $you can use = 'U+2000 EN QUAD whitespace';
  • 113. Recap
  • 114. Recap & Recommendations ● Unicode supports virtually every known modern and historic writing system ● Codepoints != Glyphs/Graphemes != Encoding ● Use and support UTF-8 everywhere, especially for user input ● PHP strings are just raw bytes ● Use mbstring functions
  • 116. Thank You! Slides & feedback: https://joind.in/talk/9bdc2 Questions? @colinodell or colinodell@gmail.com

Editor's Notes

  • #4: Questions as we go? Raise hand
  • #8: Converts characters into electrical signals
  • #9: Standardized in 1865
  • #10: Simple device Type a key, sends some numbers, same letter comes out the other side
  • #11: But there needs to be a standard
  • #12: Developed in 1960s for teleprinters (“Teletype”) and early computers 7-bit: each letter you type in gets converted into 7 bits
  • #13: Support for: Upper and lowercase letters Numbers Basic, common symbols More control codes (CR, LF, BS, HT, BEL) (next for examples)
  • #14: (how to encode/decode)
  • #15: Something really clever is going on here: group by the first two bits and you get 4 “pages” (sections) of 32 characters each
  • #16: Letters in alphabetical order, starting at 1 (not random)
  • #17: Even more clever - converting between upper and lowercase by changing one bit
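
The one-bit case trick from the ASCII notes above can be sketched in PHP. This is a minimal illustration that assumes single-byte ASCII input; `toggleAsciiCase()` is a hypothetical helper name, not a PHP built-in.

```php
<?php
// In ASCII, upper- and lowercase letters differ only in one bit (0x20).
function toggleAsciiCase(string $char): string
{
    $code = ord($char);
    $isLetter = ($code >= 0x41 && $code <= 0x5A) || ($code >= 0x61 && $code <= 0x7A);

    // XOR with 0x20 flips the single bit separating 'E' (0x45) from 'e' (0x65).
    return $isLetter ? chr($code ^ 0x20) : $char;
}

echo toggleAsciiCase('E'), "\n";           // e
echo toggleAsciiCase('e'), "\n";           // E
printf("%07b %07b\n", ord('E'), ord('e')); // 1000101 1100101

// Letters start at 1 within their group: the low five bits of 'A' are 00001.
echo ord('A') & 0x1F, "\n";                // 1
```

Non-letters are returned unchanged, since flipping that bit on digits or punctuation would produce unrelated characters.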
  • #20: “Extended ASCII” sounds like a standard, but it’s not
  • #21: AKA Latin-1; for the Americas, Western Europe, Oceania, and much of Africa
  • #22: Superset/extension of ISO 8859-1. Adds curly quotation marks. De facto standard for Windows
  • #23: AKA Latin-2; for Central and Eastern European languages
  • #24: UI graphics, science, and math. Standard EGA/VGA encoding on graphics cards
  • #27: That’s a lot! However,
  • #28: In practice, most users only used one standard locally. Which was fine...
  • #30: Standards proliferation
  • #33: (Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
  • #35: ATTN: 4 vs 5 char convention
  • #44: Support for 1,114,112 code points (0x000000 - 0x10FFFF). Code planes: a contiguous group of 65,536 (2^16) code points. There are 17 planes, numbered 0 to 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Codespace: the entire range of numerical values available for encoding characters
  • #45: Code planes: a contiguous group of 65,536 (2^16) code points. There are 17 planes, numbered 0 to 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Codespace: the entire range of numerical values available for encoding characters. Support for 1,114,112 code points (0x000000 - 0x10FFFF)
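
The plane arithmetic in the notes above can be checked with a short sketch. Each plane holds 2^16 code points, so the plane number is the code point shifted right by 16 bits. `codePlane()` is a hypothetical helper name, and the sample code points are illustrations.

```php
<?php
// Returns the plane (0 to 16) that a code point belongs to.
function codePlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    // Dividing by 0x10000 (2^16) leaves the first two hex positions of U+hhhhhh.
    return $codePoint >> 16;
}

echo codePlane(0x0041), "\n";   // 0  (U+0041, Basic Multilingual Plane)
echo codePlane(0x1F60E), "\n";  // 1  (U+1F60E)
echo codePlane(0x10FFFF), "\n"; // 16 (last code point)
echo 17 * 0x10000, "\n";        // 1114112 code points in total
```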
  • #48: Unicode does not specify how the character / code point should be displayed (or encoded)!
  • #49: Unicode does not specify how the character / code point should be displayed (or encoded)!
  • #51: Combining Diacritical Marks
  • #52: In this example: 5 code points but 4 graphemes. Grapheme = smallest unit of a writing system. Think about putting the cursor in this text and selecting something or pressing backspace
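
The difference between code points and graphemes can be made concrete with a sketch. The sample string is an assumption chosen to give the same 5-versus-4 count as the note; it may not be the slide's own example. The helpers are hypothetical and use PCRE's UTF-8 mode so the sketch has no extension dependencies. In practice `mb_strlen()` (mbstring) and `grapheme_strlen()` (intl) are the usual tools.

```php
<?php
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT: renders as "café".
$text = "cafe\u{301}";

// Counts code points: with the /u flag, "." matches one whole code point.
function countCodePoints(string $utf8): int
{
    return (int) preg_match_all('/./su', $utf8);
}

// Counts graphemes: \X matches one extended grapheme cluster
// (a base character plus any combining marks attached to it).
function countGraphemes(string $utf8): int
{
    return (int) preg_match_all('/\X/u', $utf8);
}

echo strlen($text), "\n";          // 6 bytes
echo countCodePoints($text), "\n"; // 5 code points
echo countGraphemes($text), "\n";  // 4 graphemes
```

The grapheme count is what a user expects when moving the cursor or pressing backspace.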
  • #53: “Zalgo text” or “glitch text”
  • #54: Combining Diacritical Marks
  • #55: Combining Diacritical Marks
  • #56: Combining Diacritical Marks
  • #57: Combining Diacritical Marks
  • #58: Combining Diacritical Marks
  • #59: Windows supports 52,000 family combinations
  • #60: Windows supports 52,000 family combinations
  • #61: If system lacks dedicated image, individual emojis are shown
  • #63: Combining Diacritical Marks
  • #69: Pros: code points always use the same number of bytes; very straightforward. Cons: not very memory efficient, can contain null bytes, not self-synchronizing
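
The fixed-width trade-off can be seen in a small sketch, assuming the note describes UTF-32. `utf32be()` is a hypothetical helper; it relies on `pack()` with the `N` format (unsigned 32-bit, big-endian).

```php
<?php
// Encodes each code point as one 4-byte big-endian integer (UTF-32BE).
function utf32be(int ...$codePoints): string
{
    return pack('N*', ...$codePoints);
}

$bytes = utf32be(0x41, 0x2603); // 'A' and U+2603 SNOWMAN

echo strlen($bytes), "\n";  // 8: always 4 bytes per code point
echo bin2hex($bytes), "\n"; // 0000004100002603 (note the null bytes)
```

Even a plain ASCII letter costs four bytes, three of them zero, which is where the memory and null-byte drawbacks come from.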
  • #70: BMP = basically everything except emojis and historical scripts
  • #74: “Surrogate pairs”; values are reserved, no code points with those values
  • #75: Pros: more memory efficient (most of the time); works well for the BMP; is self-synchronizing. Cons: 4-byte encoding logic is somewhat messy; can contain null bytes
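
The "somewhat messy" 4-byte logic is the surrogate-pair calculation, sketched below for a single code point. `utf16CodeUnits()` is a hypothetical helper name, and it does not validate that the input is a legal code point.

```php
<?php
// Returns the 16-bit code unit(s) UTF-16 uses for one code point.
function utf16CodeUnits(int $codePoint): array
{
    if ($codePoint < 0x10000) {
        return [$codePoint];        // BMP: a single 16-bit code unit
    }

    $offset = $codePoint - 0x10000; // 20 bits remain

    return [
        0xD800 | ($offset >> 10),   // high surrogate carries the top 10 bits
        0xDC00 | ($offset & 0x3FF), // low surrogate carries the bottom 10 bits
    ];
}

foreach ([0x2603, 0x1F60E] as $cp) {
    $units = array_map(fn (int $u): string => sprintf('%04X', $u), utf16CodeUnits($cp));
    printf("U+%04X => %s\n", $cp, implode(' ', $units));
}
// U+2603 => 2603
// U+1F60E => D83D DE0E
```

The ranges D800 to DBFF and DC00 to DFFF are reserved for surrogates, so no real code point takes those values.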
  • #96: This symbol can be encoded 4 different ways
  • #98: Intl normalizer class
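
A minimal sketch of the intl `Normalizer` class follows. It assumes the intl extension is installed, and it uses a simpler two-form case (precomposed versus decomposed é) rather than the four-way symbol the previous note mentions. `sameText()` is a hypothetical helper name.

```php
<?php
// Compares two strings after normalizing both to NFC (canonical composition).
// Requires the intl extension, which provides the Normalizer class.
function sameText(string $a, string $b): bool
{
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

if (class_exists('Normalizer')) {
    $precomposed = "\u{E9}";   // é as one code point
    $decomposed  = "e\u{301}"; // e + combining acute accent

    var_dump($precomposed === $decomposed);        // bool(false): different bytes
    var_dump(sameText($precomposed, $decomposed)); // bool(true)

    // NFD splits é back into e + U+0301.
    echo bin2hex(Normalizer::normalize($precomposed, Normalizer::FORM_D)), "\n"; // 65cc81
}
```

Normalizing user input to one form before storing or comparing it avoids visually identical strings that fail an equality check.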
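  • #100: In UTF-8: 3 bytes for the snowman, 1 for the space, 1 for each of the letters c, a, and f, and 2 for é when it is the single precomposed code point U+00E9. (If é is instead written as e plus the combining acute accent U+0301, it takes 1 + 2 = 3 bytes.)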
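
The byte counts can be verified with a short sketch. It assumes the text is a snowman, a space, and "café", and it shows both spellings of é because the note does not say which one the slide uses. `utf8ByteLength()` is a hypothetical helper name.

```php
<?php
// Number of bytes UTF-8 needs for one code point.
function utf8ByteLength(int $codePoint): int
{
    if ($codePoint < 0x80)    { return 1; } // ASCII
    if ($codePoint < 0x800)   { return 2; } // includes U+00E9 and U+0301
    if ($codePoint < 0x10000) { return 3; } // rest of the BMP, e.g. U+2603
    return 4;                               // supplementary planes
}

$precomposed = "\u{2603} caf\u{E9}";   // snowman, space, c, a, f, é
$decomposed  = "\u{2603} cafe\u{301}"; // snowman, space, c, a, f, e, combining accent

echo strlen($precomposed), "\n";   // 9  = 3 + 1 + 1 + 1 + 1 + 2
echo strlen($decomposed), "\n";    // 10 = 3 + 1 + 1 + 1 + 1 + 1 + 2
echo utf8ByteLength(0x2603), "\n"; // 3
```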
  • #110: Now for some fun tricks