Notes on a Standard:
    UNICODE
     Elena-Oana Tabaranu
  elena.tabaranu@info.uaic.ro
           UAIC, Iasi
Plan
●   Introduction
●   Design Principles
●   Code Points and Characters
●   Encoding Forms, UTF-32, UTF-16, UTF-8
●   Conclusion




Introduction
●   UNIversal character enCODing system
●   Unicode = universal character encoding standard
    for written characters and text
●   Advantages
    ●   Consistent way of encoding multilingual text
    ●   Data stability instead of proliferating character sets
    ●   Aims to encode all characters used in the world's written
        languages (more than 1 million code points are available)
    ●   Creates a foundation for global software

Design Principles




Characters, not Glyphs
●   The Unicode Standard draws a distinction between
    characters and glyphs.
●   Characters are the abstract representations of the
    smallest components of written language that have
    semantic value.




Logical Order
●   The order in which Unicode text is stored in the
    memory representation is called logical order
●   The Unicode Standard includes characters to explicitly
    specify changes in direction when necessary
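A small sketch of the two points above, using Python's standard `unicodedata` module. The sample characters are chosen here for illustration and are not from the slides: a string keeps its code points in logical order regardless of how a renderer displays them, and characters such as RIGHT-TO-LEFT MARK or RIGHT-TO-LEFT OVERRIDE exist to state direction explicitly.

```python
import unicodedata

# Illustrative samples: a Latin letter, a Hebrew letter, an Arabic letter,
# and two characters whose job is to control text direction explicitly.
samples = ["a", "\u05d0", "\u0627", "\u200f", "\u202e"]

for ch in samples:
    # bidirectional() returns the character's bidi class (L, R, AL, RLO, ...)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

# Logical order: the code points are stored in reading order, even though
# a renderer would display the Hebrew letters from right to left.
text = "abc \u05d0\u05d1\u05d2"
print([f"U+{ord(c):04X}" for c in text])
```

The stored sequence never changes; reordering for display is left to the rendering layer.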
Code Points and Characters
●   Abstract characters are
    encoded internally as
    numbers
●   Codespace: 0 to 10FFFF (hexadecimal)
    => 1,114,112 code points
    available
●   Abstract character -> code
    point
●   Example:
    U+0061 LATIN SMALL LETTER A
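The mapping from abstract character to code point can be checked with Python built-ins. This is only a sketch; the constant name `CODESPACE_MAX` is introduced here for illustration.

```python
import unicodedata

CODESPACE_MAX = 0x10FFFF  # last code point in the Unicode codespace

# The codespace runs from 0 to 10FFFF (hex), so its size is one more than the maximum.
print(CODESPACE_MAX + 1)

# Abstract character -> code point, and back again.
cp = ord("a")
print(f"U+{cp:04X}", unicodedata.name("a"))
print(chr(0x0061))

# Values beyond the codespace are not code points at all.
try:
    chr(CODESPACE_MAX + 1)
except ValueError as exc:
    print("outside the codespace:", exc)
```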

Encoding Forms
●   Encoding forms specify how
    each code point is to be
    expressed as a sequence of
    one or more code units (8-bit,
    16-bit, or 32-bit units)
●   Encoding forms for Unicode
    characters: UTF-8, UTF-16,
    UTF-32
●   Each form can be efficiently
    transformed into either of the
    other two without any loss of
    data
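As a rough illustration of the lossless conversion claim, the sketch below encodes one sample string in all three forms with Python's built-in codecs and converts between them. The sample text and the `forms` table are assumptions made for this example; the big-endian codec variants are used only so that no byte order mark is added.

```python
# Sample text: U+0061, U+20AC (euro sign), U+10000 (first supplementary code point)
text = "a\u20ac\U00010000"

# Encoding form -> (Python codec name, code unit size in bytes)
forms = {
    "UTF-8": ("utf-8", 1),
    "UTF-16": ("utf-16-be", 2),
    "UTF-32": ("utf-32-be", 4),
}

encoded = {}
for name, (codec, unit_size) in forms.items():
    data = text.encode(codec)
    encoded[name] = data
    print(name, len(data) // unit_size, "code units:", data.hex(" "))

# Convert UTF-8 -> UTF-16 -> UTF-32 -> text and check nothing was lost.
as_utf16 = encoded["UTF-8"].decode("utf-8").encode("utf-16-be")
as_utf32 = as_utf16.decode("utf-16-be").encode("utf-32-be")
print(as_utf32.decode("utf-32-be") == text)
```

The same three code points take 8 code units in UTF-8, 4 in UTF-16, and 3 in UTF-32.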

UTF-32
●   The simplest Unicode encoding form
●   Each Unicode code point is represented directly
    by a single 32-bit code unit (fixed-width)
●   Restricted to representation of code points in
    the range 0..10FFFF (hexadecimal)
●   Example:
    U+10000 is represented as <00010000>
●   Preferred encoding form for processing
    characters on most Unix platforms
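Because UTF-32 stores the code point value directly, an encoder is little more than a range check. The helper below is a hypothetical sketch (the name `utf32_be` is not from the slides) that writes each code unit big-endian, matching the `<00010000>` notation used above.

```python
import struct

def utf32_be(cp: int) -> bytes:
    """Encode one code point as a single big-endian 32-bit code unit."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    return struct.pack(">I", cp)  # the code unit value equals the code point

print(utf32_be(0x10000).hex())  # the slide's example, U+10000

# Fixed width: every code point occupies exactly 4 bytes.
print(len("a\u20ac\U00010000".encode("utf-32-be")))
```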
UTF-16
●   Code unit values differ from the code point value
    for code points above U+FFFF => conversion required
●   Variable-width encoding:
    ➢   U+0000..U+FFFF are represented as a single 16-bit
        code unit (U+D800..U+DFFF are reserved for surrogates)
    ➢   U+10000..U+10FFFF are represented as pairs of
        16-bit code units (surrogate pairs)
●   Optimized for BMP (Basic Multilingual Plane) =
    majority of common-use characters for all
    modern scripts of the world
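The surrogate-pair conversion mentioned above can be sketched as follows. The function name `utf16_code_units` is hypothetical; the arithmetic is the standard one: subtract 10000 (hex), then split the remaining 20 bits into two 10-bit halves added to D800 and DC00.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0xFFFF:
        return [cp]                      # BMP: one code unit, same value
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

for cp in (0x0061, 0x20AC, 0x10000, 0x1F600):
    print(f"U+{cp:04X}", [f"{u:04X}" for u in utf16_code_units(cp)])
```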
UTF-8
●   UTF-8 encodes each character (code point) in 1 to 4
    octets (8-bit bytes), with the single-octet encoding
    used only for the 128 US-ASCII characters
    ●   U+0000 to U+007F → 1 byte
    ●   U+0080 to U+07FF → 2 bytes
    ●   U+0800 to U+FFFF → 3 bytes
    ●   U+10000 to U+10FFFF → 4 bytes
●   Backwards compatible with ASCII
●   Default encoding for XML (XHTML) documents
●   Example:
    U+10000 is represented as <F0 90 80 80>
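A minimal hand-written encoder shows where the 1 to 4 byte lengths come from. This is a sketch for illustration only (the name `utf8_encode` is hypothetical); in practice `str.encode("utf-8")` does the same job, and the last line compares against it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:       # ASCII range: the byte is the code point itself
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x0061, 0x00E9, 0x20AC, 0x10000):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" ").upper())

# Agreement with the built-in codec for the slide's example.
print(utf8_encode(0x10000) == "\U00010000".encode("utf-8"))
```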

Conclusion
●   The Unicode Standard is a superset of all
    characters in widespread use today.
●   It contains characters from major international
    and national standards (e.g. the SGML
    standard) as well as prominent industry
    character sets (e.g. industry codes from Apple,
    Adobe, Fujitsu, etc.).
●   Responds to changing industry demands by
    encoding important new characters (e.g. the €
    sign)
Questions?
●   Thank You!




