Unicode
• A standard character encoding designed to
support all of the world's languages
• Unicode represents characters differently than
ASCII
• Characters are mapped to a code point
– Example: “A” is mapped to code point 65 (binary 100 0001)
– The code point can then be encoded as UTF-32, UTF-16, or UTF-8
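As a small illustration of the character-to-code-point mapping above, the following Python sketch looks up the code point of "A" and prints it in binary. The variable names are only illustrative.

```python
# Map a character to its Unicode code point and back again.
ch = "A"
code_point = ord(ch)               # integer code point of the character
binary = format(code_point, "b")   # the same number written in binary

print(ch, code_point, binary)      # A 65 1000001
assert chr(code_point) == ch       # chr() is the inverse of ord()
```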
UTF-32
• Uses 4 bytes (32 bits)
• Example:
– A (100 0001) is stored as four bytes (listed here from least to most significant):
0 1 0 0 0 0 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
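The four-byte layout of "A" can be reproduced with Python's built-in codecs. This is a minimal sketch; the choice of the `utf-32-le` codec (little-endian, no byte-order mark) is an assumption made so that the output matches the byte order listed above.

```python
# Encode "A" as UTF-32. "utf-32-le" writes the least significant byte first
# and omits the byte-order mark that the plain "utf-32" codec would add.
utf32 = "A".encode("utf-32-le")

# Show each of the four bytes as 8 binary digits.
bits = [format(b, "08b") for b in utf32]
print(bits)  # ['01000001', '00000000', '00000000', '00000000']
```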
UTF-32
• Problem: text that takes 1 KB in ASCII takes 4 KB in UTF-32
UTF-16
• Stores each character in either one 16-bit unit or two 16-bit units
– One unit: 16 bits
– Two units: 16 bits + 16 bits
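To see the one-unit and two-unit cases side by side, here is a hedged Python sketch. The helper name `utf16_units` is hypothetical, and the two sample characters (U+0041 and U+1F600) are chosen only for illustration.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit units UTF-16 needs for one character."""
    # "utf-16-le" omits the byte-order mark, so every 2 bytes is one unit.
    return len(ch.encode("utf-16-le")) // 2

one_unit = utf16_units("A")            # U+0041 fits in a single 16-bit unit
two_units = utf16_units("\U0001F600")  # U+1F600 needs two 16-bit units

print(one_unit, two_units)  # 1 2
```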
UTF-16
• Problem: text that takes 1 KB in ASCII takes 2 KB in UTF-16
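The size problem for UTF-32 and UTF-16 can be checked directly. The sketch below assumes a sample text of 1024 ASCII letters and uses the little-endian codecs so that no byte-order mark is counted.

```python
# 1024 ASCII characters: exactly 1 KB when stored as ASCII.
text = "a" * 1024

ascii_size = len(text.encode("ascii"))
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per character
utf32_size = len(text.encode("utf-32-le"))  # 4 bytes per character

print(ascii_size, utf16_size, utf32_size)  # 1024 2048 4096
```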
UTF-8
• It supports every language you’ll probably ever need.
• No need for Windows-1252 this and Windows-1253 that.
• Its code point range is from 0x00 to 0x10FFFF
• It uses a variable-length encoding of 1 to 4 bytes per code point.
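The 1-to-4-byte behaviour can be observed with one character from each length class. The four sample characters below are illustrative choices, not taken from the slides.

```python
# One sample character for each UTF-8 length (1, 2, 3 and 4 bytes).
samples = {
    "A": "\u0041",                  # U+0041
    "e-acute": "\u00e9",            # U+00E9
    "euro sign": "\u20ac",          # U+20AC
    "grinning face": "\U0001F600",  # U+1F600
}

lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)  # {'A': 1, 'e-acute': 2, 'euro sign': 3, 'grinning face': 4}
```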
UTF-8 (1-byte)
• 1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.
• 1-byte UTF-8 = ASCII
– MSBit is 0: 0 X X X X X X X
– code point = representation
• Examples of 1-byte UTF-8:
– “A” -> 0100 0001
– “&” -> 0010 0110
– “5” -> 0011 0101
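The three 1-byte examples can be verified with a short Python sketch. It also checks that the UTF-8 bytes equal the ASCII bytes for these characters; the helper name `one_byte_bits` is hypothetical.

```python
def one_byte_bits(ch: str) -> str:
    """Return the single UTF-8 byte of an ASCII character as 8 binary digits."""
    encoded = ch.encode("utf-8")
    if len(encoded) != 1:
        raise ValueError("character is outside the 1-byte range 0x00 to 0x7F")
    return format(encoded[0], "08b")

results = {ch: one_byte_bits(ch) for ch in "A&5"}
print(results)  # {'A': '01000001', '&': '00100110', '5': '00110101'}

# For code points up to 0x7F, UTF-8 and ASCII produce the same byte.
same_as_ascii = all(ch.encode("utf-8") == ch.encode("ascii") for ch in "A&5")
```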
UTF-8 (2-byte)
• 2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.
• 2-byte UTF-8
code point != representation
• The code point is broken apart into two pieces.
• The five MSBits of the code point are assigned to the first byte
and the six LSBits are assigned to the second byte.
UTF-8 (2-byte)
For the first byte of 2-byte UTF-8:
• The three MSBits are set to 110
• The remaining bits are the five MSBits of the code point.
For the second byte of 2-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
UTF-8 (2-byte)
Leading Byte: 1 1 0 X X X X X
Continuation Byte: 1 0 X X X X X X
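The 2-byte layout can be sketched in Python with shifts and masks. The function name `encode_2byte` is hypothetical, the range check (0x0080 to 0x07FF) follows from the UTF-8 definition, and the test character U+00E9 is an illustrative choice.

```python
def encode_2byte(cp: int) -> bytes:
    """Encode a code point in the 2-byte UTF-8 range by hand."""
    if not 0x0080 <= cp <= 0x07FF:
        raise ValueError("code point outside the 2-byte range")
    lead = 0b11000000 | (cp >> 6)          # 110 + five MSBits
    cont = 0b10000000 | (cp & 0b00111111)  # 10 + six LSBits
    return bytes([lead, cont])

encoded = encode_2byte(0x00E9)
print(encoded.hex())  # c3a9

# Cross-check against Python's built-in encoder.
assert encoded == "\u00e9".encode("utf-8")
```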
UTF-8 (3-byte)
• 3-byte UTF-8 is used for code points in the range 0x0800 to
0xFFFF.
• 3-byte UTF-8
code point != representation
• The code point is broken apart into three pieces.
UTF-8 (3-byte)
• The four MSBits of the code point are assigned to the first
byte.
• The middle six bits are assigned to the second byte.
• The six LSBits are assigned to the third byte.
UTF-8 (3-byte)
For the first byte of 3-byte UTF-8
• The four MSBits are set to 1110
• The remaining bits are the four MSBits of the code point.
For the second byte of 3-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six middle bits of the code point.
UTF-8 (3-byte)
For the third byte of 3-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
UTF-8 (3-byte)
Leading Byte: 1 1 1 0 X X X X
Continuation Byte: 1 0 X X X X X X
Continuation Byte: 1 0 X X X X X X
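A hand-written version of the 3-byte layout might look like the sketch below. The function name `encode_3byte` is hypothetical, and the test code point U+20AC is an illustrative choice. The surrogate check is an addition not covered by the slides: code points 0xD800 to 0xDFFF are reserved for UTF-16 and are not valid in UTF-8.

```python
def encode_3byte(cp: int) -> bytes:
    """Encode a code point in the 3-byte UTF-8 range by hand."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the 3-byte range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    first = 0b11100000 | (cp >> 12)                  # 1110 + four MSBits
    second = 0b10000000 | ((cp >> 6) & 0b00111111)   # 10 + six middle bits
    third = 0b10000000 | (cp & 0b00111111)           # 10 + six LSBits
    return bytes([first, second, third])

encoded = encode_3byte(0x20AC)
print(encoded.hex())  # e282ac

# Cross-check against Python's built-in encoder.
assert encoded == "\u20ac".encode("utf-8")
```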
UTF-8 (4-byte)
• 4-byte UTF-8 is used for code points in the range 0x10000 to
0x10FFFF.
• 4-byte UTF-8
code point != representation
• The code point is broken apart into four pieces.
UTF-8 (4-byte)
• The three MSBits of the code point are assigned to the first
byte.
• The next six bits are assigned to the second byte.
• The six bits after those are assigned to the third byte.
• The six LSBits are assigned to the fourth byte.
UTF-8 (4-byte)
For the first byte of 4-byte UTF-8
• The five MSBits are set to 11110
• The remaining bits are the three MSBits of the code point.
For the second byte of 4-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six bits that follow the three MSBits of the code point.
UTF-8 (4-byte)
For the third byte of 4-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six bits that precede the six LSBits of the code point.
For the fourth byte of 4-byte UTF-8
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
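The four-piece split can be sketched in the same way. The function name `encode_4byte` is hypothetical, and U+1F600 is used only as an illustrative test value; the code point is treated as a 21-bit number.

```python
def encode_4byte(cp: int) -> bytes:
    """Encode a code point in the 4-byte UTF-8 range by hand."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the 4-byte range")
    first = 0b11110000 | (cp >> 18)                   # 11110 + three MSBits
    second = 0b10000000 | ((cp >> 12) & 0b00111111)   # 10 + next six bits
    third = 0b10000000 | ((cp >> 6) & 0b00111111)     # 10 + next six bits
    fourth = 0b10000000 | (cp & 0b00111111)           # 10 + six LSBits
    return bytes([first, second, third, fourth])

encoded = encode_4byte(0x1F600)
print(encoded.hex())  # f09f9880

# Cross-check against Python's built-in encoder.
assert encoded == "\U0001F600".encode("utf-8")
```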
Examples
10011100101001
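The slide gives only the bit string, so the sketch below assumes it is meant as a code point to be encoded in UTF-8. Read as a binary number it is 0x2729, which falls in the 3-byte range 0x0800 to 0xFFFF.

```python
# Interpret the bit string as a code point (assumption: it is a code point,
# not an already-encoded byte sequence).
cp = int("10011100101001", 2)
print(hex(cp))  # 0x2729

# Encode it with Python's built-in UTF-8 encoder and show each byte in binary.
encoded = chr(cp).encode("utf-8")
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)  # 11100010 10011100 10101001
```

The leading byte starts with 1110 and both following bytes start with 10, matching the 3-byte layout.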

4 character encoding-unicode
