HTTP: The Definitive Guide, Chapter 16

HTTP: The Definitive Guide
(Ch. 16: Internationalization)
People Who Dream of Becoming Architects
Cecil

Contents
• HTTP support for international content
• Internationalized URIs
• Other considerations

HTTP for Multilingual Content
Request message:
GET /bigfile.html HTTP/1.1
Accept-Charset: iso-8859-1, utf-8
Accept-Language: fr, en;q=0.8
[...]
Response message:
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8    (character encoding)
Content-Language: fr    (language tag)
[...]
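The Accept-Charset and Accept-Language values on this slide are comma-separated lists with optional q (quality) parameters. As a minimal sketch, assuming a missing q means 1.0 and that a malformed q should be treated as 0, a server could rank a client's preferences like this. The function name `parse_quality_list` is made up for illustration and is not part of any library.

```python
def parse_quality_list(header_value):
    """Parse a value such as 'fr, en;q=0.8' into (token, q) pairs, best first."""
    items = []
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        token, _, params = part.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed q counts as "not acceptable"
        items.append((token.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(items, key=lambda item: item[1], reverse=True)


print(parse_quality_list("fr, en;q=0.8"))
print(parse_quality_list("iso-8859-1, utf-8"))
```

With the slide's headers, French ranks ahead of English, and both charsets are equally acceptable.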

Character Encoding
HTTP is concerned only with transporting the character data and the associated language and charset labels. The presentation of the character shapes is handled by the user's graphics display software (browser, operating system, fonts), as shown in Figure 16-2c.
Figure 16-2. HTTP "charset" combines a character encoding scheme and a coded character set
(a) Decode using the encoding scheme: the data bits ...11100001 become character code 225 (using iso-8859-6's encoding)
(b) Find the character using the coded character set: code 225 in the iso-8859-6 set is "ARABIC LETTER FEH" (65 = LATIN CAPITAL LETTER A, 66 = LATIN CAPITAL LETTER B, 224 = ARABIC TATWEEL, 225 = ARABIC LETTER FEH, 226 = ARABIC LETTER QAF, 227 = ARABIC LETTER KAF)
(c) Find the display shape (glyph) using fonts and formatting software
The MIME charset tag describes the combination of character encoding scheme and coded character set mapping.
Character encoding scheme: how characters are encoded into bits and bits are decoded back into characters
Charset: the combination of a particular coded character set and a particular character encoding scheme
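The effect of the charset label can be sketched in Python: the same byte, 11100001 (225), names a different character depending on which charset is used to decode it. The helper `character_name` is a made-up name for illustration. It relies only on the standard `bytes.decode` and `unicodedata.name` calls.

```python
import unicodedata


def character_name(byte_value, charset):
    """Decode a single byte under the given charset and return the Unicode name."""
    char = bytes([byte_value]).decode(charset)
    return unicodedata.name(char)


# The data bits from Figure 16-2, interpreted under two different charsets.
for charset in ("iso-8859-6", "iso-8859-1"):
    print(charset, "->", character_name(0b11100001, charset))
```

Under iso-8859-6 the byte is ARABIC LETTER FEH, while under iso-8859-1 it is LATIN SMALL LETTER A WITH ACUTE, which is why a wrong charset label produces wrong characters.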

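Major Character Sets
• US-ASCII
  • American Standard Code for Information Interchange; the most widely used character set
  • Uses only code values 0 to 127
• ISO-8859
  • Supersets of US-ASCII that use the high bit to add characters needed for international writing
• UCS (Universal Character Set)
  • Represents all of the world's characters in a single coded character set
  • The basic set consists of about 50,000 characters
  • Has an extended code space for millions of characters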

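Character Encoding Schemes
• Fixed-width: 8-bit
  • Represents every coded character with a fixed number of bits
  • Fast to process, but may waste space
• Variable-width (nonmodal): UTF-8
  • Uses different numbers of bits for different character code numbers
  • The more frequently a character is used, the shorter its encoding
• Variable-width (modal): iso-2022-jp
  • Uses special escape patterns to switch between modes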

Nonmodal (UTF-8) vs. Modal (iso-2022-jp)
8-bit
The 8-bit fixed-width identity encoding simply encodes each character code with its corresponding 8-bit value. It supports only character sets with a code range of 256 characters. The iso-8859 family of character sets uses the 8-bit identity encoding.
UTF-8
UTF-8 is a popular character encoding scheme designed for UCS (UTF stands for "UCS Transformation Format"). UTF-8 uses a nonmodal, variable-length encoding for the character code values, where the leading bits of the first byte tell the length of the encoded character in bytes, and any subsequent byte contains six bits of code value (see Table 16-2).
If the first encoded byte has a high bit of 0, the length is just 1 byte, and the remaining 7 bits contain the character code. This has the nice result of ASCII compatibility (but not iso-8859 compatibility, because iso-8859 uses the high bit).
For example, character code 90 (ASCII "Z") would be encoded as 1 byte (01011010), while code 5073 (13-bit binary value 1001111010001) would be encoded into 3 bytes:
11100001 10001111 10010001
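The two worked examples above (codes 90 and 5073) can be checked with a short Python sketch. `utf8_bits` uses the built-in UTF-8 codec, and `encode_three_bytes` spells out the 3-byte layout (1110cccc 10cccccc 10cccccc) by hand. Both function names are made up for illustration.

```python
def utf8_bits(code_point):
    """Return the UTF-8 encoding of one code point as space-separated bit strings."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(format(byte, "08b") for byte in encoded)


def encode_three_bytes(code_point):
    """Hand-build the 3-byte UTF-8 form; only valid for codes needing 12 to 16 bits."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not use the 3-byte form")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110cccc: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10cccccc: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10cccccc: low 6 bits
    ])


print(utf8_bits(90))    # 01011010
print(utf8_bits(5073))  # 11100001 10001111 10010001
print(encode_three_bytes(5073) == chr(5073).encode("utf-8"))  # True
```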
iso-2022-jp
iso-2022-jp is a widely used encoding for Japanese Internet documents. iso-2022-jp is a variable-length, modal encoding, with all values less than 128 to prevent problems with non-8-bit-clean software.
The encoding context always is set to one of four predefined character sets. Special "escape sequences" shift from one set to another. iso-2022-jp initially uses the US-ASCII character set, but it can switch to the JIS X 0201 (JIS-Roman) character set or the much larger JIS X 0208-1978 and JIS X 0208-1983 character sets using 3-byte escape sequences.
Table 16-2. UTF-8 variable-width, nonmodal encoding
Character code bits  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
0–7                  0ccccccc  -         -         -         -         -
8–11                 110ccccc  10cccccc  -         -         -         -
12–16                1110cccc  10cccccc  10cccccc  -         -         -
17–21                11110ccc  10cccccc  10cccccc  10cccccc  -         -
22–26                111110cc  10cccccc  10cccccc  10cccccc  10cccccc  -
27–31                1111110c  10cccccc  10cccccc  10cccccc  10cccccc  10cccccc
UTF-8: the leading bits of the first byte indicate the length of the encoded character
iso-2022-jp: escape sequences set the encoding context to one of four predefined character sets
The escape sequences are shown in Table 16-3. In practice, Japanese text begins with "ESC $ @" or "ESC $ B" and ends with "ESC ( B" or "ESC ( J".
When in the US-ASCII or JIS-Roman modes, a single byte is used per character. When using the larger JIS X 0208 character set, two bytes are used per character code. The encoding restricts the bytes sent to be between 33 and 126.
Table 16-3. iso-2022-jp character set switching escape sequences
Escape sequence  Resulting coded character set  Bytes per code
ESC ( B          US-ASCII                       1
ESC ( J          JIS X 0201-1976 (JIS Roman)    1
ESC $ @          JIS X 0208-1978                2
ESC $ B          JIS X 0208-1983                2
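The mode switching can be observed with Python's built-in `iso2022_jp` codec. This is a minimal sketch. The wrapper name `encode_iso2022jp` is made up for illustration, and which escape sequence the codec picks (here ESC $ B for JIS X 0208-1983) is a property of Python's implementation rather than something the slide specifies.

```python
ESC = b"\x1b"


def encode_iso2022jp(text):
    """Encode text with Python's iso2022_jp codec and return the raw bytes."""
    return text.encode("iso2022_jp")


encoded = encode_iso2022jp("日本語")
print(encoded)

# Japanese text is bracketed by escape sequences: switch to the 2-byte set, then back to US-ASCII.
print(encoded.startswith(ESC + b"$B"), encoded.endswith(ESC + b"(B"))

# Every byte stays below 128, so the data survives software that is not 8-bit clean.
print(all(byte < 128 for byte in encoded))

# Three characters at 2 bytes each, plus two 3-byte escape sequences.
print(len(encoded))
```

Decoding with the same codec (`encoded.decode("iso2022_jp")`) returns the original text, because the decoder tracks the current mode as it reads the escape sequences.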

Language Tags
Language tags can represent, among others:
• Regional languages (as in "sgn-US-MA" for Martha's Vineyard sign language)
• Standardized nonvariant languages (e.g., "i-navajo")
• Nonstandard languages (e.g., "x-snowboarder-slang")
Subtags
Language tags have one or more parts, separated by hyphens, called subtags:
• The first subtag is called the primary subtag. The values are standardized.
• The second subtag is optional and follows its own naming standard.
• Any trailing subtags are unregistered.
The primary subtag contains only letters (A–Z). Subsequent subtags can contain letters or numbers, up to eight characters in length. An example is shown in Figure 16-9.
Capitalization
All tags are case-insensitive: the tags "en" and "eN" are equivalent. However, lowercasing conventionally is used to represent general languages, while uppercasing is used to signify particular countries. For example, "fr" means all languages classified as French, while "FR" signifies the country France.
IANA Language Tag Registrations
The values of the first and second language subtags are defined by various standards.
Figure 16-9. Language tags are separated into subtags
sgn-US-MA (Martha's Vineyard sign language): first subtag "sgn" (sign language), second subtag "US" (America), third subtag "MA" (Massachusetts regional variant)
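Language tag: a short, standardized string that names a language
• First subtag: a language token from the ISO 639 set of language standards
• Second subtag: a code chosen from the ISO 3166 set of country code and region standards
• Third subtag: for extensions; no special rules
• e.g., en-US, en-GB …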
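The subtag rules above can be sketched as a small Python check. It follows only what the slide states: a letters-only primary subtag, later subtags of letters or digits up to eight characters, and case-insensitive comparison. The function names `split_language_tag` and `same_language_tag` are made up for illustration, and the sketch does not consult the ISO 639 or ISO 3166 registries.

```python
def split_language_tag(tag):
    """Split a language tag into lowercased subtags, validating the basic syntax."""
    subtags = tag.split("-")
    primary = subtags[0]
    # The primary subtag contains only letters (A-Z, either case).
    if not (primary.isascii() and primary.isalpha()):
        raise ValueError(f"invalid primary subtag: {primary!r}")
    for subtag in subtags[1:]:
        # Subsequent subtags: letters or digits, at most eight characters.
        if not (subtag.isascii() and subtag.isalnum() and len(subtag) <= 8):
            raise ValueError(f"invalid subtag: {subtag!r}")
    return [subtag.lower() for subtag in subtags]


def same_language_tag(a, b):
    """Tags are case-insensitive, so compare them after lowercasing."""
    return a.lower() == b.lower()


print(split_language_tag("sgn-US-MA"))  # ['sgn', 'us', 'ma']
print(same_language_tag("en", "eN"))    # True
```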

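Internationalized URIs
URIs consist only of US-ASCII characters, so that identifiers stay readable and shareable.
URI escapes: a way to safely insert reserved or otherwise unsupported characters into a URI (using the % character).
URIs are sometimes built from unescaped filenames that contain international characters. This is incorrect and may cause problems with some applications.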
Figure 16-10. URI characters are transported as escaped code bytes but processed unescaped
Conceptual characters (external form: email, web, billboard, radio): "Big Sale at Joe's", http://www.joes-hardware.com/big%20sale.txt
URI code bytes (what you enter and send, in the current character set): ... o=111 m=109 /=47 b=98 i=105 g=103 %=37 2=50 0=48 s=115 ...
Unescaped ASCII code bytes (what you process, in the US-ASCII character set): ... 111 109 47 98 105 103 32 115 ...
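The escape and unescape steps in the figure can be sketched with Python's standard `urllib.parse` module. The wrapper names `escape_path` and `unescape_path` are made up for illustration. Note that `quote` percent-encodes the UTF-8 bytes of non-ASCII characters by default, which is a choice of Python's implementation and not something the slide prescribes.

```python
from urllib.parse import quote, unquote


def escape_path(path):
    """Percent-escape a URI path; '/' is left alone because it is in quote()'s default safe set."""
    return quote(path)


def unescape_path(path):
    """Turn %XX escapes back into characters before processing the path."""
    return unquote(path)


print(escape_path("/big sale.txt"))      # /big%20sale.txt
print(unescape_path("/big%20sale.txt"))  # /big sale.txt

# An international filename is escaped byte by byte, so the URI itself stays US-ASCII.
print(escape_path("/한글.txt"))          # /%ED%95%9C%EA%B8%80.txt
```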

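Other Considerations
• HTTP headers
  • Must consist only of characters from the US-ASCII character set
• Dates
  • Using the proper GMT date format is recommended
• Domain names
  • Internationalized Domain Names (IDN)
  • Most web browsers support Punycode
  • Punycode: a way to convert a Unicode string into characters that are allowed in hostnames
  • e.g., 한글.com -> xn--bj0bj06e.com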
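Two of the points above can be tried with Python's standard library. This is a minimal sketch. The function names are made up for illustration, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so it may differ from what current browsers do for some hostnames.

```python
from email.utils import formatdate


def to_ascii_hostname(hostname):
    """Convert a Unicode hostname to its Punycode (xn--) form."""
    return hostname.encode("idna").decode("ascii")


def from_ascii_hostname(hostname):
    """Convert a Punycode hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")


def http_date(timestamp):
    """Format a Unix timestamp as a GMT date string suitable for HTTP headers."""
    return formatdate(timestamp, usegmt=True)


print(to_ascii_hostname("한글.com"))             # xn--bj0bj06e.com
print(from_ascii_hostname("xn--bj0bj06e.com"))  # 한글.com
print(http_date(0))                             # Thu, 01 Jan 1970 00:00:00 GMT
```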

References
• David Gourley, Brian Totty, Marjorie Sayer, Sailu Reddy, Anshu Aggarwal. HTTP: The Definitive Guide (HTTP 완벽 가이드), translated by 이응준 and 정상일. Mapo-gu, Seoul: Insight (인사이트), 2014.
