Multi lingual text-processing

MULTI-LINGUAL TEXT PROCESSING
Kyubyong Park @ Kakao Brain
https://github.com/kyubyong/mtp

WHY MULTI-LINGUAL TEXT PROCESSING?
Yes! Modeling is fancy. Data processing is tedious. You
don't want to do it. I know. But in my experience
it's often data processing, rather than modeling, that
determines the performance of your experiment.
If you can't avoid it, it's better to do it right.

WHY MULTI-LINGUAL TEXT PROCESSING?
You can learn image processing techniques from
many sources. More importantly, I'm not an
expert in it. Let me focus on text, which is one of the
two most typical modalities, along with sound, when
handling language.

WHY MULTI-LINGUAL TEXT PROCESSING?
If you're interested in a single language, say, English,
that's fine. But if you have to work with a language you're
not familiar with, you will need some knowledge
of it.

BASIC TEXT PROCESSING
(Main source: Lecture slides from the Stanford
Coursera course)

REGULAR EXPRESSIONS
Syntax for processing strings
LIBRARY regex (third-party): You can use Unicode
property expressions such as '\p{Han}' for all
Chinese characters and '\p{Latin}' for the Latin
script.
ONLINE https://regexr.com/
SOFTWARE PowerGrep
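
A rough, standard-library-only sketch of script-based matching. It assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an acceptable stand-in for '\p{Han}'; the third-party regex library's '\p{Han}' covers more characters. The name han_runs is made up for this example.

```python
import re

# Assumption: approximate the Han script with the basic
# CJK Unified Ideographs block only (U+4E00..U+9FFF).
HAN_RUN = re.compile(r'[\u4e00-\u9fff]+')

def han_runs(text):
    """Return every maximal run of (basic) Han characters in text."""
    return HAN_RUN.findall(text)

print(han_runs("我爱你 means 'I love you'"))
```

With the regex library installed, the same idea would be written with '\p{Han}+' instead of the explicit range.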

TOKENIZATION
Token: a unit like character, subword (bpe), word,
mwe, sentence, etc.
Character
Simple (+)
Small vocabulary (< 100) (+)
Robust to rare words (+)
Long sequence (-)

Subword
Best performance in machine translation (+)
Robust to rare words (+)
Not intuitive (-)
Data-dependent (-)

Word
Usually simple (+)
Short sequence (+)
Transfer learning (+)
Large vocabulary (> 10000) (-)
Weak in rare words (-)

MWE (Multi-word expression)
Idioms e.g., ‘kick the bucket’
Compounds e.g., ‘San Francisco’
Phrasal verbs e.g. ‘get … across’
PROJECT Multiword Expression Project

Sentence
Usually identified by a sentence-ending symbol (.!?)
Period (.) is sometimes ambiguous.
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
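
A minimal sketch of why the period is ambiguous. The splitter below is a naive rule invented for this illustration (split after '.', '!' or '?' when whitespace follows), not a recommended tokenizer.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Clean input: the rule works.
print(naive_sentences("It works! Does it? Yes."))

# The abbreviation "Dr." is wrongly treated as a sentence end,
# while "4.3" survives only because no space follows its period.
print(naive_sentences("Dr. Smith paid 4.3 dollars."))
```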

LEMMATIZATION
Lemma: the canonical or dictionary form of a set of
words
E.g., produce, produced, production -> produce
WHY? Dictionary lookup
HOW? Linguistic knowledge
LIBRARY nltk wordnet lemmatizer

STEMMING
Stem: the part of the word that never changes even
when morphologically inflected
E.g., produce, produced, production -> produc-
WHY? Query-document match
HOW? Sequence of rules
LIBRARY nltk stemmers
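
A toy sketch of "sequence of rules". The three suffix rules are hypothetical and chosen only to reproduce the example above; this is not the Porter algorithm or any nltk stemmer.

```python
# Hypothetical toy rules, tried in order; the first matching suffix is removed.
RULES = ("tion", "ed", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ("produce", "produced", "production")])
```

All three words reduce to 'produc', which is what makes a query match a document regardless of inflection.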

UNICODE NORMALIZATION
(Main source: unicode.org)

Canonical equivalence: a fundamental equivalency
between characters which represent the same
abstract character
E.g., combining sequence: Ç ↔ C + ◌̧ (combining cedilla)
E.g., ordering of combining marks: q + ◌̇ + ◌̣ ↔
q + ◌̣ + ◌̇ (dot above and dot below)

Compatibility equivalence: a weaker type of
equivalence between characters which represent
the same abstract character, but which may have
distinct visual appearances or behaviors
E.g., circled variants: ①→ 1
E.g., width variants: ｶ→ カ

NFD: Canonical Decomposition
NFKD: Compatibility Decomposition
NFC: NFD + Canonical Composition
NFKC: NFKD + Canonical Composition

Typically NFC is desirable for string matching.
NFKC is useful if you don't want to distinguish
compatibility-equivalent characters like full- and half-
width characters.
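
A short sketch of the four forms in practice, using only the standard unicodedata module. The sample characters are the ones from the equivalence examples above.

```python
import unicodedata

composed = "\u00C7"      # Ç as a single code point
decomposed = "C\u0327"   # C followed by a combining cedilla

# The two strings render the same but compare unequal until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# NFKC additionally folds compatibility variants.
print(unicodedata.normalize("NFKC", "\u2460"))  # circled digit one
print(unicodedata.normalize("NFKC", "\uFF76"))  # half-width katakana KA

# NFC leaves compatibility characters untouched.
print(unicodedata.normalize("NFC", "\u2460") == "\u2460")
```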
Strip diacritics: to ASCII characters
import unicodedata
def strip_diacritics(text):
    # Decompose (NFD), then drop nonspacing marks (category 'Mn').
    return ''.join(char for char in unicodedata.normalize('NFD', text)
                   if unicodedata.category(char) != 'Mn')

WRITING SYSTEMS
(Main source: omniglot)

ALPHABETS
Each letter corresponds to one or more phonemes.
Latin alphabet (AaBbCc), Cyrillic alphabet
(кириллица), Hangul (한글)

There is a fixed order.
Consonants and vowels stand alone.
Desirable for computer processing.

ABJADS (= CONSONANT ALPHABETS)
Each letter stands for a consonant, leaving the
reader to supply the vowel.
"Cn y ndrstnd ths?"
Arabic script, Hebrew script (עִבְרִית)

Hard to learn (See this discussion)
Challenging for processing

ABUGIDAS
Consonants (Primary) + Vowels (Secondary)
Devanagari (देवनागरी), Tamil (தமிழ்)

SYLLABARIES
Each letter corresponds to a syllable that is not further
decomposed.
Hiragana (ひらがな), Katakana (カタカナ)
Phonemic transcription is often useful.
E.g., かわいい -> ka wa i i

LOGOGRAPHS
Each letter represents an abstract concept.
Chinese characters
Many letters
Challenging for processing
Phonemic transcription is often useful.
E.g., 我爱你 -> wǒ ài nǐ

IPA (INTERNATIONAL PHONETIC ALPHABET)
Universal alphabet
Each distinctive sound is represented as a single
letter. (/sh/ -> /ʃ/, /th/ -> /θ/, /ng/ -> /ŋ/)
Slashes (/ /) for phonemic transcription (e.g., 'pin'
/pɪn/ vs. 'spin' /spɪn/)
Square brackets ([ ]) for phonetic transcription. (e.g.,
'pin' [pʰɪn] vs. 'spin' [spɪn])
IPA Chart

ARPABET
Represents phonemes of American English with
ASCII characters.
Has been used in speech synthesis.
Used in the CMU Pronouncing Dictionary and the
TIMIT dataset.
ARPABET Symbols

ARABIC
CHAR SET [\p{Arabic}.0-9،!؟]
Written from right to left
Cursive
No distinct upper and lower case letter forms
Comma (،) and question mark (؟) are different from
those of English.
Many dialects with varying orthographies exist.
Clitics are attached to a stem without any orthographic
marks like an apostrophe. (See Fahad Alotaiby et al.)
مستواك "your level" -> ك "your" + مستوى "level"
TOOL Stanford Arabic Segmenter

DUTCH
CHAR SET [ A-Za-z.!?'\-0-9]
Digraph 'ij' is considered the same as 'y'. (See this)

ENGLISH
CHAR SET [ A-Za-z.!?'\-0-9]
Diacritics are optional.
E.g., naïve = naive, façade = facade, résumé =
resume
Period (.) is used at the end of a sentence or for
abbreviations.
E.g., etc., i.e., e.g.
Most hyphens in compounds can be replaced with a
space.
E.g., state-of-the-art = state of the art

Apostrophe (') can construct clitics.
E.g. I'm (=I am), we've (=we have)
The closing quotation mark (’) and apostrophe (')
are often mixed up. (Read this)
Many words have more than one spelling. (E.g., gray
/ grey)

Graphemes and phonemes are not directly linked. In
other words, it's not always possible to infer the
pronunciation of a word from its spelling. Therefore
in speech synthesis a preprocessor that converts
graphemes to phonemes is often used. (Check
English g2p)

Compared to such languages as Chinese, Japanese,
or Thai, tokenization is not so important. You can
simply divide text into sentences by [.!?] and into
words by white space, at some cost in
accuracy. (Check nltk tokenize)

Identifying multi-word expressions is not always
easy.

FRENCH
CHAR SET [ A-Za-zçÉéÀàÈèÙùÂâÊêÎîÔôÛûœæ.!?'\-0-9]
Diacritics on capital letters are often ignored.
The two ligatures 'œ' and 'æ' are mostly the same as 'oe'
and 'ae', respectively.
Hyphen (-) is used before a pronoun in imperative
sentences.
Donne-les-moi ! "Give them to me!"
Clitics with an apostrophe (')
E.g., je t'aime "I love you"

GERMAN
CHAR SET [ A-Za-zÄäÖöÜüẞß.!?'\-0-9]
Nouns are capitalized.
No space for compound nouns (Check compound
splitter)
E.g., Rinderwahnsinn "mad cow syndrome"
'ß' and 'ss' are interchangeable.

GREEK
CHAR SET [ \p{Greek}.!;'\-0-9]
β (beta), θ (theta), and χ (chi) are used as phonetic
symbols in the IPA.
The letter sigma 'Σ' has two different lowercase
forms, 'σ' and 'ς'. 'ς' is used in word-final position
and 'σ' elsewhere. (Read this)
Semicolon (;) is used as a question mark.
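
A small sketch of how the two lowercase sigmas show up in code, using Python's built-in string methods only. The helper name same_word is made up for this illustration.

```python
def same_word(a, b):
    """Caseless comparison; casefold() also unifies the two sigmas."""
    return a.casefold() == b.casefold()

# str.lower() picks the word-final form for a capital sigma at the end.
print("ΟΔΟΣ".lower())

# casefold() maps the final form back to the regular one.
print("\u03c2".casefold() == "\u03c3")
print(same_word("ΟΔΟΣ", "ΟΔΟΣ".lower()))
```

Comparing with casefold() rather than lower() avoids mismatches between 'σ' and 'ς'.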

HINDI
CHAR SET [\p{Devanagari}0-9|?!]
Vertical line (|) is used at the end of a sentence.
The Indian numbering system groups digits differently:
the last three together, then in twos.
E.g., 1,00,00,00,000
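
A minimal sketch of the grouping shown above, for non-negative integers only. The function name format_indian is made up for this example.

```python
def format_indian(n):
    """Group a non-negative integer Indian style: last 3 digits, then 2s."""
    digits = str(n)
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two digits at a time off the right end of the remaining head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])

print(format_indian(1000000000))
```

When parsing such text, removing the commas before converting to a number sidesteps the grouping difference entirely.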

JAPANESE
CHAR SET [\p{Hiragana}\p{Katakana}\p{Han}A-Za-z0-9０-９。、？！]
No space between words
Both full- and half-width Arabic numerals are used.
Note that period, comma, question mark, and
exclamation mark are different from English ones.

People often depend on Romanization to input
Japanese in the digital setting. Romanization to
Japanese conversion is very important. (Check this)
A morphological analyzer functions as a tokenizer and a
grapheme to phoneme converter. (Check MeCab)
When は /ha/ is used as a topic marker it is
pronounced as /wa/.

KOREAN
CHAR SET [ \p{Hangul}A-Za-z.!?0-9]
Consonants and vowels, called 'jamo' in Korean,
combine to form a syllable, which has an
independent code point.
E.g., ㅎ (314E) + ㅏ (314F) + ㄴ (3134) -> 한 (D55C)
There are two types of jamo: Hangul Compatibility Jamo
and Hangul Jamo.

Hangul Compatibility Jamo (U+3130-U+318F)
Composes a syllable
Used on computer keyboards
The consonants in the onset and the coda are
identical.
Hangul Jamo (U+1100-U+11FF)
Used mostly when representing old Hangul
The consonants in the onset and the coda are
NOT identical.
If you need to decompose Hangul syllables,
Hangul Jamo is better than Hangul Compatibility
Jamo. (Check this)
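
A sketch of syllable composition and decomposition with the standard library. The index arithmetic follows the layout of the precomposed Hangul Syllables block (19 leads x 21 vowels x 28 tails, starting at U+AC00); the function name compose is made up for this example.

```python
import unicodedata

def compose(lead, vowel, tail=0):
    """Build a precomposed syllable from 0-based jamo indices."""
    # 19 leads x 21 vowels x 28 tails (tail index 0 means "no coda").
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

# U+D55C: lead index 18 (h), vowel index 0 (a), tail index 4 (n)
print(compose(18, 0, 4) == "\uD55C")

# NFD decomposes into Hangul Jamo (U+1100 block),
# not Hangul Compatibility Jamo (U+3130 block).
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\uD55C")])
```

Because the onset and coda have separate code points in Hangul Jamo, the decomposed form keeps their positions distinct.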

Orthography is notoriously difficult. For that reason
you can't expect unofficial writing to obey the
rules.
A grammar checker is hard to make. (But surprisingly
there is a decent one. Check this)
Like German, many compounds are created by
merging two words without a space.
E.g., 점심시간 "lunch time" (= 점심 "lunch" + 시간
"time")

Hangul is phonetic, but the current orthography
policy respects the origin of words rather than
reflecting the sound itself. As a result, the actual
pronunciation of some words differs from their
spelling.
E.g., 독립 dok rip (spelling) -> /dong nip/
(pronunciation) "independence"
TOOL Python-jamo: Hangul syllable decomposition
and synthesis library
TOOL KoG2P

MANDARIN
CHAR SET [\p{Han}。、，！？0-9]
There are two types of commas: ， and 、.
Ideographic comma (、) is used when enumerating
items in a list (e.g., 红色、白色、黄色 "red, white, and
yellow").
Pinyin, the standard Romanization system for
Mandarin, is used.

5 different tones are distinguished in pinyin; the neutral tone carries no diacritic.
mā (high level)
má (rising)
mǎ (falling and rising)
mà (falling)
ma (neutral)
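
A sketch that converts tone-marked pinyin syllables to numbered form by decomposing them with NFD. Writing the neutral tone as 5 is an assumption made here (some schemes use 0 or nothing), and the function name to_numbered is made up for this example.

```python
import unicodedata

# Combining marks that carry pinyin tones 1-4:
# macron, acute, caron, grave.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030C": "3", "\u0300": "4"}

def to_numbered(syllable):
    """Convert one tone-marked pinyin syllable to numbered form."""
    tone = "5"  # assumption: the neutral tone is written as 5
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so that other diacritics (e.g. the dots of ü) stay attached.
    return unicodedata.normalize("NFC", "".join(kept)) + tone

print([to_numbered(s) for s in ("mā", "má", "mǎ", "mà", "ma")])
```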
There are two types of characters: simplified and
traditional. The former is used in the mainland,
whereas the latter is used in Taiwan and Korea.

Check this to see the list of characters that are
used differently in Chinese, Japanese, and Korean.
Typically people type pinyin to input Chinese
characters in the digital setting. The pinyin to
Chinese conversion is very important. (Check this)
TOOL pypinyin: a Python project for getting pinyin
for Chinese words or sentences
TOOL Jieba: Chinese text segmentation module
TOOL hanziconv: a tool that converts between simplified
and traditional Chinese characters

PERSIAN
CHAR SET [ \p{Arabic}.0-9،!؟]
Check Arabic
When a Zero-Width Non-Joiner (ZWNJ) is used
between two characters, it forces a final form on the
preceding character. (See this)

PORTUGUESE
CHAR SET [ \p{Latin}.?!'\-0-9]
The hyphen (-) is used to make compound words
E.g., levaria + vos + os = levar-vo-los-ia "I would take
them to you"

RUSSIAN
CHAR SET [ \p{Cyrillic}.!?'\-0-9]

SPANISH
CHAR SET [ \p{Latin}.!?¿'\-0-9]
¿ is used at the beginning of an interrogative
sentence, pairing with ?.

THAI
CHAR SET [ \p{Thai}.!?0-9]
Space is used as a sentence separator or comma.
TOOL pythai: A collection of tools for working with
the Thai language in Python

VIETNAMESE
CHAR SET [ \p{Latin}.!?'\-0-9]
6 different tones are marked by diacritics.
a (mid level)
à (low falling)
ả (mid falling)
ã (glottalized rising)
á (high rising)
ạ (glottalized falling)

Spaces are used to separate syllables, not words.
E.g., thuế thu nhập cá nhân -> thuế "tax" + thu_nhập
"income" + cá_nhân "individual"
INFO word segmentation tools
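
A toy sketch of joining syllables into words by greedy longest match. The two-entry lexicon is hypothetical and exists only to reproduce the example above; real segmentation tools rely on large dictionaries or trained models.

```python
import unicodedata

def nfc(s):
    """Normalize to NFC so that tone-marked syllables compare reliably."""
    return unicodedata.normalize("NFC", s)

# Hypothetical mini-lexicon of multi-syllable words, for illustration only.
LEXICON = {tuple(nfc(w).split()) for w in ("thu nhập", "cá nhân")}
MAX_LEN = 2  # longest lexicon entry, in syllables

def segment(sentence):
    """Greedy longest match: join syllables that form a lexicon word."""
    syllables = nfc(sentence).split()
    words, i = [], 0
    while i < len(syllables):
        for n in range(MAX_LEN, 0, -1):
            chunk = tuple(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or chunk in LEXICON:
                words.append("_".join(chunk))
                i += n
                break
    return words

print(segment("thuế thu nhập cá nhân"))
```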
