MULTI-LINGUAL TEXT PROCESSING
Kyubyong Park @ Kakao Brain
https://github.com/kyubyong/mtp
WHY MULTI-LINGUAL TEXT PROCESSING?
Yes! Modeling is fancy. Data processing is tedious. You
don't want to do it. I know. But in my experience,
it's often data processing, rather than modeling, that
determines the performance of your experiment.
If you can't avoid it, it's better to do it right.
WHY MULTI-LINGUAL TEXT PROCESSING?
You can learn image processing techniques
through many routes. More importantly, I'm not an
expert in it. Let me focus on text, which, along with
sound, is one of the two most typical modalities when
handling language.
WHY MULTI-LINGUAL TEXT PROCESSING?
If you're interested in a single language, say, English,
that's fine. But if for some reason you work with a
language you're not familiar with, you may need some
knowledge of it.
BASIC TEXT PROCESSING
(Main source: Lecture slides from the Stanford
Coursera course)
REGULAR EXPRESSIONS
Syntax for processing strings
LIBRARY regex (third-party): You can use Unicode
script expressions such as '\p{Han}' for all
Chinese characters and '\p{Latin}' for the Latin
script.
ONLINE https://regexr.com/
SOFTWARE PowerGrep
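A rough sketch of script-based matching follows. The third-party regex library accepts '\p{Han}' directly, but the standard-library re module does not, so this example approximates the Han script with the basic CJK Unified Ideographs block (U+4E00-U+9FFF). The function name find_han_runs and the chosen range are assumptions for illustration; the range misses the CJK extension blocks.

```python
import re

# Approximation of \p{Han}: the basic CJK Unified Ideographs block only.
# (Assumption: extension blocks are ignored to keep the sketch short.)
HAN_RUN = re.compile(r'[\u4e00-\u9fff]+')

def find_han_runs(text):
    """Return the maximal runs of basic-block Chinese characters in text."""
    return HAN_RUN.findall(text)

if __name__ == "__main__":
    print(find_han_runs("I said \u6211\u7231\u4f60 and \u4f60\u597d!"))
```

With the regex library installed, the pattern could presumably be written as '\p{Han}+' instead of an explicit range.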
TOKENIZATION
Token: a unit like character, subword (bpe), word,
mwe, sentence, etc.
Character
Simple (pro)
Small vocabulary (< 100) (pro)
Robust to rare words (pro)
Long sequence (con)
Subword
Best performance in machine translation (pro)
Robust to rare words (pro)
Not intuitive (con)
Data-dependent (con)
Word
Usually simple (pro)
Short sequence (pro)
Transfer learning (pro)
Large vocabulary (> 10000) (con)
Weak in rare words (con)
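The sequence-length versus vocabulary-size trade-off between character and word tokens can be seen on a toy sentence. This is only an illustrative sketch: the helper names and the sample text are assumptions, and whitespace splitting stands in for a real word tokenizer.

```python
def char_tokens(text):
    """Character-level tokens: every character, spaces included."""
    return list(text)

def word_tokens(text):
    """Word-level tokens: naive whitespace split."""
    return text.split()

def summarize(tokens):
    """Return (sequence length, vocabulary size) for a token list."""
    return len(tokens), len(set(tokens))

if __name__ == "__main__":
    sample = "the cat sat on the mat"
    print("char:", summarize(char_tokens(sample)))
    print("word:", summarize(word_tokens(sample)))
```

On real corpora the gap is far larger: the character vocabulary stays small while the word vocabulary keeps growing with the data.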
MWE (Multi-word expression)
Idioms e.g., ‘kick the bucket’
Compounds e.g., ‘San Francisco’
Phrasal verbs e.g. ‘get … across’
PROJECT Multiword Expression Project
Sentence
Usually identified by a sentence-ending symbol (.!?)
Period (.) is sometimes ambiguous.
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
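A minimal sketch of a sentence splitter that copes with the two ambiguous cases above. The abbreviation list and the function name are assumptions for illustration; a sentence that actually ends in an abbreviation would not be split by this heuristic.

```python
# Illustrative, far-from-complete abbreviation list (assumption).
ABBREVIATIONS = {"inc.", "dr.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is a known abbreviation.

    Periods inside numbers such as 4.3 or .02% are left alone because
    only the last character of each whitespace-separated token is checked.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a sentence-ending symbol
        sentences.append(" ".join(current))
    return sentences

if __name__ == "__main__":
    for s in split_sentences("Dr. Kim joined Acme Inc. in 2010. Growth was 4.3 percent!"):
        print(s)
```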
NORMALIZATION
LEMMATIZATION
Lemma: the canonical or dictionary form of a set of
words
E.g., produce, produced, production -> produce
WHY? Dictionary lookup
HOW? Linguistic knowledge
LIBRARY nltk wordnet lemmatizer
STEMMING
Stem: the part of the word that never changes even
when morphologically inflected
E.g., produce, produced, production -> produc-
WHY? Query-document match
HOW? Sequence of rules
LIBRARY nltk stemmers
UNICODE NORMALIZATION
(Main source: unicode.org)
Canonical equivalence: a fundamental equivalency
between characters which represent the same
abstract character
E.g., combining sequence: Ç (U+00C7) ↔ C +
combining cedilla (U+0327)
E.g., ordering of combining marks: q + dot above
(U+0307) + dot below (U+0323) ↔ q + dot below
(U+0323) + dot above (U+0307)
Compatibility equivalence: a weaker type of
equivalence between characters which represent
the same abstract character, but which may have
distinct visual appearances or behaviors
E.g., circled variants: ①→ 1
E.g., width variants: ｶ (half-width) → カ (full-width)
NFD: Canonical Decomposition
NFKD: Compatibility Decomposition
NFC: NFD + Canonical Composition
NFKC: NFKD + Canonical Composition
Examples
Typically NFC is desirable for string matching.
NFKC is useful if you don't want to distinguish
compatibility-equivalent characters like full- and half-
width characters.
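Strip diacritics: reduce accented letters to their base characters
import unicodedata

def strip_diacritics(text):
    # Decompose (NFD), then drop nonspacing combining marks (category 'Mn').
    return ''.join(char for char in unicodedata.normalize('NFD', text)
                   if unicodedata.category(char) != 'Mn')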
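The four normalization forms can be tried out with the standard-library unicodedata module. The sketch below is illustrative only; the helper name equivalent_under is an assumption, and the sample characters are the ones from the equivalence examples above.

```python
import unicodedata

def equivalent_under(form, a, b):
    """True if strings a and b are identical after normalizing to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00c7"     # Ç as a single code point
decomposed = "C\u0327"  # C followed by a combining cedilla

if __name__ == "__main__":
    # Canonically equivalent strings differ as raw code points...
    print(composed == decomposed)
    # ...but match after NFC (or NFD) normalization.
    print(equivalent_under("NFC", composed, decomposed))
    # Compatibility variants are only folded by the K forms.
    print(equivalent_under("NFC", "\u2460", "1"))    # circled digit one
    print(equivalent_under("NFKC", "\u2460", "1"))
    print(unicodedata.normalize("NFKC", "\uff11\uff12\uff13"))  # full-width digits
```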
WRITING SYSTEMS
(Main source: omniglot)
ALPHABETS
Each letter corresponds to one or more phonemes.
Latin alphabet (AaBbCc), Cyrillic alphabet
(кириллица), Hangul (한글)
Hangul
There is a fixed order.
Consonants and vowels stand alone.
Desirable for computer processing.
ABJADS (= CONSONANT ALPHABETS)
Each letter stands for a consonant, leaving the
reader to supply the vowel.
"Cn y ndrstnd ths?"
Arabic script (العربية), Hebrew script (עברית)
'book' in Arabic: كتاب (= 'kitaab')
Hard to learn (See this discussion)
Challenging for processing
ABUGIDAS
Consonants (Primary) + Vowels (Secondary)
Devanagari (देवनागरी), Tamil (தமிழ்)
Devanagari compounds
SYLLABARIES
Each letter corresponds to a syllable that is not
further decomposed.
Hiragana (ひらがな), Katakana (カタカナ)
Phonemic transcription is often useful.
E.g., かわいい -> ka wa i i
LOGOGRAPHS
Each letter represents an abstract concept.
Chinese characters
Many letters
Challenging for processing
Phonemic transcription is often useful.
E.g., 我爱你 -> wǒ ài nǐ
IPA (INTERNATIONAL PHONETIC ALPHABET)
Universal alphabet
Each distinctive sound is represented as a single
letter. (/sh/ -> /ʃ/, /th/ -> /θ/, /ng/ -> /ŋ/)
Slashes (/ /) for phonemic transcription (e.g., 'pin'
/pɪn/ vs. 'spin' /spɪn/)
Square brackets ([ ]) for phonetic transcription. (e.g.,
'pin' [pʰɪn] vs. 'spin' [spɪn])
IPA Chart
ARPABET
Represents phonemes of American English with
ASCII characters.
Has been used in speech synthesis.
Used in the CMU Pronouncing Dictionary and the
TIMIT dataset.
ARPABET Symbols
LANGUAGES
ARABIC
CHAR SET [\p{Arabic}.0-9،!؟]
Written from right to left
Cursive
No distinct upper and lower case letter forms
Comma (،) and question mark (؟) are different from
those of English.
Many dialects with varying orthographies exist.
Clitics are attached to a stem without any orthographic
marks such as an apostrophe. (See Fahad Alotaiby et al.)
مستواك "your level" -> ك "your" + مستوى "level"
TOOL Stanford Arabic Segmenter
DUTCH
CHAR SET [ A-Za-z.!?'-0-9]
Digraph 'ij' is considered the same as 'y'. (See this)
ENGLISH
CHAR SET [ A-Za-z.!?'-0-9]
Diacritics are optional.
E.g., naïve = naive, façade = facade, résumé =
resume
Period (.) is used at the end of a sentence or for
abbreviations.
E.g., etc., i.e., e.g.
Most hyphens in compounds can be replaced with a
space.
E.g., state-of-the-art = state of the art
Apostrophe (') can construct clitics.
E.g., I'm (= I am), we've (= we have)
The closing quotation mark (’) and apostrophe (')
are often mixed up. (Read this)
Many words have more than one spelling. (E.g., gray
/ grey)
Graphemes and phonemes are not directly linked. In
other words, it's not always possible to infer the
pronunciation of a word from its spelling. Therefore,
in speech synthesis a preprocessor that converts
graphemes to phonemes is often used. (Check
English g2p)
Compared to languages such as Chinese, Japanese,
or Thai, tokenization is less of a problem. You can
simply divide text into sentences by [.!?] and into
words by white space, at some cost in accuracy.
(Check nltk tokenize)
Identifying multi-word expressions is not always
easy.
FRENCH
CHAR SET [ A-Za-zçÉéÀàÈèÙùÂâÊêÎîÔôÛûœæ.!?'-0-9]
Diacritics on capital letters are often ignored.
The two ligatures 'œ' and 'æ' are mostly equivalent
to 'oe' and 'ae', respectively.
Hyphen (-) is used before a pronoun in imperative
sentences.
E.g., Donne-les-moi ! "Give them to me!"
Clitics with an apostrophe (')
E.g., je t'aime "I love you"
GERMAN
CHAR SET [ A-Za-zÄäÖöÜüẞß.!?'-0-9]
Nouns are capitalized.
No space for compound nouns (Check compound
splitter)
E.g., Rinderwahnsinn "mad cow syndrome"
'ß' and 'ss' are interchangeable.
GREEK
CHAR SET [ \p{Greek}.!;'-0-9]
β (beta), θ (theta), and χ (chi) are used as phonetic
symbols in the IPA.
The letter sigma 'Σ' has two different lowercase
forms, 'σ' and 'ς'. 'ς' is used in word-final position
and 'σ' elsewhere. (Read this)
Semicolon (;) is used as a question mark.
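The word-final sigma rule above can be applied with two regular-expression substitutions. This is a minimal sketch under the assumption that "word-final" means "not followed by another word character"; the function name is hypothetical.

```python
import re

def fix_final_sigma(text):
    """Use 'ς' at the end of a word and 'σ' everywhere else."""
    # A final-form sigma followed by a word character is not really final.
    text = re.sub(r'ς(?=\w)', 'σ', text)
    # A regular sigma with no word character after it is word-final.
    return re.sub(r'σ(?!\w)', 'ς', text)

if __name__ == "__main__":
    print(fix_final_sigma("σοφοσ"))
```

This is useful after lowercasing or other character-level transformations that may leave the wrong sigma form behind.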
HINDI
CHAR SET [\p{Devanagari}0-9|?!]
Vertical line (|) is used at the end of a sentence.
The Indian numbering system groups digits differently:
the last three digits, then groups of two.
E.g., 1,00,00,00,000
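A small sketch of the grouping rule: the last three digits form one group and the remaining digits are grouped in pairs from the right. The function name format_indian is an assumption; it handles integers only.

```python
def format_indian(n):
    """Format an integer with Indian digit grouping (3 digits, then 2, 2, ...)."""
    digits = str(abs(n))
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    groups.append(tail)
    return ("-" if n < 0 else "") + ",".join(groups)

if __name__ == "__main__":
    print(format_indian(1000000000))
    print(format_indian(123456))
```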
JAPANESE
CHAR SET [\p{Hiragana}\p{Katakana}\p{Han}A-Za-z0-9０-９。、？！]
No space between words
Both full- and half-width Arabic numerals are used.
Note that period, comma, question mark, and
exclamation mark are different from English ones.
People often depend on Romanization to input
Japanese in the digital setting. Romanization to
Japanese conversion is very important. (Check this)
A morphological analyzer functions as a tokenizer and
a grapheme-to-phoneme converter. (Check MeCab)
When は /ha/ is used as a topic marker, it is
pronounced as /wa/.
KOREAN
CHAR SET [ \p{Hangul}A-Za-z.!?0-9]
Consonants and vowels, called 'jamo' in Korean,
combine to form a syllable, which has an
independent code point.
E.g., ㅎ (314E) + ㅏ (314F) + ㄴ (3134) -> 한 (D55C)
Jamo has two types: Hangul Compatibility Jamo
and Hangul Jamo.
Hangul Compatibility Jamo (U+3130-U+318F)
Composes a syllable
Used on computer keyboards
The consonants in the onset and the coda are
identical.
Hangul Jamo (U+1100-U+11FF)
Used mostly when representing old Hangul
The consonants in the onset and the coda are
NOT identical.
If you need to decompose Hangul syllables,
Hangul Jamo is better than Hangul Compatibility
Jamo. (Check this)
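Decomposition into Hangul Jamo can be done with plain arithmetic on code points, since precomposed syllables are laid out as onset x vowel x coda starting at U+AC00. The sketch below follows that layout; the function name is an assumption, and the result is the same as what NFD normalization yields for a single syllable.

```python
import unicodedata

S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first onset (leading consonant) in Hangul Jamo
V_BASE = 0x1161   # first vowel in Hangul Jamo
T_BASE = 0x11A7   # one below the first coda, so coda index 0 means "no coda"
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT

def decompose_syllable(syllable):
    """Split one precomposed Hangul syllable into Hangul Jamo characters."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    onset, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, coda = divmod(rest, T_COUNT)
    jamo = [chr(L_BASE + onset), chr(V_BASE + vowel)]
    if coda:
        jamo.append(chr(T_BASE + coda))
    return jamo

if __name__ == "__main__":
    parts = decompose_syllable("\ud55c")  # 한
    print([f"U+{ord(j):04X}" for j in parts])
    # NFC puts the jamo back together into the single syllable.
    print(unicodedata.normalize("NFC", "".join(parts)) == "\ud55c")
```

Because onset and coda consonants get distinct code points here, the decomposition is reversible, which is not the case with Hangul Compatibility Jamo.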
Orthography is notoriously difficult. For that reason,
you can't expect informal writing to obey the
rules.
A grammar checker is hard to make. (But surprisingly
there is a decent one. Check this)
As in German, many compounds are created by
merging two words without a space.
E.g., 점심시간 "lunch time" (= 점심 "lunch" + 시간
"time")
Hangul is phonetic, but the current orthography
policy respects the origin of words rather than
reflecting the sound itself. As a result, the actual
pronunciation of some words differs from their
spelling.
E.g., 독립 dok rip (spelling) -> /dong nip/
(pronunciation) "independence"
TOOL Python-jamo: Hangul syllable decomposition
and synthesis library
TOOL KoG2P
MANDARIN
CHAR SET [\p{Han}。、，！？0-9]
There are two types of commas: ， and 、.
Ideographic comma (、) is used when enumerating
items in a list (e.g., 红色、白色、黄色 "red, white, and
yellow").
No space between words
Pinyin, the standard Romanization system for
Mandarin, is used.
5 different tones are marked by diacritics in pinyin.
mā (high level)
má (rising)
mǎ (falling and rising)
mà (falling)
ma (neutral)
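Since the tone marks are ordinary combining diacritics, they can be separated from the base letters with NFD normalization. The sketch below converts a pinyin syllable to a tone-number form; the function name and the choice of writing the neutral tone as 5 are assumptions (conventions vary).

```python
import unicodedata

# Combining marks used for the four marked tones.
TONE_MARKS = {
    "\u0304": 1,  # macron: high level
    "\u0301": 2,  # acute: rising
    "\u030c": 3,  # caron: falling and rising
    "\u0300": 4,  # grave: falling
}

def pinyin_to_numbered(syllable, neutral=5):
    """Turn e.g. 'mǎ' into 'ma3'; unmarked syllables get the neutral tone number."""
    tone = neutral
    base = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)
    # Recompose what is left so that letters such as 'ü' stay intact.
    return unicodedata.normalize("NFC", "".join(base)) + str(tone)

if __name__ == "__main__":
    for s in ["m\u0101", "m\u00e1", "m\u01ce", "m\u00e0", "ma"]:
        print(s, "->", pinyin_to_numbered(s))
```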
There are two types of characters: simplified and
traditional. The former is used in the mainland,
whereas the latter is used in Taiwan and Korea.
Check this to see the list of characters that are
used differently in Chinese, Japanese, and Korean.
Typically people type pinyin to input Chinese
characters in the digital setting. The pinyin to
Chinese conversion is very important. (Check this)
TOOL pypinyin: a Python project for getting pinyin
for Chinese words or sentences
TOOL Jieba: Chinese text segmentation module
TOOL hanziconv: a tool that converts between
simplified and traditional Chinese characters
PERSIAN
CHAR SET [ \p{Arabic}.0-9،!؟]
Check Arabic
When a Zero-Width Non-Joiner (ZWNJ) is used
between two characters, it forces a final form on the
preceding character. (See this)
PORTUGUESE
CHAR SET [ \p{Latin}.?!'-0-9]
The hyphen (-) is used to make compound words
E.g., levaria + vos + os = levar-vos-ia "I would take
to you"
RUSSIAN
CHAR SET [ \p{Cyrillic}.!?'-0-9]
SPANISH
CHAR SET [ \p{Latin}.!?¿'-0-9]
¿ is used at the beginning of an interrogative
sentence, pairing with ?.
THAI
CHAR SET [ \p{Thai}.!?0-9]
No space between words
Space is used as a sentence separator or comma.
TOOL pythai: A collection of tools for working with
the Thai language in Python
VIETNAMESE
CHAR SET [ \p{Latin}.!?'-0-9]
6 different tones are marked by diacritics.
a (mid level)
à (low falling)
ả (mid falling)
ã (glottalized rising)
á (high rising)
ạ (glottalized falling)
Spaces are used to separate syllables, not words.
E.g., thuế thu nhập cá nhân -> thuế "tax" + thu_nhập
"income" + cá_nhân "individual"
INFO word segmentation tools
