This document discusses multilingual text processing: why it matters, basic text processing techniques such as tokenization and normalization, and the writing systems of different languages. Key points covered:
- Data processing is important for good experimental performance but can be tedious; multilingual processing additionally requires knowledge of the languages involved.
- Tokenization units include characters, subwords, words, and sentences, with sentence boundaries defined by punctuation.
- Normalization standardizes text and includes lemmatization, stemming, and Unicode normalization.
- Writing systems vary and include alphabets, abjads, abugidas, syllabaries, and logographic systems. The processing challenges of each are discussed for languages such as Arabic, Chinese, and Japanese.
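
As a minimal sketch of the tokenization units and Unicode normalization listed above, the following Python uses only the standard library. The sample strings, variable names, and the regex-based sentence splitter are illustrative assumptions, not the method from the document. Subword tokenization, stemming, and lemmatization are left out because they need trained models or language-specific rules.

```python
import re
import unicodedata

# Hypothetical sample text; the "e" is followed by a combining acute accent (U+0301).
text = "Cafe\u0301 is open. Is it? Yes!"

# Unicode normalization: NFC composes "e" + U+0301 into the single code point U+00E9.
nfc = unicodedata.normalize("NFC", text)

# Character units: one item per code point after normalization.
chars = list(nfc)

# Word units: whitespace splitting, which only suits scripts that put spaces between words.
words = nfc.split()

# Sentence units: split on whitespace that follows ".", "?" or "!".
# This is a naive rule; abbreviations such as "Dr." would be split incorrectly.
sentences = re.split(r"(?<=[.?!])\s+", nfc)

# Chinese is written without spaces between words, so whitespace splitting
# returns the whole string as one token; character units are one simple fallback.
chinese = "我喜欢猫"
chinese_words = chinese.split()
chinese_chars = list(chinese)

print(len(text), len(nfc))
print(words)
print(sentences)
print(chinese_words, chinese_chars)
```

The Chinese example illustrates why the choice of unit depends on the writing system: a rule that works for a space-delimited alphabet yields a single unsegmented token here, so a dedicated word segmenter or a character or subword unit is usually needed instead.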