4 Natural Language Processing-Text Normalization.pptx

Natural Language Processing
Text Normalization
& Corpus

Text Normalization
• Conversion of text that includes ‘nonstandard’ words, such as numbers,
abbreviations, and misspellings, into standard words.
Examples:
u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
"$200" would be pronounced as "two hundred dollars" in English.
• Text normalization requires being aware of what type of text is to be
normalized and how it is to be processed afterwards; there is no all-purpose
normalization procedure.

Text Normalization
• Text normalization is frequently used when converting text to speech.
• Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to
be pronounced differently depending on context.
• M, me, mein (non-standard) – mein (Hindi for "I", standard) (challenging)
• M school ja rahi h ("I am going to school")
• Me schl jaaaa ri hu (a variant spelling of the same sentence)
• OMG – oh my god (rule-based normalization)
• Gr8 – great
• $200 – two hundred dollars
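
The rule-based cases above (OMG, Gr8, $200) can be sketched in Python as a lookup table plus a currency pattern. This is only a minimal illustration, not a general-purpose normalizer: the ABBREVIATIONS table, the function names, and the restriction to whitespace-separated tokens and whole-dollar amounts below 1000 are all assumptions made for the example.

```python
import re

# Hypothetical lookup table; a real system would need a far larger lexicon.
ABBREVIATIONS = {"omg": "oh my god", "gr8": "great", "u": "you", "r": "are"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Matches tokens such as "$200" (assumption: 1 to 3 digits, no cents).
CURRENCY = re.compile(r"\$(\d{1,3})")


def spell_number(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + spell_number(rest) if rest else "")


def normalize_token(token):
    """Apply the currency rule first, then the abbreviation lookup."""
    match = CURRENCY.fullmatch(token)
    if match:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return spell_number(amount) + " " + unit
    # Unknown tokens are returned unchanged.
    return ABBREVIATIONS.get(token.lower(), token)


def normalize(text):
    """Normalize a whitespace-separated string token by token."""
    return " ".join(normalize_token(tok) for tok in text.split())


print(normalize("OMG this is gr8 and costs $200"))
# prints: oh my god this is great and costs two hundred dollars
```

Rules like these cover fixed patterns well, but they do not help with the Hindi examples, where the right expansion depends on the language and context of the message.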

Text Normalization
• Given a string of characters in a text, what is the (reasonable) set of possible
actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
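
These two questions can be read as a two-step procedure: generate candidate words for a token, then pick the one that best fits the context. The Python sketch below is one simple way to illustrate that, assuming the non-standard token was produced by dropping vowels (as in "thn") and that context is just the previous word. The vocabulary, the bigram counts, and the function names are hypothetical values invented for the example.

```python
VOWELS = set("aeiou")


def skeleton(word):
    """Return the word with its vowels removed, e.g. 'than' -> 'thn'."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def candidates(token, vocabulary):
    """Step 1: all vocabulary words whose consonant skeleton matches the token's."""
    key = skeleton(token)
    return [word for word in vocabulary if skeleton(word) == key]


def choose(prev_word, cands, bigram_counts):
    """Step 2: pick the candidate seen most often after the previous word."""
    if not cands:
        return None
    return max(cands, key=lambda w: bigram_counts.get((prev_word, w), 0))


# Hypothetical vocabulary and bigram counts, for illustration only.
vocabulary = ["than", "then", "thin", "the"]
bigram_counts = {("better", "than"): 50, ("better", "then"): 2}

cands = candidates("thn", vocabulary)
print(cands)                                    # ['than', 'then', 'thin']
print(choose("better", cands, bigram_counts))   # than
```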

What is Corpus
• A corpus is a large collection of texts. It is a body of written or spoken material
upon which a linguistic analysis is based.
• The plural form of corpus is corpora.
• Some popular corpora are the British National Corpus (BNC), the
COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.
• The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words
in Turkish, Japanese, Russian, Chinese, and other languages.
• A corpus may be composed of written language, spoken language, or both.
A spoken corpus is usually in the form of audio recordings.

Types of Corpus
• A corpus may be open or closed. An open corpus is one which does not
claim to contain all data from a specific area, while a closed corpus does
claim to contain all or nearly all data from a particular field. Medical
corpora, for example, are closed as there can be no further input to the area.
• Monolingual corpora represent only one language, while bilingual corpora
represent two languages.
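• Parallel corpus: the same texts in two or more languages, aligned so that each
passage can be matched with its translation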
• Balanced Corpus

Balanced Corpus
What should be covered in a balanced corpus?
Balanced: covers a range of text categories
• Definition depends upon the intended uses
• No true objective measure of balance
• Usually based on proportional sampling
• Balance can be based on a text typology, a classification of text types
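
Proportional sampling over a text typology can be sketched as follows. This is an illustrative Python example only: the category names, the target shares, and the function name proportional_sample are assumptions, and, as noted above, there is no objectively correct choice of shares.

```python
import random


def proportional_sample(texts_by_category, shares, total, seed=0):
    """Draw about `total` texts so that each category gets its target share.

    texts_by_category: dict mapping a category name to a list of texts
    shares: dict mapping a category name to a fraction (fractions sum to 1)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for category, share in shares.items():
        quota = round(total * share)
        pool = texts_by_category[category]
        # Never ask for more texts than the category actually contains.
        sample[category] = rng.sample(pool, min(quota, len(pool)))
    return sample


# Hypothetical typology and shares, for illustration only.
texts = {
    "news": [f"news_{i}" for i in range(100)],
    "fiction": [f"fiction_{i}" for i in range(100)],
    "academic": [f"academic_{i}" for i in range(100)],
}
shares = {"news": 0.5, "fiction": 0.3, "academic": 0.2}

balanced = proportional_sample(texts, shares, total=10)
print({category: len(docs) for category, docs in balanced.items()})
# prints: {'news': 5, 'fiction': 3, 'academic': 2}
```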

Uses of Corpus
• A corpus provides grammarians, lexicographers, and other interested parties
with better descriptions of a language.
• Computer-processable corpora allow linguists to adopt the principle of total
accountability, retrieving either all the occurrences of a particular word or
structure for inspection, or randomly selected samples of them.
• Corpus analysis provides lexical information, morphosyntactic information,
semantic information, and pragmatic information.
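
Retrieving every occurrence of a word for inspection is commonly done with a concordance (keyword-in-context) listing. The Python sketch below is a minimal version, assuming the corpus is a single string and that tokens are simply runs of word characters; the function name concordance and the width parameter are choices made for this example.

```python
import re


def concordance(corpus_text, word, width=3):
    """Return every occurrence of `word` with up to `width` tokens of context.

    Each hit is a (left context, keyword, right context) tuple.
    """
    tokens = re.findall(r"\w+", corpus_text.lower())
    target = word.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


text = "The corpus is large. A corpus may be open."
for left, keyword, right in concordance(text, "corpus", width=2):
    print(f"{left:>10} [{keyword}] {right}")
```

Because the loop visits every token, no occurrence is skipped, which is the point of total accountability. Drawing a random subset of the returned hits would give the sampled variant.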

Applications of Corpus
• Corpora are used in the development of NLP tools.
• Applications include spell-checking, grammar-checking, speech recognition,
text-to-speech and speech-to-text synthesis, automatic abstraction and
indexing, information retrieval and machine translation.
• Corpora are also used for the creation of new dictionaries and grammars for
learners.
