Natural Language Processing
Text Normalization
& Corpus
Text Normalization
• Conversion of text that contains ‘nonstandard’ words, such as numbers,
abbreviations, and misspellings, into standard words.
Example:
u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
"$200" would be pronounced as "two hundred dollars" in English.
• Text normalization requires being aware of what type of text is to be
normalized and how it is to be processed afterwards; there is no all-purpose
normalization procedure.
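The "$200" example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an all-purpose procedure: the function names `number_to_words` and `expand_dollars` are made up for this sketch, and it handles only whole dollar amounts from 0 to 999.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (limit chosen for this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text: str) -> str:
    """Replace tokens such as "$200" or "$ 200" with their spoken form."""
    def repl(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Amounts with more than three digits do not match and are left unchanged.
    return re.sub(r"\$\s?(\d{1,3})\b", repl, text)


print(expand_dollars("It costs $200."))
```

A text-to-speech front end would need many more rules of this kind (dates, decimals, other currencies), which is one reason no single normalization procedure fits every use.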
Text Normalization
• Text normalization is frequently used when converting text to speech.
• Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to
be pronounced differently depending on context.
• M, me, mein (nonstandard) - mein (Hindi, standard) (challenging)
• M school ja rahi h
• Me schl jaaaa ri hu
• OMG (rule-based normalization)
• Gr8 - great
• $ 200 - ()
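Rule-based normalization of items like "OMG" and "Gr8" can be sketched as a lookup table plus a simple rule for elongated spellings such as "jaaaa". The table entries, the function names, and the choice to collapse repeated letters to a single letter are all assumptions made for illustration; a real system would need a much larger table and language-specific rules.

```python
import re

# Hypothetical lookup table; these entries are assumptions for this sketch.
LOOKUP = {
    "omg": "oh my god",
    "gr8": "great",
    "u": "you",
    "r": "are",
}


def collapse_elongation(token: str) -> str:
    """Reduce any run of three or more identical characters to one character.

    This is a crude heuristic: it turns "jaaaa" into "ja", but it would
    also damage words that are meant to keep a doubled letter.
    """
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize(text: str) -> str:
    """Lowercase each token, collapse elongations, then apply the lookup table."""
    out = []
    for token in text.split():
        key = collapse_elongation(token.lower())
        out.append(LOOKUP.get(key, key))
    return " ".join(out)


print(normalize("OMG u r gr8"))
print(normalize("Me schl jaaaa ri hu"))
```

Tokens that are not in the table, such as "schl", pass through unchanged, which shows where a pure lookup approach runs out.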
Text Normalization
• Given a string of characters in a text, what is the (reasonable) set of possible
actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
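The two questions above suggest a two-step design: generate candidates, then pick one using context. The sketch below is a toy version under loud assumptions: the candidate set for "m" and the context cue words are invented for illustration, and the scoring is a simple overlap count rather than a trained language model.

```python
# Hypothetical candidate sets: what a nonstandard token might stand for.
CANDIDATES = {"m": ["mein", "am"]}

# Hypothetical context cues: words that tend to appear near each candidate.
CONTEXT_CUES = {
    "mein": {"ja", "rahi", "hu", "h"},
    "am": {"i", "going", "fine"},
}


def candidates(token: str) -> list[str]:
    """Step 1: return the possible standard forms (the token itself if unknown)."""
    return CANDIDATES.get(token.lower(), [token])


def choose(token: str, context: list[str]) -> str:
    """Step 2: pick the candidate that shares the most cue words with the context."""
    ctx = {w.lower() for w in context}
    options = candidates(token)
    # max() keeps the first option on ties, so list order acts as the default.
    return max(options, key=lambda c: len(CONTEXT_CUES.get(c, set()) & ctx))


def normalize(sentence: str) -> str:
    tokens = sentence.split()
    return " ".join(
        choose(tok, tokens[:i] + tokens[i + 1:]) for i, tok in enumerate(tokens)
    )


print(normalize("M school ja rahi h"))
print(normalize("I m going"))
```

The same token "m" resolves differently in the two sentences, which is the point of making the choice depend on context.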
What is Corpus
• A corpus is a large collection of texts. It is a body of written or spoken material
upon which a linguistic analysis is based.
• The plural form of corpus is corpora.
• Some popular corpora are the British National Corpus (BNC), the
COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.
• The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words
in Turkish, Japanese, Russian, Chinese, and other languages.
• A corpus may be composed of written language, spoken language, or both.
A spoken corpus is usually in the form of audio recordings.
Types of Corpus
• A corpus may be open or closed. An open corpus is one which does not
claim to contain all data from a specific area while a closed corpus does
claim to contain all or nearly all data from a particular field. Medical
corpora, for example, are closed as there can be no further input to an area.
• Monolingual corpora represent only one language while bilingual corpora
represent two languages.
• Parallel corpus
• Balanced Corpus
Balanced Corpus
What should be covered in a balanced corpus?
Balanced: covers a range of text categories
• Definition depends upon the intended uses
• No true objective measure of balance
• Usually based on proportional sampling
• Balance can be based on a text typology, a classification of text types
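Proportional sampling can be illustrated with a short sketch: each text category contributes to the sample in proportion to its share of the available texts. The function name, the category names, and the sizes below are hypothetical, and because quotas are rounded, the sample size may differ slightly from the requested total for other inputs.

```python
import random


def proportional_sample(texts_by_category, total, seed=0):
    """Draw about `total` texts, with each category's quota proportional to its size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    population = sum(len(texts) for texts in texts_by_category.values())
    sample = {}
    for category, texts in texts_by_category.items():
        quota = round(total * len(texts) / population)
        sample[category] = rng.sample(texts, min(quota, len(texts)))
    return sample


# Hypothetical text typology with made-up document identifiers.
collection = {
    "news": [f"news_{i}" for i in range(60)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(10)],
}

sample = proportional_sample(collection, total=10)
print({category: len(texts) for category, texts in sample.items()})
```

Whether proportions should follow the number of texts, the number of words, or some estimate of how much each text type is read is a design decision that depends on the intended use of the corpus.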
Uses of Corpus
• A corpus provides grammarians, lexicographers, and other interested parties
with better descriptions of a language.
• Computer-processable corpora allow linguists to adopt the principle of total
accountability, retrieving either all the occurrences of a particular word or
structure for inspection or a randomly selected sample of them.
• Corpus analysis provides lexical information, morphosyntactic information,
semantic information and pragmatic information.
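Retrieving every occurrence of a word, as in the total accountability point above, is often presented as a keyword-in-context (KWIC) concordance. The sketch below assumes a simple regex tokenizer and a hypothetical `concordance` function; it is a minimal illustration, not the method of any particular corpus tool.

```python
import re


def concordance(text, word, width=3):
    """Return (left context, keyword, right context) for every occurrence of `word`."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


text = "The corpus is large. A corpus may be open or closed."
for left, keyword, right in concordance(text, "corpus", width=2):
    print(f"{left:>10} [{keyword}] {right}")
```

Because every hit is returned, the analyst can inspect all of them or draw a random sample from the list.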
Applications of Corpus
• Corpora are used in the development of NLP tools.
• Applications include spell-checking, grammar-checking, speech recognition,
text-to-speech and speech-to-text conversion, automatic abstraction and
indexing, information retrieval and machine translation.
• Corpora are also used for the creation of new dictionaries and grammars for
learners.
