RELEVANCE OF ANNOTATED
CORPUS
Thennarasu Sakkan
Annotated text corpora are an important resource
for advances in NLP research and for developing
language technologies.
Corpora are annotated using a set of tags that mark
the linguistic properties of a word, sentence, or
discourse.
Corpora annotated with various kinds of linguistic
information are a precious resource for language
technologies, but creating them involves a large
amount of effort and time.
Therefore, it is important to create corpora that,
once created, can be used for various purposes.
Layered approach
It was proposed to follow a layered approach, in which each
layer adds one kind of linguistic information. Some of the
layers are:
Layer 1: Morphology
Layer 2: POS (morphosyntactic)
Layer 3: LWG (local word groups)
Layer 4: Chunks
Layer 5: Syntactic analysis
Layer 6: Thematic roles / predicate-argument structure
Layer 7: Semantic properties of the lexical items
Layers 8, 9, 10, 11: Word sense, pronoun referents (anaphora),
etc.
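One way to picture the layered approach is as a token that carries a separate slot for each layer, so that layers can be added at different times without disturbing the ones already present. The following is a minimal sketch in Python; the `Token` class, the layer names, and the sample tags are assumptions made for illustration and are not taken from any particular annotation tool or tagset.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A word form plus a dictionary holding one entry per annotation layer."""
    form: str
    layers: dict = field(default_factory=dict)

    def annotate(self, layer, value):
        # Adding a layer never touches the layers that are already there.
        self.layers[layer] = value

    def get(self, layer, default=None):
        return self.layers.get(layer, default)


# Build tokens first, then add layers one at a time.
tokens = [Token(w) for w in ["My", "younger", "sister", "Suguna"]]

# Layer 2 (POS): illustrative tags only, chosen for this example.
for tok, tag in zip(tokens, ["PRP", "JJ", "NN", "NNP"]):
    tok.annotate("pos", tag)

# Layer 4 (chunks): all four words belong to one noun phrase.
for tok in tokens:
    tok.annotate("chunk", "NP")

for tok in tokens:
    print(tok.form, tok.get("pos"), tok.get("chunk"), tok.get("morph"))
```

Because a layer that has not been annotated yet simply returns `None`, the same corpus can serve a POS tagger today and a parser later, which is the reuse argued for above.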
Example:
((My younger sister Suguna))_NP ((will be coming))_VP
((from Tamil Nadu))_PP ((early this month))_NP.
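The bracket notation in this example, `((words))_LABEL`, is regular enough to be read by a short program. The sketch below is one possible reader, assuming that chunks are never nested and that labels consist of capital letters only; the function name `parse_chunks` is hypothetical.

```python
import re

# One chunk: double parentheses around the words, then "_" and the label.
# [^()]* keeps a match from running across a neighbouring chunk.
CHUNK_RE = re.compile(r"\(\(([^()]*)\)\)_([A-Z]+)")


def parse_chunks(text):
    """Return a list of (label, [words]) pairs in the order they appear."""
    return [(m.group(2), m.group(1).split()) for m in CHUNK_RE.finditer(text)]


sentence = (
    "((My younger sister Suguna))_NP ((will be coming))_VP "
    "((from Tamil Nadu))_PP ((early this month))_NP."
)

chunks = parse_chunks(sentence)
for label, words in chunks:
    print(label, words)
```

For the sentence above this yields four chunks: two NPs, one VP, and one PP.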
((செவ்஬ா஦ில்_NNP))_NP ((ச஬ற்நிக஧஥ாக_RB))_RBP
((ர஧ா஬ர்_NNP ஬ிண்கனம்_NN))_NP ((஡ர஧஦ிநங்கி஦து_VF))_VP
!
(஢ாொ_NNP ஬ிஞ்ஞாணிகள்_NN))_NP ((ொ஡ரண_NN))_NP
!!_RD_SYM (Note how the exclamation marks are tagged here.)
((஢ியூ஦ார்க்_NNP))_NP :_RD_PUNC ((செவ்஬ாய்_NNP
கி஧கத்ர஡_NN ஆய்வு_NN))_NP ((செய்஬஡ற்காக_RB))_RBP
((அச஥ரிக்கா_NNP))_NP ((அனுப்தி஦_VNF))_VGNF (ர஧ா஬ர்_NNP
஬ிண்கனம்_NN))_NP ((கிட்டத்஡ட்ட_RB))_RBP ((8_TC? ஥ா஡_NN
த஦஠த்஡ிற்கு_NN))_NP ((திநகு_NST))_? இன்று_NST))_?
(06.08.12) ((ச஬ற்நிக஧஥ாக_RB))_RBP
((஡ர஧஦ிநங்கி஦து_VF))_VP ((._PUNC))_?
((஬ிண்ச஬பி_NN ஆய்வு_NN ர஥஦த்஡ில்_NN))_NP
((இது_PRP))_?? ((ஒய௃_TC ஥ிகப்_INTF சதரி஦_JJ
ர஥ல்கல்னாக_RB??))_NP?? / RBP?? ((கய௃஡ப்தடுகிநது_VF))_VP
((._PUNC))_??
((பூ஥ி஦ில்_NN))_NP ((இய௃ந்து_N_NST))_NP?/N_ST?
((சு஥ார்_RB)) ((570_TC ஥ில்னி஦ன்_NN கி.஥ீ.,_NN
ச஡ாரன஬ில்_NN))_NP ((உள்பது_VF))_VGF
((செவ்஬ாய்_NNP கி஧கம்_NNP))_NP ._PUNC
((இந்஡_DMD கி஧கத்஡ில்_NN ஊ஦ிரிணங்கள்_NN))_NP
((஬ாழ்஬஡ற்காண_VNF))_VGNF ((஌ற்ந_JJ சூ஫ல்_NN))_NP
((இய௃க்கிந஡ா_VF))_VGF ((஋ன்தது_CCS))_??
((குநித்து_PSP))_?? ((ஆய்வு_NN))_NP
((செய்஦_VINF))_VGINF ((அச஥ரிக்கா஬ின்_NNP ஢ாொ_NNP
஬ிண்ச஬பி_NNP ஆ஧ாய்ச்ெி_NNP ர஥஦ம்_NNP))_NP
((தல்ர஬று_JJ))_JJP ((ஆய்வுகரப_NN))_NP
((ர஥ற்சகாண்டு_VNF))_VGNF ((஬ய௃கிநது_VF))_VGF.
((செவ்஬ாய்_NNP கி஧கம்_NN))_NP ((ச஡ாடர்தாண_JJ))_JJP
((தடங்கரபயும்_NN))_NP ((அவ்஬ப்ரதாது_RB))_RBP
((ச஬பி஦ிட்டு_VNF ஬ய௃கிநது_VM))_VGF ._SYM
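In the annotated passage above, a `?` after a tag or chunk label marks a decision the annotator was unsure about. Such items can be collected automatically for later review. The sketch below assumes the same `((word_TAG ...))_LABEL` notation; the function name `uncertain_items` is hypothetical, and the sample uses placeholder tokens (`w1`, `w2`, ...) rather than the actual text.

```python
import re

# A chunk body in double parentheses, then "_" and whatever label follows
# (possibly just "?" or "??").
CHUNK_RE = re.compile(r"\(\(([^()]*)\)\)_(\S*)")


def uncertain_items(text):
    """Collect chunk labels and token tags that contain a question mark."""
    flagged = []
    for m in CHUNK_RE.finditer(text):
        body, label = m.group(1), m.group(2)
        if "?" in label:
            flagged.append(("chunk", body, label))
        for tok in body.split():
            # Split on the last underscore, so a compound tag such as
            # N_NST is reduced to its final part (NST).
            word, _, tag = tok.rpartition("_")
            if "?" in tag:
                flagged.append(("token", word, tag))
    return flagged


# Placeholder data in the same notation as the slides.
sample = "((w1_NN w2_NN))_NP ((w3_NST))_? ((w4_TC? w5_NN))_NP ((w6_PRP))_??"

flagged = uncertain_items(sample)
for kind, text, tag in flagged:
    print(kind, text, tag)
```

A list like this gives annotators a concrete set of open questions to settle when the tagset guidelines are revised.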
Let us take a sample of Malayalam text for chunking...

