Corpus Annotation
for corpus linguistics
Erasmus Mundus Master
in Natural Language Processing and Human Language Technology
Universitat Autònoma de Barcelona
Campus de Bellaterra, November 10 and 12, 2009
Jorge Baptista 
Universidade Algarve 
L2F Spoken Language Laboratory, INESC ID Lisboa 
jbaptis@ualg.pt
Plan 
Corpus Annotation for corpus linguistics, Jorge Baptista©2009 
 corpus linguistics 
 corpus annotation 
 before you get to work with your corpus 
 once you have your eText
 character set 
 document structure 
 DTDs 
 Evaluation of annotated corpus 
 gold-standard evaluation methods 
 Annotating a corpus for Anaphora Resolution 
 Annotating a corpus for Named Entity Recognition
 Annotation tools 
 References 
http://www.visualthesaurus.com/
Corpus linguistics 
 corpus (a definition): a large body of linguistic evidence typically composed of attested language use
 machine-readable form
 well organized collection of data
 collected within a sampling frame,
 designed for exploration of linguistic features
 balance, representativeness
 multifunctional resource, serving many different disciplines
McEnery 2003, in Mitkov (ed) 2003: 448 ff.
corpus annotation 
 corpus ‘enhanced with linguistic information’ 
 analysts (humans and/or computers) 
 linguistic analysis is imposed upon the corpus (making implicit linguistic information explicit)
 encoded by reference to a specified range of features
 advantages of corpus annotation:
 easier corpus exploitation,
 reusability, 
 multifunctionality, 
 explicit analysis
corpus annotation (continued) 
 markup 
 metadata 
 corpus information (doc id, speaker id, sex, age, date, review number and history, etc.)
 information pertaining to the text as such
 paragraphs, formatting (italics, bold)
 annotation
 linguistic information superimposed on the text
 POS, NE_tags, discourse-structure tags, referential information, syntactic tags, semantic tags (for WSD), etc.
corpus annotation (continued) 
annotation process 
 automatic (lemmatization, PoS tagging: 3% error rate) 
 semi-automatic (treebank) 
 manual (reference chains for anaphora resolution)
Before you get to work with your corpus* 
 Corpus-based approach to (computational) linguistics
 Quality of corpora → quality of RESULTS
 Methodology and procedures for corpus collection, preparation and distribution
 General remarks: the true problems and difficulties lie in the details
 text (whatever its physical medium) and eText (in any digital medium)
* Thompson 2000 in Dale et al. 2000: 385 ff.
Once you have your eText …
Preparation 
 in an ideal scenario 
 UNICODE (ISO 10646) encoding 
 SGML (ISO 8879) mark-up 
 in a real-world scenario
 raw text, different text-file types
 different sources and poor metadata
 different encodings
 no markup at all, or mixed and inconsistent markup
character set and encoding 
 characters: abstract objects, distinct from the glyphs that render them
 character set: a mapping from a set of integers (code points) to a set of characters
 encoding: a mapping from a computer-representable byte or word stream to a sequence of code points
 ASCII, UNICODE, JIS, ISO-Latin-1 (ISO 8859-1), UTF-8
 choosing, recoding, word boundaries
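The difference between code points and encoded bytes can be illustrated with a minimal Python sketch. The sample string and the helper name `recode` are assumptions chosen for illustration; recoding here simply means decoding with the source encoding and re-encoding with the target one.

```python
# Code points vs. encoded bytes, and recoding ISO-8859-1 -> UTF-8.
text = "Gr\u00e3-Bretanha"  # "Grã-Bretanha", an illustrative sample string

# Characters are identified by integer code points, independent of any encoding.
code_points = [ord(ch) for ch in text]

# An encoding maps the sequence of code points to a byte stream.
latin1_bytes = text.encode("iso-8859-1")  # one byte per character here
utf8_bytes = text.encode("utf-8")         # "ã" takes two bytes in UTF-8


def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode with the source encoding, then re-encode with the target one."""
    return data.decode(source).encode(target)


recoded = recode(latin1_bytes, "iso-8859-1", "utf-8")

print(len(text), len(latin1_bytes), len(utf8_bytes))
print(recoded == utf8_bytes)
```

Decoding with the wrong source encoding is the usual cause of garbled accented characters, which is why the encoding of every source file should be recorded in the corpus metadata.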
document structure 
 any eText already has some structure 
 words, sentences, paragraphs, quotations, headings, …
 font size and face changes
 what to annotate explicitly?
 sentence boundaries (never replace orthographic symbols; always add sentence-boundary markup alongside them)
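A minimal sketch of the advice on sentence boundaries: the boundary is added as markup while the original punctuation stays in the text. The `<s>` tag follows the TEI element name; the splitting rule is a deliberately naive assumption (it would break on abbreviations) and the function name is hypothetical.

```python
import re
from xml.sax.saxutils import escape


def add_sentence_tags(paragraph: str) -> str:
    """Wrap each sentence in <s>...</s>, keeping the original punctuation."""
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # escape() protects &, < and > that may occur in the raw text.
    return " ".join(f"<s>{escape(s)}</s>" for s in sentences if s)


tagged = add_sentence_tags("John arrived. He looked tired.")
print(tagged)
```

Because the full stops are kept, the original text can always be recovered by stripping the tags.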
document structure (continued) 
 How is explicit structural information recorded?
keep in mind: choose the most user-friendly and reusable way
1. design your own idiosyncratic annotation syntax
2. use a database
3. use a standard markup language: SGML, XML
a. public DTD (document-type definition): TEI, CES
b. design your very own DTD
document structure (continued) 
 SGML (Standard Generalized Markup Language) ISO 8879
 XML (eXtensible Markup Language)
 simplified version of SGML originally targeted at providing flexible document markup for the WWW
 low-level grammar of annotation (how is markup to be distinguished from text)
 definition of the structure of families of related documents or document types
Text Encoding Initiative (TEI) 
 Text Encoding Initiative (TEI) 
 sponsored by ACL, ALLC and ACH 
 guidelines to facilitate data exchange 
 standardizing mark-up or encoding of information stored in electronic form
 each text (document): 
 header <teiHeader> 
 body 
 each one may have several elements 
TEI 
Header <teiHeader> 
 file description <fileDesc>: full bibliographic description of an electronic file
 encoding description <encodingDesc>: relates eText to its source(s)
 text profile <profileDesc>: non-bibliographic description, languages, sublanguages, situation of production, participants and settings
 revision history <revisionDesc>: records changes made to file
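The four header components can be sketched as an empty skeleton built with Python's standard `xml.etree.ElementTree`. This is only an outline of the structure listed above, not a valid TEI header: the mandatory sub-elements required by the TEI Guidelines are omitted, and the function name and comment texts are assumptions.

```python
import xml.etree.ElementTree as ET

# Component names as listed on the slide, with a short reminder of their role.
COMPONENTS = [
    ("fileDesc", "bibliographic description of the electronic file"),
    ("encodingDesc", "relation between the eText and its source(s)"),
    ("profileDesc", "languages, participants, settings"),
    ("revisionDesc", "changes made to the file"),
]


def build_header_skeleton() -> ET.Element:
    """Return a <teiHeader> element holding the four top-level components."""
    header = ET.Element("teiHeader")
    for tag, note in COMPONENTS:
        component = ET.SubElement(header, tag)
        # A comment stands in for the real content of each component.
        component.append(ET.Comment(f" {note} "))
    return header


header = build_header_skeleton()
print(ET.tostring(header, encoding="unicode"))
```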
TEI 
 Body of document 
 <p>,<s>,<w>,<c> 
<w POS=AT0>the</w> 
simplified: <w AT0>the 
 TEI scheme may be expressed in different formal languages:
 SGML, XML (system independent) 
 XML (simplified SGML, for the web) 
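Word-level annotation of this kind is straightforward to read back once it is expressed as well-formed XML. The sketch below assumes quoted attribute values (required in XML, unlike the minimized SGML form shown above) and a made-up sentence; apart from AT0, the tag values are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence marked up at word (<w>) and punctuation (<c>) level.
sample = (
    '<s><w POS="AT0">the</w> <w POS="NN1">cat</w> '
    '<w POS="VVD">slept</w><c POS="PUN">.</c></s>'
)

sentence = ET.fromstring(sample)

# Pair every token with its part-of-speech tag.
tagged = [(el.text, el.get("POS")) for el in sentence if el.tag in ("w", "c")]

# The plain text is recovered by dropping the markup.
plain = "".join(sentence.itertext())

print(tagged)
print(plain)
```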
Corpus Encoding Standard (CES)
 Corpus Encoding Standard
 specifically designed for encoding language corpora
 EAGLES (Expert Advisory Group on Language Engineering Standards)
 TEI-compliant application of SGML 
 available both in SGML and XML (XCES) 
DTDs (document-type definitions) 
 context-free grammars of allowed tag structures
 allowed attributes for each tag
 up-translation
 consistency
 replace preexisting markup with XML
 sed, awk, Perl scripting
 record every step! (so that changes can be backtracked)
 manual post-processing → context-sensitive patches (diff)
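Up-translation with a step log can be sketched in Python instead of sed, awk or Perl. The legacy conventions below (underscores around a word for italics, blank lines between paragraphs) are assumptions made up for the example, as are the target element names and the function name; the point is that every substitution step is recorded together with its number of hits.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical legacy conventions: _word_ marks italics.
STEPS = [
    ("italics", re.compile(r"_(\w+)_"), r'<hi rend="italic">\1</hi>'),
]


def up_translate(raw: str):
    """Convert legacy markup to XML paragraphs and log every step taken."""
    log = []
    text = escape(raw)  # protect &, < and > before any markup is added
    for name, pattern, replacement in STEPS:
        text, hits = pattern.subn(replacement, text)
        log.append((name, hits))
    # Blank lines separate paragraphs in the assumed legacy format.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    log.append(("paragraphs", len(paragraphs)))
    xml = "\n".join(f"<p>{p}</p>" for p in paragraphs)
    return xml, log


xml, log = up_translate("A _fine_ day.\n\nB & C.")
print(xml)
print(log)
```

Keeping the log next to the converted file makes it possible to trace an odd tag back to the step that produced it.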
DTDs 
<?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE colHAREM [ 
<!ELEMENT colHAREM (DOC)*> 
<!ATTLIST colHAREM 
versao CDATA #REQUIRED> 
<!ELEMENT DOC (#PCDATA|ALT|EM|OMITIDO|P)*>
<!ATTLIST DOC
DOCID CDATA #REQUIRED>
<!ELEMENT P (#PCDATA|ALT|EM|OMITIDO)*> 
<!ELEMENT ALT (#PCDATA|EM|OMITIDO)*> 
<!ELEMENT EM (#PCDATA)> 
<!ATTLIST EM 
ID CDATA #REQUIRED 
CATEG CDATA #IMPLIED 
TIPO CDATA #IMPLIED 
SUBTIPO CDATA #IMPLIED 
COMENT CDATA #IMPLIED 
TIPOREL CDATA #IMPLIED 
COREL CDATA #IMPLIED 
TEMPO_REF (ENUNCIACAO|TEXTUAL) #IMPLIED 
SENTIDO (ANTERIOR|POSTERIOR|SIMULT|ANTERIOR_OU_SIMUL|POSTERIOR_OU_SIMULT) #IMPLIED 
VAL_DELTA CDATA #IMPLIED 
VAL_NORM CDATA #IMPLIED> 
<!ELEMENT OMITIDO (#PCDATA|EM)*> 
]> 
<colHAREM versao="ColeccaoSegundoHAREM-2.0"> 
<DOC DOCID="cha-73943"> 
<P> 
Dividir o IRA, eis a estratégia</P> 
<P> 
Hugo Estenssoro, em Londres</P> 
<P> 
O IRA esteve esta semana na ofensiva, paralisando o aeroporto de Londres e causando prejuízos à temporada turística britânica, com 
presença obrigatória nas grandes manchetes. As bombas não explodiram, mas o IRA matou um polícia no Ulster em frente à esposa 
grávida. Foi uma violência anunciada: o líder do Sinn Fein -- o braço político do IRA -- falara poucos dias antes num «`show' 
espectacular» como resposta à iniciativa anglo-irlandesa lançada pelos primeiros-ministros da Grã-Bretanha e da República da Irlanda 
com a sua «declaração» de 15 de Dezembro do ano passado. Mas a campanha terrorista foi só parte da resposta.</P> 
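What the DTD constrains can be illustrated with a small hand-written check in Python. This is only a partial sketch: the allowed-children and required-attribute tables are transcribed by hand from the declarations above, the two sample documents (including the `ID` and `CATEG` values) are invented, and text content and attribute values are not checked. A validating XML parser would normally do this work directly from the DTD.

```python
import xml.etree.ElementTree as ET

# Transcribed by hand from the <!ELEMENT ...> declarations above.
ALLOWED_CHILDREN = {
    "colHAREM": {"DOC"},
    "DOC": {"ALT", "EM", "OMITIDO", "P"},
    "P": {"ALT", "EM", "OMITIDO"},
    "ALT": {"EM", "OMITIDO"},
    "EM": set(),
    "OMITIDO": {"EM"},
}
# Attributes declared #REQUIRED in the <!ATTLIST ...> declarations.
REQUIRED_ATTRS = {"colHAREM": ("versao",), "DOC": ("DOCID",), "EM": ("ID",)}


def check(elem, errors=None):
    """Collect violations of the element nesting and required attributes."""
    errors = [] if errors is None else errors
    if elem.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{elem.tag}>")
        return errors
    for attr in REQUIRED_ATTRS.get(elem.tag, ()):
        if attr not in elem.attrib:
            errors.append(f"<{elem.tag}> lacks required attribute {attr}")
    for child in elem:
        if child.tag not in ALLOWED_CHILDREN[elem.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{elem.tag}>")
        check(child, errors)
    return errors


good = ET.fromstring(
    '<colHAREM versao="ColeccaoSegundoHAREM-2.0"><DOC DOCID="cha-73943">'
    '<P>Hugo Estenssoro, em <EM ID="x1" CATEG="LOCAL">Londres</EM></P>'
    "</DOC></colHAREM>"
)
bad = ET.fromstring('<colHAREM versao="v"><DOC><EM>IRA<P/></EM></DOC></colHAREM>')

print(check(good))
for message in check(bad):
    print(message)
```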
Evaluation of annotated corpus 
 machine-learning techniques 
 evaluation of NLP systems 
 analysis systems (linguistic input → abstract representation or classification)
 gold standard (‘correct’ output)
 analysis components: segmentation, tagging, information extraction and information retrieval
Hirschman and Mani (2003) in Mitkov (ed.) 2003 : 414 ff.
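Comparing system output with a gold standard is commonly summarized as precision, recall and F-measure. The slide does not name these measures, so the sketch below is an assumption about how the comparison might be scored; the annotations (character offsets plus a label) are invented.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Score predicted annotations against the gold standard (exact match)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, label).
gold = {(0, 4, "PERSON"), (20, 26, "PLACE"), (40, 43, "ORG"), (50, 58, "PLACE")}
system = {(0, 4, "PERSON"), (20, 26, "ORG"), (40, 43, "ORG")}

precision, recall, f1 = precision_recall_f1(gold, system)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact matching is the strictest choice: the second system annotation has the right span but the wrong label, so it counts as an error on both precision and recall.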
gold-standard-based measures 
gold-standard evaluation methods: 
 definition of evaluation task and an associated ‘gold-standard’ format
 annotation guidelines
 annotation and scoring tools
 validation (inter-annotator agreement)
 annotated training and test corpora
 release (data + tools)
 evaluation 
 interpretation (baseline and ceiling)
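Inter-annotator agreement in the validation step is often reported as a chance-corrected coefficient. The slide does not say which one, so Cohen's kappa for two annotators is used here as an assumption; the label sequences are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of agreeing given each annotator's label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["PER", "PER", "LOC", "LOC"]
annotator_2 = ["PER", "LOC", "LOC", "LOC"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.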
Annotating a corpus for Anaphora Resolution 
John arrived. He looked tired.
antecedent: John; anaphor: He; relation between the two: anaphora
AR (continued) 
John arrived. He looked tired. 
<NE ID=267 TYPE="person">John</NE>
arrived. 
<REF TYPE=pro COREF=267>He</REF> 
looked tired.
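Once the annotation is available as well-formed XML, each anaphor can be linked to its antecedent by following the `COREF` attribute. In the sketch below the attribute values are quoted and the fragment is wrapped in a `<text>` root element; both are assumptions needed for XML parsing, not part of the slide's notation.

```python
import xml.etree.ElementTree as ET

sample = (
    '<text><NE ID="267" TYPE="person">John</NE> arrived. '
    '<REF TYPE="pro" COREF="267">He</REF> looked tired.</text>'
)
root = ET.fromstring(sample)

# Index candidate antecedents by their ID.
antecedents = {ne.get("ID"): ne.text for ne in root.iter("NE")}

# Resolve each anaphor through its COREF pointer (None if the ID is unknown).
links = [(ref.text, antecedents.get(ref.get("COREF"))) for ref in root.iter("REF")]

print(links)
```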
AR (an exercise) 
 identification of all the markables (NPs) in a text, regardless of whether they were coreferential or not
 coref and ucoref (out of ARE)
 relations marked between entities:
 IDENTITY,
 SYNONYMY,
 GENERALISATION and
 SPECIALISATION
 indirect anaphora was not annotated (e.g. the house ... the door)
Hasler et al. (2006); Orasan et al. (2009)
task#1 Pronominal AR on pre-annotated texts 
 evaluation of pronoun resolution algorithms
 NPs annotated (known candidates)
 only pronominal NPs (PRO) were marked as referential (to be resolved)
 no influence from wrongly identified candidates
task#2 Coreferential chains on pre-annotated texts 
 cluster coreferential NPs together in coreferential chains
 all referential NPs were marked (to be resolved), not only PRO
 NPs outside coreferential chains were not annotated
 no influence from wrongly identified candidates
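Clustering pairwise coreference links into chains can be sketched with a union-find structure. This illustrates the clustering idea only and is not the method used in the exercise; the markable identifiers are invented.

```python
def build_chains(links):
    """Group markables connected by pairwise coreference links into chains."""
    parent = {}

    def find(x):
        # Return the representative of x's chain, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two chains

    chains = {}
    for markable in list(parent):
        chains.setdefault(find(markable), []).append(markable)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical pairwise links between markable identifiers.
links = [("m1", "m2"), ("m2", "m4"), ("m3", "m5")]
chains = build_chains(links)
print(chains)
```

Markables that take part in no link never enter the structure, which mirrors the decision not to annotate NPs outside coreferential chains.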
an example: NER 
www.linguateca.pt/avaliacaoconjunta
annotation tools 
 PALinkA: Perspicuous and Adjustable Links Annotator
http://clg.wlv.ac.uk/projects/PALinkA/index.php
 Alembic workbench: a natural language engineering environment for the development of tagged corpora
http://www.mitre.org/tech/alembic-workbench/
 ATLAS: Architecture and Tools for Linguistic Analysis Systems
http://www.nist.gov/speech/atlas/
 CLaRK system: an XML Based System For Corpora Development
http://www.bultreebank.org/clark/index.html
 GATE: an architecture, framework and development environment for language engineering which can also be used to annotate texts
http://www.gate.ac.uk/
 MMAX: a tool for multi-modal annotation in XML, but the new version is no longer free
http://mmax.eml-research.de/
References 
Dale, Robert; Moisl, Hermann; Somers, Harold (eds.). 2000. Handbook of Natural Language Processing. New York/Basel: Marcel Dekker, Inc.
Hajičová, E.; Abeillé, A.; Hajič, J.; Mirovský, J. 2010. Treebank annotation. In Indurkhya and Damerau (eds.) 2010, pp. 167-188.
Hasler, Laura K.; Naumann, K.; Orasan, C. 2006. Guidelines for Annotation of Within-document NP Coreference. http://clg.wlv.ac.uk/projects/NP4E/NP_guidelines_2006.pdf.
Hirschman, Lynette; Mani, Inderjeet. 2003. Evaluation. In Mitkov, Ruslan (ed.) 2003, pp. 414-429.
Indurkhya, Nitin; Damerau, Fred (eds.). 2010. Handbook of Natural Language Processing (2nd ed.). Chapman & Hall/CRC.
McEnery, Tony. 2003. Corpus Linguistics. In Mitkov, Ruslan (ed.) 2003, pp. 448-463.
McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies: An advanced resource book. Routledge.
Mitkov, Ruslan (ed.). 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Mitkov, Ruslan; Orasan, Constantin; Evans, Richard. 1999. The importance of annotated corpora for NLP: the cases of anaphora resolution and clause splitting. TALN ’99. http://clg.wlv.ac.uk/papers/mitkov-99b.pdf.
Orăsan, Constantin; Cristea, Dan; Mitkov, Ruslan; Branco, António. Anaphora Resolution Exercise: An overview. Proceedings of the 6th Language Resources and Evaluation Conference (LREC’2008), Marrakesh, Morocco, 28-30 May. http://clg.wlv.ac.uk/papers/713_paper.pdf.
Renouf, Antoinette; Kehoe, Andrew (eds.). 2009. Corpus Linguistics: Refinements and Reassessments. Amsterdam/New York: Rodopi.
Thompson, Henry S. 2000. Corpus Creation for Data-Intensive Linguistics. In Dale et al. (eds.) 2000, pp. 385-401.
Xiao, Richard. 2010. Corpus Creation. In Indurkhya and Damerau (eds.) 2010, pp. 147-166.
Resources
http://www.ldc.upenn.edu/annotation/
http://www.routledge.com/textbooks/0415286239
