Corpus annotation for corpus linguistics (nov2009)

Corpus Annotation
for corpus linguistics
Erasmus Mundus Master
in Natural Language Processing and Human Language Technology
Universitat Autònoma de Barcelona
Campus de Bellaterra, November 10 and 12, 2009
Jorge Baptista
Universidade do Algarve
L2F Spoken Language Laboratory, INESC ID Lisboa
jbaptis@ualg.pt

Plan
Corpus Annotation for corpus linguistics, Jorge Baptista©2009
 corpus linguistics
 corpus annotation
 before you get to work with your corpus
 once you got your eText
 character set
 document structure
 DTDs
 Evaluation of annotated corpus
 gold-standard evaluation methods
 Annotating a corpus for Anaphora Resolution
 Annotating a corpus for Named Entity
Recognition
 Annotation tools
 References
http://www.visualthesaurus.com/

Corpus linguistics
 corpus (a definition): a large body of linguistic evidence
typically composed of attested language use
 machine-readable form
 well organized collection of data
 collected within a sampling frame,
 designed for exploration of linguistic features
 balance, representativeness
 multifunctional resource, serve many
different disciplines
McEnery 2003, in Mitkov (ed) 2003: 448 ff.

corpus annotation
 corpus ‘enhanced with linguistic information’
 analysts (humans and/or computers)
 linguistic analysis is imposed upon the corpus
(make explicit the implicit linguistic information)
 encoded by reference to specified range of features
 advantages of corpus annotation:
 ease corpus exploitation,
 reusability,
 multifunctionality,
 explicit analysis

corpus annotation (continued)
 markup
 metadata
 corpus information (doc id, speaker id, sex, age, etc,
date, review number and history, etc.)
 information pertaining to the text as such
 paragraphs, formatting (italics, bold)
 annotation
 linguistic information superimposed to text
 POS, NE_tags, discourse-structure tags, referential
information, syntactic tags, semantic tags (for WSD),
etc.

corpus annotation (continued)
annotation process
 automatic (lemmatization, PoS tagging: 3% error rate)
 semi-automatic (treebank)
 manual (reference chains for anaphora resolution)

Before you get to work with your corpus*
 Corpus-based approach to (computational)
linguistics
 Quality of corpora > RESULTS
 Methodology and procedures for corpus
collection, preparation and distribution
 General remarks: true problems and
difficulties lie in the details
 text (whatever its support) and eText (in any
digital medium)
* Thompson 2000 in Dale et al. 2000: 385 ff.

Once you got your eText …
Preparation
 in an ideal scenario
 UNICODE (ISO 10646) encoding
 SGML (ISO 8879) mark-up
 in a real-world scenario
 raw text, different text-file types
 different sources and poor metadata,
 different encodings,
 no markup at all, or mixed and inconsistent
markup

character set and encoding
 characters: abstract objects, rendered as glyphs
 character set: mapping from a set of integers (code points) to a set of
characters
 encoding : mapping computer-representable
byte- or word-stream to sequence of code
points
 ASCII, UNICODE, JIS,
ISO-Latin-1 (ISO 8859-1), UTF-8
 choosing, recoding, word-boundaries
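
The distinction between code points and encodings, and the recoding step, can be sketched in a few lines of Python. This is only an illustration: the sample word and the helper name `recode` are assumptions made for the example, not part of the lecture.

```python
# Illustrative sketch: code points vs. encoded bytes, and recoding to UTF-8.
text = "Autònoma"

# A character set maps integers (code points) to characters.
code_points = [ord(ch) for ch in text]

# An encoding maps a sequence of code points to a byte stream (and back).
latin1_bytes = text.encode("iso-8859-1")  # one byte per character here
utf8_bytes = text.encode("utf-8")         # "ò" needs two bytes in UTF-8


def recode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes using the declared source encoding, then re-encode them.

    Hypothetical helper: it assumes the source encoding is known, which is
    often not the case for real-world files with poor metadata.
    """
    return data.decode(source).encode(target)


recoded = recode(latin1_bytes, "iso-8859-1")
```

Decoding with the wrong source encoding usually either raises an error or silently produces wrong characters, which is one reason to record the encoding of every file as it is collected.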

document structure
 any eText already has some structure
 words, sentences, paragraphs, quotations,
headings, …
 font size and face changes
 what to notate explicitly?
 sentence boundaries
(never replace orthographic symbols
but always add sentence boundaries)
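
A minimal sketch of this principle, assuming a deliberately naive rule (a sentence ends at ".", "!" or "?" followed by whitespace): the boundaries are added as `<s>` elements and the original punctuation is left untouched. The function name `mark_sentences` is hypothetical, and a real segmenter would also have to handle abbreviations, numbers and quotations.

```python
import re
from xml.sax.saxutils import escape

# Naive assumption: sentence-final punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def mark_sentences(paragraph: str) -> str:
    """Wrap each sentence in <s>...</s>, keeping all orthographic symbols."""
    sentences = [s for s in _SENTENCE_END.split(paragraph.strip()) if s]
    # escape() protects &, < and > so the output stays well-formed.
    return "\n".join(f"<s>{escape(s)}</s>" for s in sentences)


marked = mark_sentences("John arrived. He looked tired.")
```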

document structure (continued)
 How is explicit structural information
recorded?
keep in mind: choose the most user-friendly and reusable way
1. design your own idiosyncratic annotation syntax
2. use a database
3. use a standard markup language: SGML, XML
a. public DTD (document-type definition): TEI, CES
b. design your very own DTD

document structure (continued)
 SGML (Standard Generalized Markup
Language) ISO 8879
 XML (eXtensible Markup Language)
 simplified version of SGML originally targeted at
providing flexible document markup for the
WWW
 low-level grammar of annotation (how is markup
to be distinguished from text)
 definition of the structure of families of related
documents or document types

Text Encoding Initiative (TEI)
 Text Encoding Initiative (TEI)
 sponsored by ACL, ALLC and ACH
 guidelines to facilitate data exchange
 standardizing mark-up or encoding of information
stored in electronic form
 each text (document):
 header <teiHeader>
 body
 each one may have several elements

TEI
Header <teiHeader>
 file description <fileDesc> :
full bibliographic description of an electronic file
 encoding description <encodingDesc> :
relates eText to its source(s)
 text profile <profileDesc> :
non-bibliographic description: languages,
sublanguages, situation of production, participants and
settings
 revision history <revisionDesc> :
records changes made to file
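
The four header components listed above can be laid out as an empty skeleton with Python's standard `xml.etree.ElementTree` module. This is a rough sketch only: the function name is hypothetical, and the result is not a valid TEI header, since the TEI Guidelines require further content inside these elements.

```python
import xml.etree.ElementTree as ET

# The four components named on the slide, in the same order.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")


def build_header_skeleton() -> ET.Element:
    """Return an empty <teiHeader> holding the four components (sketch only)."""
    header = ET.Element("teiHeader")
    for name in HEADER_PARTS:
        ET.SubElement(header, name)
    return header


header = build_header_skeleton()
print(ET.tostring(header, encoding="unicode"))
```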

TEI
 Body of document
 <p>, <s>, <w>, <c>
<w POS=AT0>the</w>
simplified: <w AT0>the
 TEI scheme may be expressed in different
formal languages:
 SGML, XML (system independent)
 XML (simplified SGML, for the web)

Corpus Encoding Standard (CES)
 Corpus Encoding Standard
 specifically designed for encoding language
corpora
 EAGLES (Expert Advisory Group on Language
Engineering Standards)
 TEI-compliant application of SGML
 available both in SGML and XML (XCES)

DTDs (document-type definitions)
 context-free grammars of allowed tag
structures
 allowed attributes for each tag
 up-translation
 consistency
 preexisting markup >replace> XML
 sed, awk, perl scripting
 record every step ! (backtracking changes)
 manual post-processing > context-sensitive patches diff
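
The slide names sed, awk and perl for this step; the same idea is sketched here in Python. The legacy convention `[i]...[/i]` and the target `<hi rend="italic">` are invented for illustration only. The point of the sketch is that every replacement is an explicit, ordered rule and that each step is recorded, so a change can be traced back later.

```python
import re

# Hypothetical legacy conventions, invented for this example only.
# Order matters: ampersands are escaped before any markup is introduced.
RULES = [
    ("escape ampersands", re.compile(r"&"), "&amp;"),
    ("italics [i]..[/i] to <hi>", re.compile(r"\[i\](.*?)\[/i\]"),
     r'<hi rend="italic">\1</hi>'),
]


def up_translate(text, rules=RULES):
    """Apply each rule in order; return the new text and a log of every step."""
    log = []
    for description, pattern, replacement in rules:
        text, count = pattern.subn(replacement, text)
        log.append((description, count))  # record every step
    return text, log


converted, steps = up_translate("Q&A in [i]Nature[/i]")
```

Cases the rules get wrong are then handled by manual, context-sensitive patches, ideally kept as diffs against the automatically converted file.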

DTDs
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE colHAREM [
<!ELEMENT colHAREM (DOC)*>
<!ATTLIST colHAREM
versao CDATA #REQUIRED>
<!ELEMENT DOC (#PCDATA|ALT|EM|OMITIDO|P)*>
<!ATTLIST DOC DOCID CDATA #REQUIRED>
<!ELEMENT P (#PCDATA|ALT|EM|OMITIDO)*>
<!ELEMENT ALT (#PCDATA|EM|OMITIDO)*>
<!ELEMENT EM (#PCDATA)>
<!ATTLIST EM
ID CDATA #REQUIRED
CATEG CDATA #IMPLIED
TIPO CDATA #IMPLIED
SUBTIPO CDATA #IMPLIED
COMENT CDATA #IMPLIED
TIPOREL CDATA #IMPLIED
COREL CDATA #IMPLIED
TEMPO_REF (ENUNCIACAO|TEXTUAL) #IMPLIED
SENTIDO (ANTERIOR|POSTERIOR|SIMULT|ANTERIOR_OU_SIMUL|POSTERIOR_OU_SIMULT) #IMPLIED
VAL_DELTA CDATA #IMPLIED
VAL_NORM CDATA #IMPLIED>
<!ELEMENT OMITIDO (#PCDATA|EM)*>
]>
<colHAREM versao="ColeccaoSegundoHAREM-2.0">
<DOC DOCID="cha-73943">

Dividir o IRA, eis a estratégia

Hugo Estenssoro, em Londres

O IRA esteve esta semana na ofensiva, paralisando o aeroporto de Londres e causando prejuízos à temporada turística britânica, com
presença obrigatória nas grandes manchetes. As bombas não explodiram, mas o IRA matou um polícia no Ulster em frente à esposa
grávida. Foi uma violência anunciada: o líder do Sinn Fein -- o braço político do IRA -- falara poucos dias antes num «`show'
espectacular» como resposta à iniciativa anglo-irlandesa lançada pelos primeiros-ministros da Grã-Bretanha e da República da Irlanda
com a sua «declaração» de 15 de Dezembro do ano passado. Mas a campanha terrorista foi só parte da resposta.

Evaluation of annotated corpus
 machine-learning techniques
 evaluation of NLP systems
 analysis systems
(linguistic input → abstract representation or
classification)
 gold standard (‘correct’ output)
 analysis components: segmentation, tagging,
information extraction and information retrieval
Hirschman and Mani (2003) in Mitkov (ed.) 2003 : 414 ff.
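
One common way to compare system output with a gold standard is precision, recall and F1 over sets of annotations. The slide does not name these measures, so the sketch below is an assumption about how such scoring might look; the tuple format (start, end, label) and the sample values are hypothetical.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Score predicted annotations against gold ones by exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, label).
gold = {(0, 4, "PERSON"), (10, 16, "PLACE"), (20, 23, "ORG")}
predicted = {(0, 4, "PERSON"), (10, 16, "ORG")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Exact matching is the strictest choice: here the second prediction has the right span but the wrong label, so it counts as an error.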

gold-standard-based measures
gold-standard evaluation methods:
 Definition of evaluation task and an associated
‘gold-standard’ format
 annotation guidelines
 annotation and scoring tools
 validation (inter-annotator agreement)
 annotated training and test corpora
 release (data+tools),
 evaluation
 interpretation (baseline and ceiling)
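
For the validation step, inter-annotator agreement between two annotators is often reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slide does not say which coefficient was used, so this is a sketch of one possible choice; the labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical PoS labels from two annotators for four tokens.
annotator_1 = ["N", "V", "N", "N"]
annotator_2 = ["N", "V", "V", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.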

Annotating a corpus for Anaphora Resolution
John arrived. He looked tired.
antecedent: John; anaphor: He
anaphora: the relation between the anaphor and its antecedent

AR (continued)
John arrived. He looked tired.
<NE ID="267" TYPE="person">John</NE>
arrived.
<REF TYPE="pro" COREF="267">He</REF>
looked tired.
looked tired.
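
A short sketch of how such an annotation can be read back: each `REF` is linked to the `NE` whose `ID` matches its `COREF` value. The wrapping `<text>` root element and the function name are assumptions added for the example, and attribute values are quoted so that a standard XML parser accepts them.

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <text> root element.
SAMPLE = (
    '<text><NE ID="267" TYPE="person">John</NE> arrived. '
    '<REF TYPE="pro" COREF="267">He</REF> looked tired.</text>'
)


def resolve_references(xml_text):
    """Return (anaphor, antecedent) pairs by following COREF to a matching ID."""
    root = ET.fromstring(xml_text)
    entities = {ne.get("ID"): ne.text for ne in root.iter("NE")}
    # A COREF value with no matching ID yields None as the antecedent.
    return [(ref.text, entities.get(ref.get("COREF"))) for ref in root.iter("REF")]


pairs = resolve_references(SAMPLE)
```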

AR (an exercise)
 identification of all the markables (NPs) in a text
regardless of whether they were coreferential or not
 coref and ucoref (out of ARE)
 relations marked between entities:
 IDENTITY,
 SYNONYMY,
 GENERALISATION and
 SPECIALISATION
 Indirect anaphora relation was not annotated: (the
house ... the door)
Hasler et al. (2006); Orasan et al. (2008)

task#1 Pronominal AR on pre-annotated texts
 evaluation of pronoun
algorithms
 NPs annotated (known
candidates)
 only PRO NP were
marked referential (to
be resolved)
 no influence from
wrongly identified
candidates

task#2 Coreferential chains on pre-annotated texts
 cluster coreferential NPs
together in coreferential chains
 all referential NP were marked
(to be resolved), not only PRO
 NPs outside coreferential
chains were not annotated
 no influence from wrongly
identified candidates
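
Clustering coreferential NPs into chains can be sketched as grouping pairwise links into connected sets. The union-find approach and the NP identifiers below are illustrative assumptions; the slide does not describe how the exercise's participants or scorer built the chains.

```python
def build_chains(links):
    """Group pairwise coreference links into chains (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for mention in list(parent):
        chains.setdefault(find(mention), set()).add(mention)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical NP identifiers linked pairwise by an annotator or a system.
links = [("np1", "np4"), ("np4", "np7"), ("np2", "np5")]
chains = build_chains(links)
```

NPs that never appear in a link do not enter any chain, which mirrors the remark above that NPs outside coreferential chains were not annotated.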

an example: NER
www.linguateca.pt/avaliacaoconjunta

annotation tools
 PALinkA Perspicuous and Adjustable Links Annotator
http://clg.wlv.ac.uk/projects/PALinkA/index.php
 Alembic workbench a natural language engineering environment
for the development of tagged corpora
http://www.mitre.org/tech/alembic-workbench/
 ATLAS Architecture and Tools for Linguistic Analysis Systems
http://www.nist.gov/speech/atlas/
 CLaRK system an XML Based System For Corpora Development
http://www.bultreebank.org/clark/index.html
 GATE is an architecture, framework and development
environment for language engineering which can be also used to
annotate texts
http://www.gate.ac.uk/
 MMAX a tool for multi-modal annotation in XML, but the new
version is no longer free
http://mmax.eml-research.de/

References
Dale, Robert; Moisl, Hermann; Somers, Harold (eds.). 2000. Handbook of Natural Language Processing. New York/Basel:
Marcel Dekker, Inc.
Hasler, Laura K.; Naumann, K.; Orasan, C. 2006. Guidelines for Annotation of Within-document NP Coreference.
http://clg.wlv.ac.uk/projects/NP4E/NP_guidelines_2006.pdf.
Hajičová, E.; Abeillé, A.; Hajič, J.; Mirovský, J. 2010. Treebank annotation. in Indurkhya and Damerau (eds.) 2010, pp. 167-188.
Hirschman, Lynette; Mani, Inderjeet. 2003. Evaluation. in Mitkov, Ruslan (ed.) 2003, pp. 414-429.
Indurkhya, Nitin; Damerau, Fred (eds.). 2010. Handbook of Natural Language Processing (2nd ed.). Chapman
& Hall/CRC.
McEnery, Tony. 2003. Corpus Linguistics. in Mitkov, Ruslan (ed.) 2003, pp. 448-463.
McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book.
Routledge.
Mitkov, Ruslan (ed.) 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Mitkov, Ruslan; Orasan, Constantin; Evans, Richard. 1999. The importance of annotated corpora for NLP: the cases of
anaphora resolution and clause splitting. TALN ’99.
http://clg.wlv.ac.uk/papers/mitkov-99b.pdf.
Orăsan, Constantin; Cristea, Dan; Mitkov, Ruslan; Branco, António. 2008. Anaphora Resolution Exercise: An overview.
Proceedings of the 6th Language Resources and Evaluation Conference (LREC’2008), Marrakesh, Morocco, 28 – 30
May. http://clg.wlv.ac.uk/papers/713_paper.pdf.
Renouf, Antoinette; Kehoe, Andrew (eds.). 2009. Corpus Linguistics: Refinements and Reassessments. Amsterdam/New
York: Rodopi.
Thompson, Henry S. 2000. Corpus Creation for Data-Intensive Linguistics. in Dale et al. (eds.) 2000, pp. 385-401.
Xiao, Richard. 2010. Corpus Creation. in Indurkhya and Damerau (eds.) 2010, pp. 147-166.
Resources
http://www.ldc.upenn.edu/annotation/
http://www.routledge.com/textbooks/0415286239
