STANDOFF ANNOTATION FOR THE ANCIENT
GREEK AND LATIN DEPENDENCY TREEBANK
DATECH2019, 9.5.2019
Giuseppe G. A. Celano
DFG PROJECT
1. Revise: correct the errors
2. Standardize: convert the AGLDT to standoff annotation in PAULA XML (and
convert it into UD)
‣ standoff for multiple annotations and/or multiple interpretations of the
same token
‣ standoff to overcome the problem of conflicting hierarchies
3. Expand: add new annotations
(https://git.informatik.uni-leipzig.de/celano/agldt1)
THE AGLDT
▸ Ancient Greek texts: 557,922 tokens
▸ Latin texts: 79,697 tokens
▸ available on GitHub/GitLab:
▸ https://perseusdl.github.io/treebank_data/
▸ https://git.informatik.uni-leipzig.de/celano/agldt1
LABELED DIRECTED ACYCLIC GRAPHS
THE PERSEUS TREEBANK (LAST RELEASE, 2.1)
▸ 12 texts
composition date   text                         token number
63 BC              Cicero, In Catilinam         6,652
51 BC              Caesar, De Bello Gallico     1,556
post 44 BC         Sallust, Bellum Catilinae    13,191
ca 25 BC           Prop., Elegiae               5,297
29-19 BC           Vergil, Aeneid               2,839
ca 8 AD            Ov., Metamorphoses           5,209
14 AD              Aug., Res Gestae             3,035
15-50 AD           Ph., Fabulae                 6,588
ca 100 AD?         Petr., Satyricon             14,177
ca 100-110 AD      Tac., Historiae              3,531
117-138 AD         Suet., Vita Divi Augusti     8,313
ca 400 AD          Ger., Vulgata                9,309
TREEBANK PIPELINE
start
▸ choose a TEI/XML text
▸ preliminary automatic annotation
‣ tokenize it
‣ rule-based POS tagger/parser
▸ manual correction
end
THE PERSEUS TREEBANK: TEI XML TEXT
THE PERSEUS TREEBANK: INLINE ANNOTATION
INLINE ANNOTATION: ADVANTAGES
1. easy to add
2. easy to query
3. well supported by annotation tools
INLINE ANNOTATION: DISADVANTAGES
1. the tokenized text becomes the new base text
2. after text extraction from a TEI text, the link to the original text is virtually
lost (e.g., amabam-que and the content of some editorial markup)
3. it is infeasible to connect such base texts to other annotation layers with
different tokenization schemes. For example:
‣ amabamque: one phonetic word
‣ amabam-que: two syntactic words
‣ am-a-ba-m-que: five morphemes
‣ verse vs. sentence
STANDOFF ANNOTATION
1. each annotation layer is attached separately to the original text
(i.e., the base text).
2. an annotation layer references the original text or another
annotation layer which references the original text
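
A minimal sketch of the idea in Python, assuming character offsets as the referencing mechanism. The base text is never modified; each layer is only a list of (start, end) spans over it. The layer names and the `resolve` helper are hypothetical and chosen for illustration.

```python
# Base text: left untouched by every annotation layer.
BASE_TEXT = "amabamque"

# Each layer is a list of (start, end) character offsets into BASE_TEXT,
# with the end offset exclusive. Layer names are illustrative only.
LAYERS = {
    "phonetic_words": [(0, 9)],
    "syntactic_words": [(0, 6), (6, 9)],
    "morphemes": [(0, 2), (2, 3), (3, 5), (5, 6), (6, 9)],
}


def resolve(layer_name):
    """Return the substrings of BASE_TEXT that a layer's spans point to."""
    return [BASE_TEXT[start:end] for start, end in LAYERS[layer_name]]


for name in LAYERS:
    print(name, resolve(name))
```

Because all three layers point into the same unmodified string, the competing segmentations (one phonetic word, two syntactic words, five morphemes) coexist without any of them becoming a new base text.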
STANDOFF ANNOTATION: PAULA XML
1. Open format based on the principles of LAF (ISO 24612:2012)
2. already employed in a number of historical language corpora
3. the base text is a bare XML text, which is referenced almost exclusively
via offsets
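
The offset-based referencing can be sketched as follows. This is a simplified approximation, not a validated PAULA file: the element names (`markList`, `mark`) and the `string-range` pointer syntax are modelled on commonly shown PAULA examples and should be checked against the PAULA specification, and the file name `doc.text.xml` and the helper `make_mark_list` are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("xlink", XLINK)


def make_mark_list(spans, base_file, layer_type="tok"):
    """Build a simplified PAULA-style markList from 0-based (start, end) spans."""
    root = ET.Element("paula", {"version": "1.1"})
    mark_list = ET.SubElement(
        root,
        "markList",
        {"type": layer_type, f"{{{XML_NS}}}base": base_file},
    )
    for number, (start, end) in enumerate(spans, start=1):
        # Assumption: string-range takes a 1-based start position and a length.
        pointer = f"#xpointer(string-range(//body,'',{start + 1},{end - start}))"
        ET.SubElement(
            mark_list,
            "mark",
            {"id": f"{layer_type}_{number}", f"{{{XLINK}}}href": pointer},
        )
    return root


# Two syntactic words over the base text "amabamque".
tokens = make_mark_list([(0, 6), (6, 9)], "doc.text.xml")
print(ET.tostring(tokens, encoding="unicode"))
```

The token file contains no text of its own, only pointers, so a second tokenization can be stored in a separate file that points at the same base text.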
THE CASE STUDY: CAESAR’S DE BELLO CIVILI
1. the base text is a ‘complex’ TEI XML file
‣ reference is made via XPath expressions coinciding with CTS divisions
(https://git.informatik.uni-leipzig.de/celano/latinnlp/tree/master/case-study)
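
A minimal sketch of XPath-based reference into a TEI file, assuming that CTS passages such as "1.1" map onto nested `div` elements typed as book and chapter. The TEI fragment below is a hypothetical, heavily simplified stand-in for the case-study file, and `passage_xpath` and `resolve` are illustrative helpers, not the project's code.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI fragment (not the case-study file).
TEI_SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="textpart" subtype="book" n="1">'
    '<div type="textpart" subtype="chapter" n="1">'
    "<p>Litteris C. Caesaris consulibus redditis</p>"
    "</div></div></body></text></TEI>"
)


def passage_xpath(passage):
    """Turn a CTS-style passage reference such as "1.1" into an XPath."""
    book, chapter = passage.split(".")
    return (
        f".//tei:div[@subtype='book'][@n='{book}']"
        f"/tei:div[@subtype='chapter'][@n='{chapter}']"
    )


def resolve(root, passage, start, end):
    """Return characters [start, end) of the text inside the addressed division."""
    division = root.find(passage_xpath(passage), TEI_NS)
    if division is None:
        raise KeyError(f"no division for passage {passage}")
    return "".join(division.itertext())[start:end]


root = ET.fromstring(TEI_SAMPLE)
print(resolve(root, "1.1", 12, 20))
```

In a real TEI file, whitespace and nested editorial markup inside the division also count toward the offsets, which is one reason text extraction may need edition-specific handling.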
TOKENIZATION/WORD SEGMENTATION
▸ Latin: rule-based
▸ select the text to annotate from the TEI XML file
▸ identify abbreviations (word list + regular expressions)
▸ Cn. = Gnaeus
▸ list of not-to-tokenize words (e.g., Antigone, aeque)
▸ tokens ending with ne/que/ve
▸ list of to-tokenize words (e.g., nequis, nobiscum)
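
The rules above can be sketched as a small rule-based tokenizer. The word lists are tiny illustrative samples, and the splits assumed for `nequis` and `nobiscum` are assumptions made for this example; with such short lists the enclitic rule over-splits any other word that merely ends in ne/que/ve.

```python
import re

# Illustrative samples only; real lists would be far longer.
ABBREVIATIONS = {"Cn.": "Gnaeus"}        # abbreviation -> expansion
DO_NOT_SPLIT = {"antigone", "aeque"}     # end in ne/que/ve but are single words
FORCE_SPLIT = {                          # assumed splits, for illustration
    "nequis": ["ne", "quis"],
    "nobiscum": ["nobis", "cum"],
}
ENCLITIC = re.compile(r"^(.+?)(ne|que|ve)$", re.IGNORECASE)


def split_word(word):
    """Apply the word lists first, then the generic enclitic rule."""
    key = word.lower()
    if key in DO_NOT_SPLIT:
        return [word]
    if key in FORCE_SPLIT:
        return list(FORCE_SPLIT[key])
    match = ENCLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]


def tokenize(text):
    """Split on whitespace, protect abbreviations, then detach enclitics."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:       # keep "Cn." together with its period
            tokens.append(chunk)
            continue
        word = chunk.rstrip(".,;:?!")
        punctuation = chunk[len(word):]
        if word:
            tokens.extend(split_word(word))
        tokens.extend(punctuation)       # one token per punctuation mark
    return tokens


print(tokenize("Cn. Pompeius amabamque aeque nobiscum."))
```

The order of the checks matters: the exception lists are consulted before the generic ne/que/ve rule, so listed words are never handed to the regular expression.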
PAULA: TEI BASE TEXT
PAULA: TOKENIZATION
PAULA: SENTENCE SPLIT
CURRENT CHALLENGES
▸ extraction of text from TEI texts may require different scripts
▸ what is the ideal tokenization/word segmentation?
▸ annotation tools do not support standoff annotation
▸ lack of support for XPointer
THANK YOU FOR YOUR ATTENTION!
