Poio API and GrAF-XML
A radical stand-off approach in
language documentation and language typology
Jonathan Blumtritt, Cologne Center for eHumanities, University of Cologne
Peter Bouda, Centro Interdisciplinar de Documentação Linguística e Social
Felix Rau, Department of Linguistics, University of Cologne
Overview
● Existing infrastructure and workflows
● CLARIN
● Annotation graphs
● GrAF and Poio API
● Example: Elan EAF to GrAF-XML
● CLASS
Fieldwork
Photos
Existing Infrastructure
LD tools and standards
● Elan: EAF, MPEG, WAV
● Toolbox: TXT, XML, WAV
● Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
● Praat: XML, WAV
● ...
● No standards for tier hierarchies, tier names or
annotation schemes
● Efforts in ISOcat
CLARIN
● European initiative within the European Research
Infrastructure Consortium: Common Language Resources
and Technology Infrastructure (CLARIN)
● aims at providing easy and sustainable access for scholars in
the humanities and social sciences to digital language data
● Started in 2006, part of a roadmap process, timeline currently
ending 2020
● CLARIN-D: working groups in Germany
● Curation projects for different research areas in linguistics
Annotation Graphs
● the underlying data model for linguistic annotations
● pivot structure for linguistic data
● time vs. byte offsets
● not hierarchical (but trees are also graphs)
● stand-off annotation
● "It is important to recognize that translation into AGs does
not magically create compatibility among systems whose
semantics are different." [Bird & Liberman 2001]
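A minimal sketch of what stand-off annotation means in practice, assuming text as primary data and character offsets as anchors (for audio or video, the offsets would be time values instead). The class name, the helper and the sample data are hypothetical and only illustrate the idea that annotations live outside the primary data and point into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    start: int   # offset into the primary data (here: character index)
    end: int     # exclusive end offset
    label: str   # annotation layer, e.g. "word"
    value: str   # annotation content


def covered_text(primary: str, ann: StandoffAnnotation) -> str:
    """Resolve an annotation against the untouched primary data."""
    return primary[ann.start:ann.end]


# Hypothetical primary data and annotations; the primary data is never modified.
primary_data = "so it goes"
annotations = [
    StandoffAnnotation(0, 2, "word", "ADV"),
    StandoffAnnotation(3, 5, "word", "PRON"),
    StandoffAnnotation(6, 10, "word", "VERB"),
]
resolved = [(covered_text(primary_data, a), a.value) for a in annotations]
```

Because the annotations only carry offsets, several annotation sets can refer to the same primary data without touching it.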
AGs visualized
GrAF
● GrAF: Graph Annotation Framework
● ISO 24612: Language resource management - Linguistic
annotation framework (LAF)
● Started as stand-off version of XCES
● API and representation as data structures, not a file format
● GrAF/XML as XML representation
● Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)
● Nodes, edges, regions, annotations, feature structures
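The five entity types listed above could be modelled roughly as follows. This is a sketch under our own assumptions: the class and field names are hypothetical and are not the data structures of any existing GrAF library, and the toy data is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    id: str
    anchors: Tuple[int, int]  # start and end offsets into the primary data


@dataclass
class Node:
    id: str
    region_ids: List[str] = field(default_factory=list)  # links to regions


@dataclass
class Edge:
    id: str
    from_id: str
    to_id: str


@dataclass
class Annotation:
    id: str
    label: str
    ref: str  # id of the annotated node
    features: Dict[str, str] = field(default_factory=dict)  # feature structure


@dataclass
class Graph:
    regions: Dict[str, Region] = field(default_factory=dict)
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)

    def children(self, node_id: str) -> List[str]:
        """Ids of nodes reached by an edge starting at node_id."""
        return [e.to_id for e in self.edges if e.from_id == node_id]

    def features_of(self, node_id: str) -> Dict[str, str]:
        """Merge the feature structures of all annotations on a node."""
        merged: Dict[str, str] = {}
        for a in self.annotations:
            if a.ref == node_id:
                merged.update(a.features)
        return merged


# Hypothetical toy data: an utterance node with one word node below it.
g = Graph()
g.regions["r1"] = Region("r1", (0, 500))
g.nodes["n_utt"] = Node("n_utt", ["r1"])
g.nodes["n_word"] = Node("n_word", ["r1"])
g.edges.append(Edge("e1", "n_utt", "n_word"))
g.annotations.append(Annotation("a1", "word", "n_word", {"annotation_value": "so"}))
```

Nodes carry no content themselves: the content sits in annotations that reference a node, and the position in the primary data sits in regions that a node links to.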
TEI and GrAF
● Schemata for GrAF created with TEI Roma
● Customized version of TEI P5 schema
● ODD: "One Document Does it all"
● GrAF is not TEI compliant
● Share data types and feature structures of annotations
● TEI has "stand-off" variant, uses XPointer/XLink
– Primary data has to be XML
Why we use GrAF
● Because it's new! :-)
● No inline markup
● Radical stand-off approach
– Easier to share and manage data
– Preferred solution to archive cultural heritage
– Ideal for sparse annotations
● Existing code: Java and Python
● The beauty of annotation graphs
Poio API
● Think of GrAF as an assembly language for linguistic
annotation; then Poio API is a library to map from and to
higher-level languages
● Subset of GrAF to represent tier based annotation
● Filters and filter chains for search
● Plugin mechanism for file formats
– Mapping semantics: tiers and annotations to nodes and edges
● Meta-data for additional information (tier types etc.)
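The idea of filters and filter chains for search can be sketched generically as below. This is not Poio API's own filter interface: the type aliases, `tier_contains`, `apply_chain` and the sample records are all hypothetical and only show how several tier-based conditions might be combined.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]            # tier name -> annotation value
Filter = Callable[[Record], bool]  # a filter accepts or rejects one record


def tier_contains(tier: str, needle: str) -> Filter:
    """Build a filter that matches when `needle` occurs in the given tier."""
    return lambda record: needle in record.get(tier, "")


def apply_chain(records: List[Record], chain: List[Filter]) -> List[Record]:
    """Keep only the records accepted by every filter in the chain."""
    return [r for r in records if all(f(r) for f in chain)]


# Hypothetical records, one per utterance, keyed by tier name.
records = [
    {"utterance": "so it goes", "part_of_speech": "ADV PRON VERB"},
    {"utterance": "it goes", "part_of_speech": "PRON VERB"},
    {"utterance": "so what", "part_of_speech": "ADV PRON"},
]
chain = [tier_contains("utterance", "so"), tier_contains("part_of_speech", "VERB")]
hits = apply_chain(records, chain)
```

Each filter looks at one tier only, so a chain expresses a search across several tiers of the same annotation unit.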
Example: Mapping of EAF to GrAF-XML
Elan EAF
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words"
PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words">
<ANNOTATION>
<ALIGNABLE_ANNOTATION ANNOTATION_ID="a23"
TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
<ANNOTATION_VALUE>so</ANNOTATION_VALUE>
</ALIGNABLE_ANNOTATION>
</ANNOTATION>
<ANNOTATION>
[...]
</ANNOTATION>
</TIER>
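To make the structure of the EAF tier explicit, the snippet above can be read with Python's standard `xml.etree.ElementTree`. This is a sketch, not how Poio API parses EAF: the function name `read_tier` and the returned dictionary layout are our own assumptions, and the elided second annotation ("[...]") is left out.

```python
import xml.etree.ElementTree as ET

# The tier from the slide, without the elided second annotation.
EAF_TIER = """
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words"
      PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words">
  <ANNOTATION>
    <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23"
        TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
      <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION>
  </ANNOTATION>
</TIER>
"""


def read_tier(xml_text: str) -> dict:
    """Collect tier attributes and time-aligned annotations of one TIER."""
    tier = ET.fromstring(xml_text.strip())
    annotations = []
    for ann in tier.iter("ALIGNABLE_ANNOTATION"):
        annotations.append({
            "id": ann.get("ANNOTATION_ID"),
            "start_slot": ann.get("TIME_SLOT_REF1"),
            "end_slot": ann.get("TIME_SLOT_REF2"),
            "value": ann.findtext("ANNOTATION_VALUE", default=""),
        })
    return {
        "tier_id": tier.get("TIER_ID"),
        "linguistic_type": tier.get("LINGUISTIC_TYPE_REF"),
        "parent": tier.get("PARENT_REF"),
        "annotations": annotations,
    }


tier = read_tier(EAF_TIER)
```

The annotation refers to time slots ("ts4", "ts6") rather than to time values; the values themselves are defined elsewhere in the EAF file and are not part of this snippet.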
GrAF entities
GrAF structure
GrAF-XML
<node xml:id="words..W-Words..na23">
<link targets="words..W-Words..ra23"/>
</node>
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..n8" to="words..W-Words..na23"
xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23"
xml:id="a23">
<fs>
<f name="annotation_value">so</f>
</fs>
</a>
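The GrAF-XML fragment above can be followed from node to region, parent and feature structure with the standard library. This is a sketch under assumptions: the enclosing `<graph>` element is added here only to obtain one well-formed document, and the GrAF namespace that a real file would declare is omitted.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment from the slide, wrapped in a <graph> root (assumption, no namespace).
FRAGMENT = """
<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>
"""

root = ET.fromstring(FRAGMENT.strip())

# Region id -> (start, end) anchors.
regions = {
    r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
    for r in root.iter("region")
}

node = root.find("node")
node_id = node.get(XML_ID)

# Follow the node's link to its region to get the anchors.
start, end = regions[node.find("link").get("targets")]

# The edge pointing at the node names its parent in the tier hierarchy.
parent = next(e.get("from") for e in root.iter("edge") if e.get("to") == node_id)

# Feature structure of the annotation(s) that reference the node.
features = {
    f.get("name"): f.text
    for a in root.iter("a") if a.get("ref") == node_id
    for f in a.iter("f")
}
```

Compared with the EAF tier, the annotation value, its position and its place in the hierarchy are now held in separate elements that are connected only through ids.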
Tier hierarchies
[
['utterance..K-Spch'],
['utterance..W-Spch',
['words..W-Words',
['part_of_speech..W-POS']
],
['phonetic_transcription..W-IPA']
],
['gestures..W-RGU',
['gesture_phases..W-RGph',
['gesture_meaning..W-RGMe']
]
],
['gestures..K-RGU',
['gesture_phases..K-RGph',
['gesture_meaning..K-RGMe']
]
]
]
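The nested list above encodes each tier as a list whose first item is the tier name and whose remaining items are its child tiers. A short sketch of how such a structure can be flattened into parent/child pairs; the function name `parent_child_pairs` is our own and not part of Poio API.

```python
tier_hierarchy = [
    ['utterance..K-Spch'],
    ['utterance..W-Spch',
        ['words..W-Words',
            ['part_of_speech..W-POS']
        ],
        ['phonetic_transcription..W-IPA']
    ],
    ['gestures..W-RGU',
        ['gesture_phases..W-RGph',
            ['gesture_meaning..W-RGMe']
        ]
    ],
    ['gestures..K-RGU',
        ['gesture_phases..K-RGph',
            ['gesture_meaning..K-RGMe']
        ]
    ]
]


def parent_child_pairs(tiers, parent=None):
    """Yield (parent, child) pairs; top-level tiers have parent None."""
    for entry in tiers:
        name, children = entry[0], entry[1:]
        yield parent, name
        yield from parent_child_pairs(children, name)


pairs = list(parent_child_pairs(tier_hierarchy))
top_level = [child for parent, child in pairs if parent is None]
```

The pair `('utterance..W-Spch', 'words..W-Words')` matches the `PARENT_REF="W-Spch"` attribute of the `W-Words` tier in the EAF example.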
The code
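import poioapi.annotationgraph
import poioapi.io.graf

# Annotation graph object (not used by the conversion below)
ag = poioapi.annotationgraph.AnnotationGraph()
# Parse the Elan file and hand the result to a GrAF writer
parser = poioapi.io.ElanParser("example.eaf")
writer = poioapi.io.graf.Writer()
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
# Write the GrAF output; example.hdr is the name of the header file
converter.write("example.hdr")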
Analysis workflows
● Graph-based methods
● Pipe to scientific Python libraries
● GrAF connectors for major linguistic workflow
tools (GATE and Apache UIMA)
● Example: Polysemy in dictionaries
● Example: Counting word orders
CLASS
Thank you for your attention!
pbouda@cidles.eu
Links
CLARIN curation project:
http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
Poio API:
http://media.cidles.eu/poio/poio-api/
GrAF:
http://www.xces.org/ns/GrAF/1.0/
CLASS:
http://class.uni-koeln.de

Poio API and GrAF-XML @ Balisage 2013
