This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015.
Corpus Annotation with Linked Open Data
John P. McCrae and Thierry Declerck
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015 and No 825182
Summary
• Inline and Stand-off annotation
• Web Annotation/Open Annotation
• NLP Interchange Format
• CoNLL-RDF
Why annotate?
Ontologies capture facts about concepts, not the usage of words.
Lexicons capture facts about patterns and systems of usage.
Sometimes we wish to capture data about specific usage.
Inline annotation
Typically with XML
<div type="essay">
<head>An Essay on Summer</head>
<p>Summer school in <date when="1990">MCMXC</date> was never easy;
it went by too quickly and left us wanting more.</p>
<p>But, as my friend <name type="person">Peter</name> said with his
inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>,
<said>It never pays to think too hard</said>. Or, as I would rather
put it, <quote xml:lang="es">Que sera, sera</quote>.</p>
</div>
Pros:
• Easy and quick to do
Cons:
• Limited expressivity
• Complicates source document
• Annotations cannot be added later
Stand-off Annotation
[Figure: a source document alongside a separate annotation file holding Annotation 1 to Annotation 4]
Web Annotation
Annotation recommendation from W3C:
https://www.w3.org/TR/annotation-model/
Web Annotation: Target and Body
• body
  • element containing the annotation
  • object property: oa:hasBody (any RDF object)
  • datatype property: oa:bodyValue (strings)
• target
  • element being annotated
  • any RDF object, including an oa:Selector (more in a second)
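The body/target split can be sketched in Python as a plain dictionary in the JSON-LD shape of the W3C model, where the keys "body", "bodyValue" and "target" stand for oa:hasBody, oa:bodyValue and oa:hasTarget. This is a minimal sketch: the helper name make_annotation and the example.org identifiers are assumptions made for illustration, and requiring exactly one kind of body is a choice of this sketch, not a rule of the model.

```python
import json


def make_annotation(anno_id, target, body=None, body_value=None):
    """Build a minimal Web Annotation as a JSON-LD-style dict.

    `body` is a resource description (oa:hasBody); `body_value` is a
    plain string (oa:bodyValue). This sketch insists on exactly one.
    """
    if (body is None) == (body_value is None):
        raise ValueError("give exactly one of body or body_value")
    anno = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "target": target,
    }
    if body is not None:
        anno["body"] = body
    else:
        anno["bodyValue"] = body_value
    return anno


# Hypothetical identifiers, used only for illustration.
simple = make_annotation(
    "http://example.org/anno1",
    "http://example.org/doc.txt",
    body_value="PERSON",
)
structured = make_annotation(
    "http://example.org/anno2",
    "http://example.org/doc.txt",
    body={"type": "TextualBody", "value": "PERSON", "format": "text/plain"},
)

print(json.dumps(simple, indent=2))
print(json.dumps(structured, indent=2))
```

The string form is the shorter of the two; the structured form leaves room for a format, a language or other metadata on the body.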
Selector Types
• oa:FragmentSelector
  • Uses the IRI fragment specification defined by the representation's media type.
• oa:TextQuoteSelector
  • Describes a range of text by copying it, and including some of the text immediately before (a prefix) and after (a suffix) it to distinguish it.
• oa:TextPositionSelector
  • Describes a range of text by recording the start and end positions.
• oa:DataPositionSelector
  • Describes a range of data by recording the start and end positions of the selection.
• oa:SvgSelector
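How the two text selectors pick out a span can be sketched in Python. This is a rough illustration, not a conforming implementation: the function names and the sample sentence are assumptions, positions are counted in Python string characters, and the quote selector simply returns every place where prefix + exact + suffix occurs.

```python
def resolve_text_position(text, start, end):
    """TextPositionSelector: 0-based, start inclusive, end exclusive."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("selector positions fall outside the text")
    return text[start:end]


def resolve_text_quote(text, exact, prefix="", suffix=""):
    """TextQuoteSelector: (start, end) of every match of `exact` whose
    surrounding text equals the given prefix and suffix."""
    needle = prefix + exact + suffix
    spans = []
    pos = text.find(needle)
    while pos != -1:
        start = pos + len(prefix)
        spans.append((start, start + len(exact)))
        pos = text.find(needle, pos + 1)
    return spans


# Hypothetical sample text, used only for illustration.
sample = "James Baker met Mr. Baker in Baker Street."

print(resolve_text_position(sample, 0, 11))
print(resolve_text_quote(sample, "Baker"))
print(resolve_text_quote(sample, "Baker", prefix="Mr. "))
```

Without a prefix or suffix the quote selector matches all three occurrences of "Baker"; adding the prefix "Mr. " narrows it to one.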
Web Annotation Example
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix dc11: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/name_example>
    a oa:Annotation ;
    oa:hasBody [
        a oa:TextualBody ;
        dc11:format "text/plain"^^xsd:string ;
        rdf:value "PERSON"^^xsd:string ] ;
    oa:hasTarget [
        oa:hasSelector [
            a oa:TextQuoteSelector ;
            oa:exact "James Baker"^^xsd:string ] ;
        oa:hasSource <https://catalog.ldc.upenn.edu/.../06/wsj_0655.name> ] .
Web Annotation Example
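[Figure: the same annotation drawn as an RDF graph. name_example (an oa:Annotation) has a body (hasBody: an oa:TextualBody with format "text/plain" and value "PERSON") and a target (hasTarget: source https://catalog.ldc.upenn.edu/.../06/wsj_0655.name, with hasSelector pointing to an oa:TextQuoteSelector whose exact value is "James Baker").]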
Web Annotation Summary
• relatively good uptake
• reification
  • annotation as n:m relation between bodies & targets
  • with metadata
• powerful
  • annotate all instances of a string at once using a
• very verbose
  • previous example uses 10 triples
NLP Interchange Format
• String URIs
  • e.g., in a web document
  • can be directly used as object of oa:hasTarget
• simple ontology of linguistic data structures
  • for selected, typical NLP annotations
  • not covering all you ever need for linguistic annotations ;)
RFC 5147
Allows URIs to refer to fragments in text
Character Offsets:
https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#char=19,30
Line Offsets:
https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#line=0
Integrity Checks:
https://.../wsj_0655.txt#char=19,30;md5=67f60186fe687bb898ab7faed17dd96a
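Resolving a character-offset fragment can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it only handles the "char=start,end" form with an optional ";md5=" check, it assumes the document is UTF-8 encoded when computing the digest, and the function name and sample text are made up for the example. RFC 5147 itself allows more forms (single positions, open ranges, line offsets, length checks) that are not covered here.

```python
import hashlib
import re

# Only the "char=<start>,<end>" form with an optional md5 check is handled.
_CHAR_FRAGMENT = re.compile(r"^char=(\d+),(\d+)(?:;md5=([0-9a-fA-F]{32}))?$")


def resolve_char_fragment(text, fragment):
    """Return the substring selected by a fragment such as 'char=19,30'.

    Positions count the gaps between characters, starting at 0, so the
    selection corresponds to text[start:end].
    """
    match = _CHAR_FRAGMENT.match(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    expected_md5 = match.group(3)
    if expected_md5 is not None:
        # Assumption: the digest is taken over the UTF-8 bytes of the whole text.
        actual = hashlib.md5(text.encode("utf-8")).hexdigest()
        if actual != expected_md5.lower():
            raise ValueError("integrity check failed: document has changed")
    if not start <= end <= len(text):
        raise ValueError("fragment range falls outside the text")
    return text[start:end]


# Hypothetical document text, used only for illustration.
document = "Stand-off annotation keeps the source text untouched."
digest = hashlib.md5(document.encode("utf-8")).hexdigest()

print(resolve_char_fragment(document, "char=0,9"))
print(resolve_char_fragment(document, f"char=0,9;md5={digest}"))
```

The integrity check matters for stand-off annotation: if the source text is edited, offsets silently point at the wrong characters unless something like the digest catches the change.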
NLP Interchange Format
Web Annotation + NIF
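[Figure: RDF graph. name_example (an oa:Annotation) has a body (hasBody: an oa:TextualBody with format "text/plain" and value "PERSON") and a target (hasTarget) that is the string URI https://.../wsj_0655.name#char=2,22, typed nif:String.]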
NLP Interchange Format
• Slightly simpler method of reference
• Saves some triples
  • but still very verbose
• Less standardised and supported than just Web Annotation
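The saving can be made concrete with a small Python sketch that lists the triples for both styles as plain tuples and counts them. The document URI and the offsets are hypothetical, blank nodes are written as "_:name" strings, and no RDF library is involved; the point is only the difference in size.

```python
ANNO = "http://example.org/name_example"
SOURCE = "https://example.org/doc.txt"  # hypothetical source document


def body_triples():
    """Triples shared by both styles: the annotation and its textual body."""
    return [
        (ANNO, "rdf:type", "oa:Annotation"),
        (ANNO, "oa:hasBody", "_:body"),
        ("_:body", "rdf:type", "oa:TextualBody"),
        ("_:body", "dc11:format", '"text/plain"'),
        ("_:body", "rdf:value", '"PERSON"'),
    ]


def selector_triples(exact):
    """Web Annotation style: target node with source and quote selector."""
    return body_triples() + [
        (ANNO, "oa:hasTarget", "_:target"),
        ("_:target", "oa:hasSource", SOURCE),
        ("_:target", "oa:hasSelector", "_:selector"),
        ("_:selector", "rdf:type", "oa:TextQuoteSelector"),
        ("_:selector", "oa:exact", f'"{exact}"'),
    ]


def string_uri_triples(start, end):
    """NIF style: the target is a string URI with character offsets."""
    target = f"{SOURCE}#char={start},{end}"
    return body_triples() + [
        (ANNO, "oa:hasTarget", target),
        (target, "rdf:type", "nif:String"),
    ]


with_selector = selector_triples("James Baker")
with_string_uri = string_uri_triples(0, 11)  # hypothetical offsets
print(len(with_selector), len(with_string_uri))
```

The selector-based version needs ten triples and the string-URI version seven, so the saving is real but modest, and the body still accounts for most of the remaining triples.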
CoNLL-RDF
CoNLL is a format family widely used in NLP
• tab-separated values
• one word per line
• one column per annotation type
• sentences separated by empty lines
• conventions for most types of word-based linguistic annotation
CoNLL Example
ID   Word      Lemma     POS        Inflection      Dependency Structure
1_1  Sie       sie       P  PPER    nom|pl|*|3      2   SB
1_2  dürfen    dürfen    V  VMFIN   pl|3|pres|ind   0   --
1_3  eine      ein       A  ART     acc|sg|fem      4   NK
1_4  Kopie     Kopie     N  NN      acc|sg|fem      12  OA
1_5  der       der       A  ART     gen|sg|fem      6   NK
1_6  Software  Software  N  NN      gen|sg|fem      4   AG
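Reading such a file takes only a few lines of Python. The sketch below assumes eight tab-separated columns named ID, WORD, LEMMA, POS_COARSE, POS, FEATS, HEAD and EDGE, and the function name parse_conll is made up for the example. Other members of the CoNLL family use different column sets, so the list would need adjusting. The row with ID 2_1 is a hypothetical repeat added only to show the sentence break.

```python
COLUMNS = ["ID", "WORD", "LEMMA", "POS_COARSE", "POS", "FEATS", "HEAD", "EDGE"]


def parse_conll(text, columns=COLUMNS):
    """Split CoNLL-style text into sentences, each a list of row dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} columns, got {len(fields)}: {line!r}"
            )
        current.append(dict(zip(columns, fields)))
    if current:
        sentences.append(current)
    return sentences


SAMPLE = (
    "1_1\tSie\tsie\tP\tPPER\tnom|pl|*|3\t2\tSB\n"
    "1_2\tdürfen\tdürfen\tV\tVMFIN\tpl|3|pres|ind\t0\t--\n"
    "\n"
    "2_1\tSie\tsie\tP\tPPER\tnom|pl|*|3\t2\tSB\n"  # hypothetical second sentence
)

sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print([(row["WORD"], row["POS"], row["HEAD"]) for row in sentence])
```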
CoNLL as RDF (simple)
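[Figure: word node 1_1 with literal values for WORD ("Sie"), LEMMA ("sie"), POS_COARSE ("P"), POS ("PPER"), FEATS ("nom|pl|*|3"), HEAD ("2") and EDGE ("SB"), linked to node 1_2 by nif:nextWord]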
CoNLL as RDF (better)
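[Figure: word node 1_1 with literal values for WORD ("Sie"), LEMMA ("sie"), POS_COARSE ("P"), POS ("PPER") and FEATS ("nom|pl|*|3"); instead of HEAD and EDGE literals, node 1_1 is connected to node 1_2 (WORD "dürfen", LEMMA "dürfen") by nif:nextWord and by a dependency link labelled SB]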
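The step from literal head numbers to links between word nodes can be sketched in Python with triples as plain tuples. Everything named here is an assumption for illustration: the base URI, the "conll:" property names and the function name are placeholders, and keeping the edge label as a literal next to the head link is just one possible encoding, not necessarily the one CoNLL-RDF uses.

```python
BASE = "http://example.org/corpus#"  # hypothetical base URI


def rows_to_triples(rows, base=BASE):
    """Turn parsed rows of one sentence into (subject, predicate, object) tuples.

    HEAD becomes a link to the head word's node instead of a number, so a
    query can follow it; a HEAD of "0" marks the root and gets no link.
    """
    triples = []
    for index, row in enumerate(rows):
        subject = f"{base}s{row['ID']}"
        sentence_no = row["ID"].split("_")[0]
        for column in ("WORD", "LEMMA", "POS_COARSE", "POS", "FEATS"):
            triples.append((subject, f"conll:{column}", row[column]))
        if row["HEAD"] != "0":
            head = f"{base}s{sentence_no}_{row['HEAD']}"
            triples.append((subject, "conll:HEAD", head))
            triples.append((subject, "conll:EDGE", row["EDGE"]))
        if index + 1 < len(rows):
            next_word = f"{base}s{rows[index + 1]['ID']}"
            triples.append((subject, "nif:nextWord", next_word))
    return triples


ROWS = [
    {"ID": "1_1", "WORD": "Sie", "LEMMA": "sie", "POS_COARSE": "P",
     "POS": "PPER", "FEATS": "nom|pl|*|3", "HEAD": "2", "EDGE": "SB"},
    {"ID": "1_2", "WORD": "dürfen", "LEMMA": "dürfen", "POS_COARSE": "V",
     "POS": "VMFIN", "FEATS": "pl|3|pres|ind", "HEAD": "0", "EDGE": "--"},
]

triples = rows_to_triples(ROWS)
for triple in triples:
    print(triple)
```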
CoNLL as an RDF Tree
[Figure: dependency tree over the word nodes 1_1 (Sie), 1_2 (dürfen), 1_3 (eine), 1_4 (Kopie), 1_5 (der), 1_6 (Software) and 1_12 (installieren)]
Easy to query with SPARQL
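The kind of question a tree makes easy is "give me everything below this word". The Python sketch below does that traversal by hand over the head numbers of tokens 1 to 6 from the CoNLL example; it stands in for a transitive query over head links and is not SPARQL itself. Token 12 is left out because its own row is not part of the example, and the function name is an assumption.

```python
from collections import defaultdict

# token number -> (word, head token number), from the CoNLL example rows 1-6.
TOKENS = {
    1: ("Sie", 2),
    2: ("dürfen", 0),
    3: ("eine", 4),
    4: ("Kopie", 12),
    5: ("der", 6),
    6: ("Software", 4),
}


def subtree(tokens, root):
    """Return the token numbers of `root` and all its dependents, in order."""
    children = defaultdict(list)
    for token_id, (_, head) in tokens.items():
        children[head].append(token_id)
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children[node])
    return sorted(found)


phrase = " ".join(TOKENS[token_id][0] for token_id in subtree(TOKENS, 4))
print(phrase)
```

Starting from token 4 (Kopie) the traversal collects the whole noun phrase, which is the sort of result a query following head links transitively would return.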
Conclusion
RDF is a powerful method of representing corpus annotations
But
• Not well adopted by many major projects
• Can be verbose and hard to read
• Limited tool support
This should change over the next few years.

