Mining Paper Catalogues
A Multilingual Solution to Reduce Verbose Fields to Consistent Terminology
Tim Evans¹, Felix Kußmaul²
23rd Annual Meeting EAA, Maastricht
31 August 2017
¹ Archaeology Data Service, University of York
² Archaeological Institute, University of Cologne
Data Source
Figure 1: Sample from Ettlinger, Conspectus.
T. Evans & F. Kußmaul (York, Cologne) Mining Paper Catalogues: 31 August 2017 2/22
Oh dear!
Problem
Running texts contain a lot of irrelevant information (for machine processing).
This makes database lookups without keywords extremely inefficient.
What we have: UNSTRUCTURED DATA
What we want: STRUCTURED DATA
{
"form": "23.1",
"origin": "Italy",
"decoration": "none",
"occurs": "uncommon"
},
{
"form": "23.2",
"origin": "Italy, not Padana",
"occurs": "Mediterranean region; North-Italy"
}
Information Extraction
Definition: Information Extraction (IE)
“[IE refers to] the identification and extraction of instances of a particular class
of events or relationships in a natural language text and their transformation
into a structured representation.” – Grishman 1997, Eikvil 1999
IE Process Pipeline
unstructured document → Tokenisation → Lemmatisation → POS-Tagging → NER → Information Extraction → structured data
Tools named in the figure (upper row): Stanford, ClearNLP, OpenNLP, CoreNLP, UIMA Ruta
Tools named in the figure (lower row): PosMapper, CoreNLP, MatePos, OpenNLP, CoreNLP
Figure 2: IE process pipeline.
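The first three stages of the pipeline can be sketched as a chain of functions, each consuming the output of the previous one. This is only a toy illustration: the two lexicons are hypothetical stand-ins for the trained models that the tools named in the figure provide, and the tags are copied from the fox example in Figure 3.

```python
import re

# Hypothetical toy lexicons; a real pipeline would use trained models.
LEMMAS = {"jumps": "jump"}
# Tags copied from the slide's Figure 3.
POS_TAGS = {"the": "DT", "quick": "JJ", "brown": "JJ", "fox": "NN",
            "jump": "VBD", "over": "IN", "lazy": "JJ", "dog": "NN", ".": "."}


def tokenise(text):
    """Split running text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatise(tokens):
    """Replace each token by its dictionary form where one is known."""
    return [LEMMAS.get(token.lower(), token) for token in tokens]


def pos_tag(lemmas):
    """Attach a part-of-speech tag to each lemma (UNK if not in the lexicon)."""
    return [(lemma, POS_TAGS.get(lemma.lower(), "UNK")) for lemma in lemmas]


def run_pipeline(text):
    """Chain the stages: tokenisation, then lemmatisation, then POS-tagging."""
    return pos_tag(lemmatise(tokenise(text)))


tagged = run_pipeline("The quick brown fox jumps over the lazy dog.")
print(tagged)
```

NER and the final extraction step would be further functions appended to the same chain.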
Named Entity Recognition
Token: The  quick  brown  fox  jump  over  the  lazy  dog  .
POS:   DT   JJ     JJ     NN   VBD   IN    DT   JJ    NN   .
(the surface form “jumps” has been lemmatised to “jump”)
Figure 3: POS-tagging examples after lemmatisation.
Adapting the NER
Most NERs (e. g. Stanford CoreNLP) only recognise 8 entity types:
PERSON DATE
ORGANIZATION TIME
LOCATION MONEY
PERCENT MISC
So we have to add the custom entity type FORM.
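As a rough idea of what a custom FORM entity could look like, the sketch below uses a single regular expression. The pattern is an assumption made for illustration (the word "Form" or "Subform" followed by a number such as 23 or 23.2); it is not the rule set used in the project, and `find_forms` is a hypothetical helper.

```python
import re

# Hypothetical rule: "Form" or "Subform" followed by a number such as 23 or 23.2.
FORM_PATTERN = re.compile(r"\b(?:Sub)?[Ff]orm\s+(\d+(?:\.\d+)?)")


def find_forms(text):
    """Return (start, end, label, form_number) for every FORM mention in text."""
    return [(m.start(), m.end(), "FORM", m.group(1))
            for m in FORM_PATTERN.finditer(text)]


sentence = "Subform 23.2 occurs in North Italy, whereas Form 23 is uncommon."
entities = find_forms(sentence)
for start, end, label, number in entities:
    print(label, number, sentence[start:end])
```

Character offsets are kept alongside the label so that the new entity type can be merged with the spans produced by a standard recogniser.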
Two approaches for NER
Rule-based approach
• High precision, but lower recall
⇒ Many, many rules?!
Figure 4: Excerpt from Gempeler, Elephantine X.
Machine-learning approach
• Lower precision, but high recall
• Needs to be trained!
Figure 5: Manually annotated sentence from Ettlinger, Conspectus in iepy.
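To make the precision/recall trade-off concrete, here is a small worked example. The annotations are made up for illustration and are not project results: the rule-based output finds few entities but all of them are right, while the learned output finds everything plus one false hit.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of entity annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations as (entity text, entity type) pairs.
gold = {("23.1", "FORM"), ("23.2", "FORM"),
        ("Italy", "LOCATION"), ("Padana", "LOCATION")}
rule_based = {("23.1", "FORM"), ("23.2", "FORM")}
learned = gold | {("Mediterranean", "FORM")}

print("rule-based:", precision_recall(rule_based, gold))
print("learned:   ", precision_recall(learned, gold))
```

Precision is the share of predicted entities that are correct; recall is the share of gold entities that were found.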
Temporal Expressions
With HeidelTime, temporal expressions are mapped to the TIMEX3 standard:
around 140 B. C. → APPROX BC0140
Spätes 3.–4. Jh. n. Chr. (German: “late 3rd–4th century A. D.”) → END 02; 03
second quarter first century B. C. → XXXX-Q2 BC00
first half third century A. D. → XXXX-H1 02
HeidelTime supports many other languages, e. g. German, Italian, French, …
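The notation in the last two examples can be mimicked with a few lines of code. This is not HeidelTime and does not use its API; it is a toy normaliser, with a hypothetical `normalise` function, that only covers expressions of the shape "<ordinal> quarter/half <ordinal> century B. C./A. D." and reproduces the value format shown above.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
PATTERN = re.compile(
    r"(first|second|third|fourth) (quarter|half) "
    r"(first|second|third|fourth) century (B\. ?C\.|A\. ?D\.)"
)


def normalise(expression):
    """Return the slide-style value for a matching expression, else None."""
    match = PATTERN.fullmatch(expression.strip())
    if match is None:
        return None
    part, unit, century, era = match.groups()
    unit_code = "Q" if unit == "quarter" else "H"
    prefix = "BC" if era.startswith("B") else ""
    # Century n is written as the two-digit number n - 1, as in the examples above.
    return f"XXXX-{unit_code}{ORDINALS[part]} {prefix}{ORDINALS[century] - 1:02d}"


print(normalise("second quarter first century B. C."))
print(normalise("first half third century A. D."))
```

Anything outside this narrow pattern returns `None`, which is where a full temporal tagger with per-language rules is needed.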
Relation Extraction
Subject          Relation   Object
quick brown fox  jump over  lazy dog
K 612            dates      03 [1]
Subform 23.2     occurs     North Italy
Subform 23.2     dates      XXXX-Q2 00 [2]
⇒ e. g.
{ "form": "23.2",
"dating": "XXXX-Q2 00" }
[1] “4th century A. D.”
[2] “second and third quarters of the first century A. D.”
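The step from extracted triples to a JSON record can be sketched as follows. The relation-to-field mapping and the rule for deriving the form number from the subject are assumptions made for this illustration only.

```python
import json

# Hypothetical mapping from relation names to JSON field names.
FIELD_FOR_RELATION = {"dates": "dating", "occurs": "occurs"}


def triples_to_records(triples):
    """Merge (subject, relation, object) triples into one record per form."""
    records = {}
    for subject, relation, obj in triples:
        field = FIELD_FOR_RELATION.get(relation)
        if field is None:
            continue  # relation is not one we store in the catalogue
        form = subject.split()[-1]  # e.g. "Subform 23.2" -> "23.2"
        record = records.setdefault(form, {"form": form})
        record[field] = obj
    return list(records.values())


triples = [
    ("quick brown fox", "jump over", "lazy dog"),
    ("Subform 23.2", "occurs", "North Italy"),
    ("Subform 23.2", "dates", "XXXX-Q2 00"),
]
records = triples_to_records(triples)
print(json.dumps(records, indent=2))
```

Triples with relations outside the mapping, such as the fox example, are simply dropped.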
MULTILINGUALISM
Background
Two problems:
• Linguistic
• Conceptual
Different languages
Different traditions
Figure 6: Plate, platter or dish?
Creating controlled vocabularies
Creating wordlists of terms that the project team considered most useful for
describing the key features of a vessel or sherd:
• Sherd type (e.g. rim or handle)
• Form (e.g. plate or bowl)
• Decoration form (e.g. burnished)
• Decoration color (e.g. yellow)
• Fabric (e.g. Dressel 28 fabric)
Lessons from ARIADNE
Used tools and methodology developed for the ARIADNE project by the
Hypermedia Research Group at the University of South Wales
• Created a neutral spine based on the Getty Research Institute’s Art &
Architecture Thesaurus (AAT)
• This spine was populated by members from partner organisations,
identifying common terms and concepts within it
• Project partners then mapped terms in their language to this neutral spine
• French terms supplied courtesy of a 2001 Master’s thesis by Caroline Sourzat
(thanks to Eleni Schindler Kaudelka for identifying this on the ArchAIDE blog!)
Mapping terms and concepts (part 1)
Often this was very straightforward, for example:
• The Italian terms graffita, graffita a punta, graffita a stecca = “sgraffito”
(http://vocab.getty.edu/aat/300266416)
• The Spanish term Cántaro = “jars” (http://vocab.getty.edu/aat/300195348)
• The German terms gebogener Henkel, Ohrförmiger Henkel, langer
Vertikalhenkel = “handles” (http://vocab.getty.edu/aat/300266416)
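A mapping of this kind can be held in a simple lookup table keyed by language and term. The sketch below uses the Italian and Spanish examples above with their concept identifiers as given; the table structure and the `to_concept` helper are hypothetical, not the format used by the ARIADNE tools.

```python
# Concept identifiers copied from the examples above; the table itself is hypothetical.
AAT = "http://vocab.getty.edu/aat/"
TERM_TO_CONCEPT = {
    ("it", "graffita"): AAT + "300266416",
    ("it", "graffita a punta"): AAT + "300266416",
    ("it", "graffita a stecca"): AAT + "300266416",
    ("es", "cántaro"): AAT + "300195348",
}


def to_concept(language, term):
    """Return the AAT concept URI for a local term, or None if unmapped."""
    return TERM_TO_CONCEPT.get((language, term.strip().lower()))


print(to_concept("es", "Cántaro"))
print(to_concept("it", "graffita a punta"))
print(to_concept("de", "Henkel"))
```

Several local terms may point to the same concept, which is what lets records in different languages be searched together.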
Mapping terms and concepts (part 2)
Often this was more complicated, with partners having differing perceptions of
what to call something (e.g. “plate” versus “platter”).
In truth, this confusion may also be reflected in what has come out of the ground!
An advantage of using the AAT (a “SKOS’d” thesaurus) is that ambiguity or
difference in nomenclature can be resolved by a broader term or concept, so for
example …
Mapping terms and concepts (part 3)
Looking at the hierarchies for plate and platter in the AAT, we can see that both
are “dishes (vessels for food)” or, even broader, “culinary containers”. So while
we can retain our original classifications (and this is essential for text mining), we
can agree at a fundamental level on what these are.
Figure 7: AAT Hierarchies for Plate and Platter
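Resolving two terms to a shared broader concept amounts to walking up the hierarchy from each term until the paths meet. The sketch below assumes a simplified, single-parent excerpt containing only the concepts named above; the real AAT has more levels and allows multiple parents.

```python
# Simplified, hypothetical excerpt of a broader-term hierarchy (child -> parent).
BROADER = {
    "plates": "dishes (vessels for food)",
    "platters": "dishes (vessels for food)",
    "dishes (vessels for food)": "culinary containers",
}


def ancestors(concept):
    """Return the concept followed by its chain of broader concepts."""
    chain = [concept]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def common_broader(a, b):
    """Return the most specific concept that both a and b fall under, or None."""
    chain_b = set(ancestors(b))
    for concept in ancestors(a):
        if concept in chain_b:
            return concept
    return None


print(common_broader("plates", "platters"))
```

The original labels stay attached to each record; the shared concept is only used when records need to be compared or searched together.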
Outlook
• Recognise reigns of emperors as DATE entities
• Coreferences in general
• HeidelTime:
second and third quarter of the first century A. D. → XXXX-Q3; 00
• Returning to difference in ceramic recording details
• Fabric names often contain locations, e. g. Magdalensberg xyz
• Location sometimes narrow, sometimes whole regions
• In many cases the form is not named in particular but just described
References
Ettlinger, Elisabeth. Conspectus formarum terrae sigillatae Italico modo confectae.
Ed. by Deutsches Archäologisches Institut zu Frankfurt and
Römisch-Germanische Kommission. Materialien zur römisch-germanischen
Keramik. Bonn: Habelt, 1990.
Gempeler, Robert D. Elephantine X. Die Keramik römischer bis früharabischer Zeit.
Mainz: Von Zabern, 1992.
Thank you very much for your attention!
Questions?
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement № 693548