Mining Paper Catalogues
A Multilingual Solution to Reduce Verbose Fields to Consistent Terminology

Tim Evans (1), Felix Kußmaul (2)
23rd Annual Meeting EAA, Maastricht
31 August 2017
(1) Archaeology Data Service, University of York
(2) Archaeological Institute, University of Cologne

Data Source
Figure 1: Sample from Ettlinger, Conspectus.

Oh dear!
Problem
Running texts contain a lot of irrelevant information (for machine processing).
This makes database lookups without keywords extremely inefficient.

What we have: UNSTRUCTURED DATA (running text)
What we want: STRUCTURED DATA
{
"form": "23.1",
"origin": "Italy",
"decoration": "none",
"occurs": "uncommon"
},
{
"form": "23.2",
"origin": "Italy, not Padana",
"occurs": "Mediterranean region; North-Italy"
}

Information Extraction
Definition: Information Extraction (IE)
“[IE refers to] the identification and extraction of instances of a particular class
of events or relationships in a natural language text and their transformation
into a structured representation.” – Grishman 1997, Eikvil 1999

IE Process Pipeline
unstructured document → Tokenisation → Lemmatisation → POS-Tagging → NER → Information Extraction → structured data
Tools named in the figure: Stanford, ClearNLP, OpenNLP, CoreNLP, UIMA Ruta, PosMapper, MatePos
Figure 2: IE process pipeline.
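
The first three stages of Figure 2 can be read as a chain of functions. The following is a minimal sketch in plain Python with toy stand-ins for each stage; it does not call any of the toolkits named in the figure, and the function names and the small lookup tables are assumptions made for illustration only.

```python
import re

# Toy lookup tables (assumptions); real pipelines use trained models instead.
LEMMAS = {"jumps": "jump"}
# Tags copied from the example in Figure 3.
POS_TAGS = {"the": "DT", "quick": "JJ", "brown": "JJ", "lazy": "JJ",
            "fox": "NN", "dog": "NN", "jump": "VBD", "over": "IN"}


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatise(tokens):
    """Lower-case each token and map it to a lemma via the toy table."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def pos_tag(lemmas):
    """Pair each lemma with a tag; punctuation tags itself, unknown words get 'X'."""
    tagged = []
    for lemma in lemmas:
        default = "X" if lemma.isalnum() else lemma
        tagged.append((lemma, POS_TAGS.get(lemma, default)))
    return tagged


def run_pipeline(text):
    """Chain the three stages in the order shown in the figure."""
    return pos_tag(lemmatise(tokenise(text)))


tagged = run_pipeline("The quick brown fox jumps over the lazy dog.")
```

Each stage consumes the output of the previous one, so a stage can be swapped for a real toolkit without touching the others.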

Named Entity Recognition
Token:  The  quick  brown  fox  jump  over  the  lazy  dog  .
Tag:    DT   JJ     JJ     NN   VBD   IN    DT   JJ    NN   .
(surface form before lemmatisation: “jumps”, lemma: “jump”)
Figure 3: POS-tagging examples after lemmatisation.

Adapting the NER
Most NERs (e. g. Stanford CoreNLP) only recognise 8 entity types:
PERSON DATE
ORGANIZATION TIME
LOCATION MONEY
PERCENT MISC
So we have to add the custom entity type FORM.
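
One simple way to recognise a custom FORM entity is a hand-written pattern. The sketch below assumes that a form is introduced by the keyword "Form" or "Subform" followed by a number such as 23 or 23.2; this pattern, the function name and the sample sentence are illustrative assumptions, not the rules used in the project.

```python
import re

# Assumed pattern: "Form" or "Subform", whitespace, then a number like 23 or 23.2.
FORM_PATTERN = re.compile(r"\b(?:Sub)?[Ff]orm\s+(\d+(?:\.\d+)?)")


def find_forms(text):
    """Return (entity_type, value, start, end) for every FORM mention in text."""
    return [("FORM", m.group(1), m.start(), m.end())
            for m in FORM_PATTERN.finditer(text)]


# Invented sample sentence, for illustration only.
sample = "Subform 23.2 occurs in North Italy, whereas Form 23 is uncommon."
entities = find_forms(sample)
```

Character offsets are kept alongside the value so that the matches can later be merged with entities found by other recognisers.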

Two approaches for NER
Rule-based approach
• High precision, but lower recall
⇒ Many many rules?!
Machine-learning approach
• Lower precision, but high recall
• Needs to be trained!

Rule-based approach
Figure 4: Excerpt from Gempeler,
Elephantine X.

Figure 5: Manually annotated sentence
from Ettlinger, Conspectus in iepy.

Temporal Expressions
With HeidelTime, temporal expressions are mapped to the TIMEX3 standard:
around 140 B. C. → APPROX BC0140
Spätes 3.–4. Jh. n. Chr. (“late 3rd–4th century A. D.”) → END 02; 03
second quarter first century B. C. → XXXX-Q2 BC00
first half third century A. D. → XXXX-H1 02
HeidelTime supports many other languages, e. g. German, Italian, French, …
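
In the values above, a century is written as the leading two digits of its years, so the n-th century becomes n − 1 (third century A. D. → 02, first century B. C. → BC00). The sketch below reproduces only this arithmetic; it is not HeidelTime's implementation or API, and the function names and the ordinal table are assumptions.

```python
# Small ordinal table (assumption); extend as needed.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}


def century_value(ordinal, era):
    """Return the n-th century as the two leading digits of its years (n - 1)."""
    prefix = "BC" if era == "BC" else ""
    return f"{prefix}{ORDINALS[ordinal] - 1:02d}"


def year_value(year, era):
    """Return a zero-padded four-digit year, prefixed with BC where applicable."""
    prefix = "BC" if era == "BC" else ""
    return f"{prefix}{year:04d}"


third_ad = century_value("third", "AD")   # "02"
first_bc = century_value("first", "BC")   # "BC00"
year_140_bc = year_value(140, "BC")       # "BC0140"
```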

Relation Extraction
Subject          Relation   Object
quick brown fox  jump over  lazy dog
K 612            dates      03 [1]
Subform 23.2     occurs     North Italy
Subform 23.2     dates      XXXX-Q2 00 [2]
⇒ e. g.
{ "form": "23.2",
"dating": "XXXX-Q2 00" }
[1] “4th century A. D.”
[2] “second and third quarters of the first century A. D.”
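
The step from subject–relation–object triples to a structured record can be sketched as a grouping by subject. The triples below are taken from the table; the mapping from relation names to field names and the function name are assumptions for illustration.

```python
import json
from collections import defaultdict

# Triples as listed in the table above.
TRIPLES = [
    ("Subform 23.2", "occurs", "North Italy"),
    ("Subform 23.2", "dates", "XXXX-Q2 00"),
]
# Assumed mapping from relation to output field name.
RELATION_TO_FIELD = {"occurs": "occurs", "dates": "dating"}


def triples_to_records(triples):
    """Group triples by subject and turn each group into one flat record."""
    grouped = defaultdict(dict)
    for subject, relation, obj in triples:
        grouped[subject][RELATION_TO_FIELD[relation]] = obj
    records = []
    for subject, fields in grouped.items():
        # Keep only the designation itself, e.g. "Subform 23.2" -> "23.2".
        records.append({"form": subject.split()[-1], **fields})
    return records


records = triples_to_records(TRIPLES)
print(json.dumps(records, indent=2))
```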

Background
Two problems:
• Linguistic
• Conceptual

Different languages

Different traditions
Figure 6: Plate, platter or dish?

Creating controlled vocabularies
Creating wordlists of the terms the project team would find most useful to describe the key
features of a vessel or sherd
• Sherd type (e.g. rim or handle)
• Form (e.g. plate or bowl)
• Decoration form (e.g. burnished)
• Decoration color (e.g. yellow)
• Fabric (e.g. Dressel 28 fabric)

Lessons from ARIADNE
Used tools and methodology developed for the ARIADNE project by the
Hypermedia Research Group at the University of South Wales
• Created a neutral spine based on the Getty Institute’s Art and Architecture
Thesaurus (AAT)
• This spine was populated by members from partner organisations,
identifying common terms and concepts within it
• Project partners then mapped terms in their language to this neutral spine
• French terms supplied courtesy of a 2001 Masters thesis by Caroline Sourzat
(thanks to Eleni Schindler Kaudelka for identifying this on the ArchAIDE blog!)

Mapping terms and concepts (part 1)
Often this was very straightforward, for example:
• The Italian terms graffita, graffita a punta, graffita a stecca = “sgraffito”
(http://vocab.getty.edu/aat/300266416)
• The Spanish term Cántaro = “jars” (http://vocab.getty.edu/aat/300195348)
• The German terms gebogener Henkel, Ohrförmiger Henkel, langer
Vertikalhenkel = “handles” (http://vocab.getty.edu/aat/300266416)
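
A straightforward mapping of this kind amounts to a lookup from a language-specific term to a neutral concept. The sketch below uses the Italian and Spanish examples from the list; the dictionary layout, the language codes and the function name are assumptions, not the format of the ARIADNE tools.

```python
AAT = "http://vocab.getty.edu/aat/"

# (language code, lower-cased term) -> (AAT label, AAT URI); layout is an assumption.
TERM_TO_CONCEPT = {
    ("it", "graffita"): ("sgraffito", AAT + "300266416"),
    ("it", "graffita a punta"): ("sgraffito", AAT + "300266416"),
    ("it", "graffita a stecca"): ("sgraffito", AAT + "300266416"),
    ("es", "cántaro"): ("jars", AAT + "300195348"),
}


def lookup(term, lang):
    """Return (label, uri) for a term in the given language, or None if unmapped."""
    return TERM_TO_CONCEPT.get((lang, term.strip().lower()))


jar = lookup("Cántaro", "es")
sgraffito = lookup("graffita a punta", "it")
unmapped = lookup("Teller", "de")
```

Returning None for unmapped terms keeps the gaps visible, so that they can be passed back to the partner responsible for that language.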

Often this was more complicated, with partners having differing views on
what to call something (e.g. “plate” versus “platter”).
In truth, this confusion may also be reflected by what has come out of the ground!
An advantage of using the AAT (a “SKOS’d” thesaurus) is that ambiguity or
difference in nomenclature can be resolved by a broader term or concept, so for
example …

Looking at the hierarchies for plate and platter in the AAT, we can see that both
are “dishes (vessels for food)” or, even broader, “culinary containers”. So while
we can retain our original classifications (and this is essential for text mining), we
can agree at a fundamental level on what these are.
Figure 7: AAT Hierarchies for Plate and Platter
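
Resolving two competing terms through a broader concept can be sketched as a walk up the hierarchy until the two chains meet. The table below contains only the links named on this slide and treats “culinary containers” as the direct parent of “dishes (vessels for food)”, which is a simplifying assumption; the real AAT may have intermediate levels, and the function names are illustrative.

```python
# Toy excerpt of broader-term links (assumption: intermediate AAT levels omitted).
BROADER = {
    "plate": "dishes (vessels for food)",
    "platter": "dishes (vessels for food)",
    "dishes (vessels for food)": "culinary containers",
}


def ancestors(concept):
    """Return the chain of broader concepts, nearest first."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def common_broader(a, b):
    """Return the nearest concept that is broader than both a and b, or None."""
    b_chain = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_chain:
            return candidate
    return None


shared = common_broader("plate", "platter")
```

The original labels stay attached to each record; only comparison and search fall back to the shared broader concept.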

Outlook
• Recognize reigns of emperors as DATE entities
• Coreferences in general
• HeidelTime:
second and third quarter of the first century A. D. → XXXX-Q3; 00
• Returning to differences in ceramic recording details
• Fabric names often contain locations, e. g. Magdalensberg xyz
• Location sometimes narrow, sometimes whole regions
• In many cases the form is not named in particular but just described

References
Ettlinger, Elisabeth. Conspectus formarum terrae sigillatae Italico modo confectae.
Ed. by Deutsches Archäologisches Institut zu Frankfurt and
Römisch-Germanische Kommission. Materialien zur römisch-germanischen
Keramik. Bonn: Habelt, 1990.
Gempeler, Robert D. Elephantine X. Die Keramik römischer bis früharabischer Zeit.
Mainz: Von Zabern, 1992.

Thank you very much for your attention!
Questions?
This project has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement № 693548
