Scaling up the Extraction of Canonical Citations in Classics

Prologue
Approach
Evaluation
Outlook
Scaling up the Extraction of Canonical Citations in Classics
Matteo Romanello (DAI / KCL) @mr56k
Humanités Numériques et Antiquité – 4 Sept. 2015

Prologue

Centrality of References
Referring as a scholarly primitive (Unsworth):
Referring, Discovering, Annotating, Comparing, Sampling, Illustrating, Representing
References in Classics:
canonical texts, fragmentary texts, inscriptions, papyri, manuscripts, coins, etc.
Ubiquity of References
journal articles, reviews, monographs, indexes, commentaries

Classical Commentaries

Enhanced Reading (1)
Line-by-line Bibliographical Database of Wolfram von Eschenbach’s
Parzival, http://wolfram.lexcoll.com/txts/index.htm

http://labs.jstor.org/shakespeare/

Segetes, http://segetes.io/aeneid

Hellespont Project
http://gapvis.hellespont.dainst.org/#book/1/read/113/

Trends in Text Reception
Neville Morley, Number Crunching,
http://thesphinxblog.com/2015/06/25/number-crunching/

Citation Networks
with applications to:
1 search
2 document clustering
3 formal network analysis

Approach

Rationale
beyond string-based search
references not quotations
scalable approach:
language independent
applicable to large amounts of documents
easily adaptable to different materials and ways of referencing
Examples:
In Statius’ « Achilleid » (2, 96-102) Achilles describes […]
e.g. Vergil, Aen. 12, 101-109 ; Lucan 1, 204-212 ; Statius,
Th. 12, 736-740 […]

Named Entity Recognition
Computer Science > Information Extraction > NER
Question Answering System
Q: where did Aaron Swartz die?
A: New York
Two days after the prosecution rejected a counter-offer by
Swartz, he was found dead in his Brooklyn, New York
apartment, where he had hanged himself.
3-step process:
1 Named Entity Recognition and Classification
2 Relation Extraction
3 Named Entity Disambiguation

Citation Extraction: Step 1 (NER)
Named Entities (= citation components):
AAUTHOR = ancient author
AWORK = ancient work
REFAUWORK = concise reference to author, work or both
(“Pliny, nat.”, “Thuc.”)
REFSCOPE = indication of the cited passage (“11, 4, 11”)
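
As a rough illustration (not taken from the talk's code), the four entity types could be modelled as a small Python enumeration, with each recognised entity stored as a typed text span. The class names, field names and the `annotate` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    AAUTHOR = "ancient author"
    AWORK = "ancient work"
    REFAUWORK = "concise reference to author, work or both"
    REFSCOPE = "indication of the cited passage"


@dataclass(frozen=True)
class Entity:
    type: EntityType
    text: str
    start: int  # character offset of the first character
    end: int    # character offset just past the last character


def annotate(sentence, spans):
    """Build Entity objects from (type, surface string) pairs found in sentence."""
    entities = []
    cursor = 0
    for etype, surface in spans:
        start = sentence.index(surface, cursor)
        end = start + len(surface)
        entities.append(Entity(etype, surface, start, end))
        cursor = end
    return entities


example = "Pliny, nat. 11, 4, 11"
entities = annotate(example, [
    (EntityType.REFAUWORK, "Pliny, nat."),
    (EntityType.REFSCOPE, "11, 4, 11"),
])
for entity in entities:
    print(entity.type.name, repr(entity.text), entity.start, entity.end)
```

Here the spans are supplied by hand; in the pipeline they would come from the NER step.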

Citation Extraction: Step 2 (Relation Detection)
reference as relation vs. reference as monolithic entity
binary scope relation between two entities (arguments)
arg1: aauthor | awork | refauwork
arg2: refscope
examples:
Ammianus (15, 8, 7)
Trabajos 159–173
Pliny, nat. 11, 4, 11
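
The slides do not spell out the rules used for this step. One plausible rule, shown here purely as an assumption, attaches each refscope to the closest preceding aauthor, awork or refauwork entity. Entities are plain (type, text) tuples in reading order, and the function name is hypothetical.

```python
ARG1_TYPES = {"aauthor", "awork", "refauwork"}


def scope_relations(entities):
    """Pair every refscope with the nearest preceding arg1 entity.

    entities: list of (type, text) tuples in the order they occur in the text.
    Returns a list of (arg1_text, arg2_text) pairs.
    """
    relations = []
    last_arg1 = None
    for etype, text in entities:
        if etype in ARG1_TYPES:
            last_arg1 = text
        elif etype == "refscope" and last_arg1 is not None:
            relations.append((last_arg1, text))
    return relations


entities = [
    ("aauthor", "Ammianus"),
    ("refscope", "15, 8, 7"),
    ("refauwork", "Pliny, nat."),
    ("refscope", "11, 4, 11"),
]
print(scope_relations(entities))
```

A rule of this shape only looks backwards, so it would miss a scope that is written before the work it belongs to.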

Citation Extraction: Step 3 (Disambiguation)
assign each author/work/canonical reference a unique ID
IDs are CTS URNs

Canonical Text Services (CTS) Uniform Resource Names (URNs)
A machine-readable syntax for canonical references [refs]
Pliny
urn:cts:latinLit:phi0978
Pliny’s NH
urn:cts:latinLit:phi0978.phi001
Pliny, Nat. 11,4,11
urn:cts:latinLit:phi0978.phi001:11.4.11
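
The three examples above share one colon-separated layout. The sketch below splits such a URN into its parts; it only handles the textgroup, work and passage components shown on this slide, and the `CtsUrn` type and `parse_cts_urn` function are hypothetical names, not part of any CTS library.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # e.g. "latinLit"
    textgroup: str            # e.g. "phi0978" (the author)
    work: Optional[str]       # e.g. "phi001", or None for an author-level URN
    passage: Optional[str]    # e.g. "11.4.11", or None if no passage is given


def parse_cts_urn(urn):
    """Split a CTS URN of the form urn:cts:namespace:textgroup[.work][:passage]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_parts = parts[3].split(".")
    textgroup = work_parts[0]
    # Further components (e.g. a version) are ignored in this simplified sketch.
    work = work_parts[1] if len(work_parts) > 1 else None
    passage = parts[4] if len(parts) == 5 else None
    return CtsUrn(namespace, textgroup, work, passage)


print(parse_cts_urn("urn:cts:latinLit:phi0978"))
print(parse_cts_urn("urn:cts:latinLit:phi0978.phi001:11.4.11"))
```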

The Extraction Pipeline

Evaluation

L’Année philologique (APh)
http://www.annee-philologique.com/

APh Example
APh 75-06697 => S. Braund & G. Gilbert. 2004. “An ABC of epic ira:
anger, beasts, and cannibalism” Yale Classical Studies 32:250-285
In Statius’ « Achilleid » (2, 96-102) Achilles describes his diet
of wild animals in infancy, which rendered him fearless and may
indicate another aspect of his character - a tendency toward
aggression and anger.
The portrayal of angry warriors in Roman epic is effected for
the most part not by direct descriptions but indirectly, by
similes of wild beasts (e.g. Vergil, Aen. 12, 101-109;
Lucan 1, 204-212; Statius, Th. 12, 736-740; Silius 5, 306-315).
These similes may be compared to two passages from
Statius (Th. 1, 395-433 and 8, 383-394) that portray the onset
of anger in direct narrative. Analysis of these passages
demonstrates that the concept of « ira » in epic takes its moral
aspect from the context.

The Data
APh
analytical reviews (en, de, fr, es, it)
80 volumes (1924-)
automatically processed vol. 75 (2004)
6,694 abstracts (total = 6,946; errors = 252)
350k tokens
3k citations
manually corrected ~8% of vol. 75
366 abstracts
26k tokens
380 citations

Precision, Recall and F1 Score
https://commons.wikimedia.org/wiki/File:Precisionrecall.svg
By Walber (Own work) [CC BY-SA 4.0]
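
For readers without the figure, the three measures can be written out as follows. This is the standard definition in terms of true positives (tp), false positives (fp) and false negatives (fn); the counts in the usage example are invented for illustration and are not results from the talk.

```python
def precision(tp, fp):
    """Share of extracted items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of correct items that were extracted."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 8 correct extractions, 2 spurious ones, 4 missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
print(f"P = {p:.2%}, R = {r:.2%}, F1 = {f1_score(p, r):.2%}")
```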

Evaluation Summary
Task    Precision   Recall    F1 Score
NER     79.24%      69.62%    73.88%
RelEx   93.33%      91.87%    92.60%
NED     61.04%      90.94%    73.05%
methods:
NER: machine learning-based
RelEx: rule-based
NED: rule-based + knowledge base
manually corrected ~8 % of vol. 75
366 abstracts
26k tokens
380 citations

NER: Method
1 Linguistic Features (PoS Tag, neighbouring words)
2 Word-level Features:
punctuation
final_dot, quotation_mark, has_hyphen, bracket
case
mixed_caps, all_caps, init_caps, all_lower
number
roman, year, range, mixed_alphanum
patterns
“Avien.” –> “Aaaaa-” (expanded)
“Avien.” –> “Aa-” (compressed)
3 Semantic Features (matches against dictionaries)
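
A minimal sketch of how some of the word-level features could be computed for a single token. The two pattern functions reproduce the “Avien.” examples above; mapping digits to "0", the regular expression for Roman numerals and the exact definitions of the case features are assumptions, since the slide only names the features.

```python
import re


def expanded_pattern(token):
    """Map each character to a class symbol: 'Avien.' -> 'Aaaaa-'."""
    symbols = []
    for ch in token:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")  # assumption: the slide shows no digit example
        else:
            symbols.append("-")
    return "".join(symbols)


def compressed_pattern(token):
    """Collapse runs of the same symbol: 'Avien.' -> 'Aa-'."""
    return re.sub(r"(.)\1+", r"\1", expanded_pattern(token))


def word_features(token):
    """Return a dictionary with a subset of the word-level features."""
    init_caps = token[:1].isupper() and token[1:].islower()
    return {
        "final_dot": token.endswith("."),
        "has_hyphen": "-" in token,
        "all_caps": token.isupper(),
        "all_lower": token.islower(),
        "init_caps": init_caps,
        "roman": bool(re.fullmatch(r"[IVXLCDM]+", token)),
        "pattern_expanded": expanded_pattern(token),
        "pattern_compressed": compressed_pattern(token),
    }


print(word_features("Avien."))
```

Feature dictionaries of this kind, one per token, are the usual input format for sequence classifiers.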

NER: Evaluation
Task: extraction of entities aauthor, awork, refauwork, refscope
Algorithm   Precision   Recall    F1 Score
CRF         79.24%      69.62%    73.88%
MaxEnt      75.29%      66.75%    70.43%
SVM         74.44%      70.21%    71.93%
aauthor: P = 91.15%, R = 39.67%, F1 = 54.53%

RelEx: Evaluation
rule-based method
Precision Recall F1 Score
93.33% 91.87% 92.60%
Missed scope relations:
du [REFSCOPE chant 4] de l’ [AWORK « Énéide » ]
Le [REFSCOPE livre 13 ] de la [AWORK « Chronique » ]
les [REFSCOPE v. 9–12 ] des [AWORK « Acharniens » ]

NED: Method
Thuc. I 89, 1s.
1 match reference against knowledge base
exact and approximate string matching
approximate string matching:
edit_distance("Virgilio","Virgil") = 2
Thuc. → urn:cts:greekLit:tlg0003.tlg001
2 normalise the reference scope
e.g. 1.89.1–1.89.2
1.89.1–2
1, 89, 1–2
I 89, 1s.
3 assign unique ID
urn:cts:greekLit:tlg0003.tlg001:1.89.1–1.89.2
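
Step 2 can be sketched as a small normaliser that maps the four notations listed above to the same dotted range. The parsing rules are assumptions inferred from those examples: Roman numerals are converted to Arabic ones, a shortened range end inherits the missing leading levels from the start, and a trailing "s." is read as "and the following one". The output uses a plain hyphen as range separator, and the function names are hypothetical.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XII' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(numeral) and ROMAN_VALUES[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def levels(part):
    """Turn '1, 89, 1' or 'I 89, 1' into [1, 89, 1]."""
    tokens = re.findall(r"[IVXLCDM]+|\d+", part)
    return [int(t) if t.isdigit() else roman_to_int(t) for t in tokens]


def normalise_scope(scope):
    """Normalise a reference scope to 'a.b.c' or 'a.b.c-a.b.d'."""
    s = scope.strip()
    following = re.search(r"(?<=\d)\s*s\.?$", s)
    if following:
        s = s[:following.start()]
    parts = re.split(r"\s*[–-]\s*", s, maxsplit=1)
    start = levels(parts[0])
    end = None
    if len(parts) == 2:
        end = levels(parts[1])
        missing = len(start) - len(end)
        if missing > 0:
            # '1.89.1–2' means '1.89.1–1.89.2'.
            end = start[:missing] + end
    elif following:
        end = start[:-1] + [start[-1] + 1]

    def dotted(numbers):
        return ".".join(str(n) for n in numbers)

    return dotted(start) if end is None else f"{dotted(start)}-{dotted(end)}"


for variant in ["1.89.1–1.89.2", "1.89.1–2", "1, 89, 1–2", "I 89, 1s."]:
    print(variant, "->", "urn:cts:greekLit:tlg0003.tlg001:" + normalise_scope(variant))
```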

NED: Knowledge Base
underlying model: HuCit, CIDOC-CRM & FRBRoo
usages:
extract abbreviations
resolve implicit refs, e.g. “Herod. 4, 5-7”
validate citations, e.g. “Thuc. 1.100.9.4”
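
One way the validation use case could work, sketched under the assumption that the knowledge base records how many citation levels each work has. The lookup table and the function name are hypothetical; the only fact relied on is that Thucydides is cited by book, chapter and section, so a four-level scope such as 1.100.9.4 cannot be valid.

```python
# Toy stand-in for the knowledge base: citation depth per work URN.
CITATION_DEPTH = {
    "urn:cts:greekLit:tlg0003.tlg001": 3,  # Thucydides: book.chapter.section
}


def is_valid_scope(work_urn, scope, depths=CITATION_DEPTH):
    """Check that a dotted scope is numeric and not deeper than the work allows."""
    scope_levels = scope.split(".")
    if not all(level.isdigit() for level in scope_levels):
        return False
    max_depth = depths.get(work_urn)
    return max_depth is not None and len(scope_levels) <= max_depth


print(is_valid_scope("urn:cts:greekLit:tlg0003.tlg001", "1.89.1"))
print(is_valid_scope("urn:cts:greekLit:tlg0003.tlg001", "1.100.9.4"))
```

A fuller check would also confirm that the cited passage actually exists in the text, which this sketch does not attempt.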

NED: Evaluation
Matching Type       Precision   Recall    F1 Score
Exact               58.33%      62.88%    60.52%
Approximate (n=4)   61.04%      90.94%    73.05%
Approximate (n=7)   58.94%      94.76%    72.67%
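
The slides do not define n. The sketch below reads it as a maximum edit distance, which is an assumption, and shows why a looser threshold finds more candidates: more knowledge-base labels fall within reach of a given mention. The distance is plain Levenshtein distance; the toy knowledge base, its labels and the `lookup` function are hypothetical, and only the Pliny URN is taken from the slides (the author-level Thucydides URN is derived from the work URN shown there).

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Toy knowledge base mapping name variants to CTS URNs.
KNOWLEDGE_BASE = {
    "Pliny": "urn:cts:latinLit:phi0978",
    "Plinius": "urn:cts:latinLit:phi0978",
    "Thucydides": "urn:cts:greekLit:tlg0003",
}


def lookup(mention, kb, max_distance):
    """Return (distance, label, urn) candidates within max_distance, best first."""
    scored = [
        (levenshtein(mention.lower(), label.lower()), label, urn)
        for label, urn in kb.items()
    ]
    return sorted(c for c in scored if c[0] <= max_distance)


print(lookup("Plinio", KNOWLEDGE_BASE, max_distance=0))  # exact matching only
print(lookup("Plinio", KNOWLEDGE_BASE, max_distance=4))  # approximate matching
```

With exact matching the Italian form "Plinio" finds nothing, while a threshold of 4 retrieves both Pliny labels. Raising the threshold further also admits unrelated labels, which matches the trade-off in the table: recall rises while precision drops.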

NED: Error Types
1 abbreviation is highly ambiguous (without context)
But Horace undermines the suggestion that his own
poetry will forever represent the Augustan Age.
Carm. 4, 15 in fact […]
2 ambiguous author mention
Esame dell’ esegesi papiracea ad Aristofane :
permanenza del lavoro degli eruditi alessandrini […]

NED: Error Types (contd.)
3 implied context (title of reviewed publ.)
Dans son chap. 5 sur le squelette et la respiration,
Lactance utilise des sources disparates et arrive aux
limites de son savoir médical.
4 ambiguously expressed reference
Analysis of the pederastic poems in the Theocritean
corpus (12 ; 23 ; 29 ; 30) reveals that Theocritus
reflects on mutuality in a relationship […]

Outlook

Future Plans
1 improve overall accuracy
test other methods for each processing step
more training data (and more specialised)
expand knowledge base
2 make software available to others
streamline installation
improve documentation
offer as a web service
offer as part of a research infrastructure
3 apply on a larger scale
improve performance (optimisation)
use of high performance and parallel computing

Thank you for your attention!
Links
matteo.romanello@gmail.com
https://github.com/mromanello/CRefEx
https://github.com/mromanello/APh_Corpus
