Giovanni Colavizza
Matteo Romanello (@mr56k)
Frédéric Kaplan (@frederickaplan)
The References of References:
Enriching Library Catalogs via Domain-Specific
Reference Mining
Goal
Empowering scholars in the Humanities with better IR systems
Motivation - the Scholar
Issues: lack of data [Sula and Miller, 2014] leads to an absence of services; the estimated coverage of the Humanities in Web of Science is circa 13% [Mingers and Leydesdorff, 2015].
Sciences:
• Google Scholar
• English
• mainly papers
• lower-cost information gathering
Humanities:
• no Google Scholar-like system
• multiple languages
• mainly monographs
• higher-cost information gathering
Motivation - the Footnote
How do humanists cite? Footnotes [see e.g. Hellqvist, 2009]
Motivation - the Archive
Approximately half of citations point to primary sources [Wiberley Jr., 2009]
Motivation - the Scholar reloaded
Proposal: Enriching library catalogs
Use reference monographs, the “canon” of the domain, to extract references to the rest of the literature and enrich library catalogs.
Project: Linked Books
Focused on a case study/domain:
the history of Venice.
Partners so far:
• Ca’ Foscari University Library System
• Biblioteca Marciana
• Istituto Veneto di Scienze, Lettere ed Arti
• Archivio di Stato di Venezia
• EPFL
The Pipeline
Corpus selection
Use the means of the library:
1. Consultation shelves
2. Dewey and subject classification
3. Scholarly bibliographies
4. Keyword search
Result: 1,904 monographs, 701 with a structured list of references.
The Pipeline - Digitization
Digitization
1,904 monographs + ~1,000 journal issues
The Pipeline - Annotation/Extraction/Parsing
Annotation
• annotated 27% of 701 monographs (with reference list)
• 3.8% of all digitized pages (with references)
• annotators identified 33 citation styles, divided into 6 families
• Yes, humanities scholars love customized reference styles!
Reference Extraction/Parsing
[Klinkhammer b-i-secondary-full] [Lutz, i-secondary-full] [L’occupazione i-secondary-full]
[tedesca i-secondary-full] [in i-secondary-full] [Italia i-secondary-full]
[1943-1945, i-secondary-full] [Torino, i-secondary-full] [Bollati i-secondary-full]
[Boringhieri i-secondary-full] [1993 i-secondary-full].
Klinkhammer Lutz, L’occupazione tedesca in Italia 1943-1945, Torino, Bollati Boringhieri 1993.
[Klinkhammer author] [Lutz, author] [L’occupazione title] [tedesca title]
[in title] [Italia title] [1943-1945, title] [Torino, publicationplace]
[Bollati publisher] [Boringhieri publisher] [1993 publicationyear].
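The parsed example above assigns one label per token. A minimal sketch of how such token/label pairs could be folded back into a structured record is given here; the function name `group_fields`, the punctuation stripping, and the "; " policy for repeated labels are illustrative assumptions, not the project's actual code.

```python
from itertools import groupby

# Token/label pairs copied from the parsed example on the slide.
tagged = [
    ("Klinkhammer", "author"), ("Lutz,", "author"),
    ("L’occupazione", "title"), ("tedesca", "title"), ("in", "title"),
    ("Italia", "title"), ("1943-1945,", "title"),
    ("Torino,", "publicationplace"),
    ("Bollati", "publisher"), ("Boringhieri", "publisher"),
    ("1993", "publicationyear"),
]


def group_fields(pairs):
    """Merge consecutive tokens that share a label into one field string."""
    fields = {}
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        # Join the run and drop trailing separators such as "," or ".".
        text = " ".join(token for token, _ in run).rstrip(",.")
        # Assumed policy: if a label shows up again later, keep both pieces.
        fields[label] = f"{fields[label]}; {text}" if label in fields else text
    return fields


record = group_fields(tagged)
for name, value in record.items():
    print(f"{name}: {value}")
```

A record in this shape (author, title, place, publisher, year) is the kind of input a catalog lookup needs.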
Extraction/Parsing - Evaluation
Extraction/Parsing - Confusion Matrix
Confusion matrix (figure); visible class labels: null, author, title, abbrev. (E), monograph (E)
Task 1: F1 score (avg) 0.806; class=“null” 0.609
Task 2: F1 score (avg) 0.842; class=“end abbreviated” 0.242
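For readers less familiar with the metric, here is a small sketch of how per-class and averaged F1 scores can be derived from a confusion matrix. The counts are invented toy values, not the project's results, and the slide does not say which kind of average is reported, so the unweighted (macro) average used here is an assumption.

```python
def per_class_f1(matrix, labels):
    """matrix[i][j] = number of tokens with true label i predicted as label j."""
    scores = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp   # predicted k, true label differs
        fn = sum(matrix[k]) - tp                  # true k, predicted label differs
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy counts for three of the classes named on the slide (values are made up).
labels = ["null", "author", "title"]
toy_matrix = [
    [6, 2, 2],  # true null
    [1, 8, 1],  # true author
    [1, 0, 9],  # true title
]

scores = per_class_f1(toy_matrix, labels)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, round(macro_f1, 3))
```

A class with a low score, such as “null” in Task 1, pulls the average down even when the other classes are tagged well.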
The Pipeline - Lookup
Lookup
1. Against OPAC SBN (via API)
Goal: disambiguation of references
Steps:
1. search candidates by title
2. match reference metadata
3. assign each candidate a confidence score
4. return set of candidates
Evaluation:
• 2k references (out of 181k)
• 41.7% no candidates
• 58.3% with candidates:
  • 72.3% -> first candidate correct
Issues:
• OCR errors -> impact on search by title (low recall)
• API as a “black box” + bottleneck of search by title
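The four lookup steps can be sketched as follows. This is only an illustration: a small in-memory list stands in for the records an OPAC search would return (the real SBN API is not called or modelled here), the second record is invented, and the similarity measure, cutoff and weights are arbitrary assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for catalog records; "rec-2" is an invented entry.
CATALOG = [
    {"id": "rec-1", "title": "L'occupazione tedesca in Italia 1943-1945",
     "author": "Klinkhammer, Lutz", "year": "1993"},
    {"id": "rec-2", "title": "Storia di Venezia",
     "author": "Autore, Esempio", "year": "1990"},
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(reference, catalog=CATALOG, title_cutoff=0.6):
    """Return (confidence, record id) pairs, best candidate first."""
    candidates = []
    for record in catalog:
        # Step 1: search candidates by title (fuzzy, to tolerate some OCR noise).
        title_score = similarity(reference["title"], record["title"])
        if title_score < title_cutoff:
            continue
        # Step 2: match the remaining reference metadata.
        author_score = similarity(reference.get("author", ""), record["author"])
        year_score = 1.0 if reference.get("year") == record["year"] else 0.0
        # Step 3: assign a confidence score (the weights are assumptions).
        confidence = 0.5 * title_score + 0.3 * author_score + 0.2 * year_score
        candidates.append((round(confidence, 3), record["id"]))
    # Step 4: return the set of candidates, ranked by confidence.
    return sorted(candidates, reverse=True)


# A parsed reference with an OCR-style error in the title ("ltalia").
ref = {"title": "L'occupazione tedesca in ltalia 1943-1945",
       "author": "Klinkhammer Lutz", "year": "1993"}
results = lookup(ref)
print(results)
```

Because candidates are first filtered by title, a badly OCR'd title can drop the correct record before any other metadata is compared, which is the recall bottleneck listed under Issues.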
Lookup
2. Against metadata of digitized books
Goal: verify cohesiveness of digitized corpus
Method:
• based on SBN lookup
• but lookup against digitization metadata
• tuned to maximize precision
• returns 1 or no matches
Evaluation*:
• 500 references (out of 181k)
• precision ~ 1.00
• recall > 0.95
Result:
• only 7% of references extracted from 701 monographs point inwards (i.e. towards the 1,904 monographs)
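One way to obtain a matcher that "returns 1 or no matches" and favours precision is to accept the top candidate only when it is both strong and clearly ahead of the runner-up. The sketch below illustrates that idea together with the precision/recall computation; the thresholds, function names and toy data are assumptions, not the project's settings.

```python
def best_or_none(scored, min_score=0.9, min_margin=0.1):
    """scored: list of (score, record_id). Return one id, or None when unsure."""
    ranked = sorted(scored, reverse=True)
    if not ranked or ranked[0][0] < min_score:
        return None  # nothing convincing enough
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin:
        return None  # ambiguous: two near-equal candidates
    return ranked[0][1]


def precision_recall(predicted, gold):
    """predicted/gold: parallel lists of record ids, None meaning 'no match'."""
    tp = sum(1 for p, g in zip(predicted, gold) if p is not None and p == g)
    returned = sum(1 for p in predicted if p is not None)
    relevant = sum(1 for g in gold if g is not None)
    precision = tp / returned if returned else 1.0
    recall = tp / relevant if relevant else 1.0
    return precision, recall


# Toy candidate lists for four references, with hand-assigned gold answers.
scored_lists = [
    [(0.97, "a"), (0.40, "b")],   # clear winner
    [(0.93, "c"), (0.91, "d")],   # too close to call
    [(0.55, "e")],                # too weak
    [],                           # no candidates at all
]
gold = ["a", "c", None, None]

predicted = [best_or_none(s) for s in scored_lists]
precision, recall = precision_recall(predicted, gold)
print(predicted, precision, recall)
```

Raising `min_score` or `min_margin` trades recall for precision: fewer references get a match, but the matches returned are more likely to be right.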
Core of the discipline
• co-citation network from extracted references*
• giant component = 59% of selected corpus
• books in the giant component -> core of reference works on history of Venice
• giant component -> 32.5% with only works in consultation
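As a rough illustration of the analysis on this slide, the sketch below builds a co-citation network (two works are linked when the same monograph cites both) and extracts its largest connected component. The citation data is invented, and measuring the share over cited works is a simplifying assumption; the slide reports the share over the selected corpus.

```python
from collections import deque
from itertools import combinations

# Invented toy data: citing monograph -> works it cites.
citations = {
    "M1": {"A", "B", "C"},
    "M2": {"C", "D"},
    "M3": {"E", "F"},
    "M4": {"G"},
}


def cocitation_graph(citations):
    """Link every pair of works cited together by the same monograph."""
    graph = {}
    for cited in citations.values():
        for work in cited:
            graph.setdefault(work, set())  # keep works with no co-citations too
        for u, v in combinations(sorted(cited), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def components(graph):
    """Connected components via breadth-first search, largest first."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        result.append(comp)
    return sorted(result, key=len, reverse=True)


graph = cocitation_graph(citations)
giant = components(graph)[0]
share = len(giant) / len(graph)
print(sorted(giant), round(share, 2))
```

In this toy case works A to D form the giant component because C is cited by both M1 and M2, which bridges their reference lists.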
Conclusions and Outlook
• data- and citation-driven approach to assess and exploit, from an IR point of view, domain-specific library holdings on the history of Venice
• next big challenge: extraction, consolidation and disambiguation of references contained within footnotes (journals)
Giovanni Colavizza
Matteo Romanello (@mr56k)
Frédéric Kaplan (@frederickaplan)
Thank you!
go.epfl.ch/linkedbooks

