How to read a million books
Clemens Neudecker
Staatsbibliothek zu Berlin
17 May 2017
How to read* a million newspapers?
(* hint: do try this at home)
https://xkcd.com/1838
Agenda
• Digital Collections
• Data, Tools, Formats
• Europeana Newspapers
• NLP challenges
• Experiments & use cases
About me
• Research Coordinator @ Berlin State Library
• M.A. Philosophy, Computer Science, Political Science
• Mostly curious about
– Optical Character Recognition, Document Analysis
– Natural Language Processing
– Digital Humanities
• More: @cneudecker, cneud.net
Staatsbibliothek zu Berlin
• Established 1661 as the Library of the King of Prussia
• Today largest research library in Germany,
with approx. 11.5m volumes (23m objects)
• Part of the „Stiftung Preußischer Kulturbesitz“,
a unique union of museums, archives, libraries
and research institutes from Berlin
• http://staatsbibliothek-berlin.de/
Digitisation 2.0
Europeana
• http://www.europeana.eu/portal/
• Europe's Digital Library
• > 53m objects incl.
art, sound, fashion,...
• API: http://labs.europeana.eu/api
DPLA
• https://dp.la/
• Digital Public
Library of America
• > 16m objects
• API: https://dp.la/info/developers/codex/
Hathi Trust
• https://www.hathitrust.org/
• Public copy of
Google Books
• > 15m volumes
• API: https://www.hathitrust.org/data
DDB
• http://ddb.de/
• Germany's federal
Digital Library
• > 9m objects
• API: https://api.deutsche-digitale-bibliothek.de/
Trove
• http://trove.nla.gov.au/
• Digital Library
of Australia
• > 540m objects
• API: http://help.nla.gov.au/trove/building-with-trove/api
Formats & Standards
• What data is available?
• Typically, a digital object is composed of:
– Scanned Images in TIFF, JP2 or JPEG
– Descriptive metadata in DublinCore
– Structural metadata in METS
– Text content in ALTO or TEI
– Europeana metadata in EDM (Europeana Data Model)
– Linked Data in RDF or JSON-LD
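To make the text layer a little more concrete, here is a minimal sketch that pulls plain text out of an ALTO file using only the Python standard library. It assumes the usual ALTO layout in which each `TextLine` element holds `String` elements with a `CONTENT` attribute. The embedded sample and the function names `local_name` and `alto_lines` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment (illustrative only, not a real newspaper page).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Hamburger"/><SP/><String CONTENT="Nachrichten"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Sonnabend"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on the local tag name rather than a fixed namespace is a deliberate shortcut here, since ALTO files in the wild declare different schema versions.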
Tools (1/3)
• OAI-PMH
– https://pypi.python.org/pypi/Sickle
– https://pypi.python.org/pypi/pyoai
– https://pypi.python.org/pypi/oaiharvest
• METS
– https://pypi.python.org/pypi/metsrw
– https://pypi.python.org/pypi/pymets
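The OAI-PMH packages listed above wrap a fairly small protocol. As a rough sketch of what happens underneath (it does not use any of those packages), the snippet below builds a `ListRecords` request URL and parses a response with the standard library. The base URL, the sample response and the function names are invented for illustration; the `verb`, `metadataPrefix` and `resumptionToken` parameters come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written response, shaped like an OAI-PMH ListRecords answer (illustrative only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example newspaper issue</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, [titles])], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [t.text for t in record.iterfind(".//dc:title", NS)]
        records.append((identifier, titles))
    token = root.findtext(".//oai:resumptionToken", default=None, namespaces=NS)
    return records, token or None


records, token = parse_list_records(SAMPLE_RESPONSE)
print(records, token)
```

A harvester would keep requesting `list_records_url(base, resumption_token=token)` until no token comes back.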
Tools (2/3)
• DublinCore
– https://pypi.python.org/pypi/pydc
– https://pypi.python.org/pypi/dcxml
• Europeana
– https://pypi.python.org/pypi/europeana-search
– https://pypi.python.org/pypi/django-europeana
Tools (3/3)
• IIIF
– https://pypi.python.org/pypi/iiif/
– https://pypi.python.org/pypi/Flask-IIIF/
• KB NL
– https://pypi.python.org/pypi/kb
– https://github.com/KBNLresearch/intro-kb-apis
– http://lab.kb.nl/
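For IIIF, the part most people touch first is the Image API URL pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The helper below is a small sketch of that pattern and does not use the packages listed above. The server address, the identifier and the function name `iiif_image_url` are hypothetical. Note that `max` as a size value assumes Image API 2.1 or later; older servers expect `full`.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path components."""
    # Identifiers may contain '/', which has to be percent-encoded in the path.
    safe_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{safe_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Request a 500x500 pixel crop from the top-left corner, scaled to 250 px width.
print(iiif_image_url("https://example.org/iiif", "page/0001",
                     region="0,0,500,500", size="250,"))
```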
Europeana Newspapers
• EU project to make Europe's historical
newspapers searchable & accessible
• http://www.europeana-newspapers.eu/
Europeana Newspapers Collection
• Full text of 12 million historic newspaper pages
(> 10,000,000,000 tokens)
• 40 languages, 4 alphabets
• 400 years (1618 – 2016)
• http://www.theeuropeanlibrary.org/tel4/newspapers
OCR / OLR
(U.lag nul «chestttetrung- ■geeinoel II, Setch«it,zen I—Ig Ufr sterntpeechee g» U II.
für ftrene-geingelpilche: 13 01191 nnd 13 03 11 io"gl f l««lt-beOeu; OetHn *1,
blnftraße IS IZeinsptechee; H I Sanemeinummet gurfilrft 8MB); ««de«: gdn.o(tio||e III
(ZemlpreAei 284.3»). Iie.gonlen nur nnier heimnnn » Erden bei der veutlchen Bonl
»n« Vtdennld-Getelltchoil gttloto bumduig. Commerz- nndprinoldonlN voINchrSomI
bomduig u 189 ß>, .»ontbeegee Kochelchlen- eitchelne» 12 mal wSchenNIch. täglich
zweimal — morgen« nnd ndendn —, Sonntage nnr morgen». Toonlnge nur abend»
Zn den Kochdorerlen wird die Ndend-Nuegode noch am üben!
Dieser Entwurf ist. wie Bürgermeister Roß mitteilte, den Fraktionen zur
Stellungnahme vorgelegt worden. Zum Donnerstag war eine zweite Sitzung der
Fraktionsfübrer vom Vertreter des Senats angeordnet worden, zu der ober zwei
Fraktionen, die Teutschnationalen und die Nationalsozialisten, nicht erschienen
waren. Von den Nationalsozialisten ist kurz vor Beginn der Sitzung eine telephonische
Erklärung abgegeben worden, etwa des Inhalts, daß die Fraktion sich den sachlichen
Verhandlungen entziehen müsse, solange nicht gewisse Vorbedingungen erfüllt sein.
http://www.theeuropeanlibrary.org/tel4/newspapers/issue/Hamburger_Nachrichten/1932/12/31
https://github.com/cneud/alto-tools
Performance
Figure: Bag of Words OCR Evaluation, per language (bar chart)
• y-axis: success rate (0% to 100%); x-axis: language setting
• Bar values as extracted (language labels not recoverable): 82.4%, 85.3%, 80.9%, 75.9%, 67.5%, 83.4%, 84.1%, 68.1%, 93.1%, 57.6%, 87.0%, 68.3%, 76.1%, 82.6%, 54.1%, 32.7%
Figure: Layout Analysis Performance, per evaluation profile (bar chart)
• y-axis: success rate (harmonic, area based), 0% to 100%; x-axis: evaluation profile
• Evaluation profiles: keyword search; phrase search; access via content structure; print/ebook on demand; content based image retrieval
• Bar values as extracted: 79.1%, 62.2%, 55.9%, 58.8%, 94.7%
Experimental(!) Downloads
• http://data.theeuropeanlibrary.org/download/newspapers-by-country/README.html
• http://research.europeana.eu/itemtype/newspapers
• http://test-solr-mongo.eanadev.org/europeana-research-newspapers-dump/sample-2017-04-26/Staatsbibliothek_zu_Berlin_Preu%253Fischer_Kulturbesitz/titles.html
OCR
• EU: IMPACT project (2008-2012)
• US: eMOP project (2013-2015)
• DE: OCR-D project (2016-2018)
• Google:
– Tesseract
– ocropy (fka OCRopus)
– Aksara
Named Entity Recognition
• 3 Categories:
– PERSON; LOCATION; ORGANIZATION
• 3 Languages:
– Dutch; French; German
• Powered by Stanford CoreNLP - CRF-NER
Annotations
Absolute counts:
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Share of tokens:
Language   tokens   PER     LOC     ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%
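Counts and shares like these can be derived from a token-per-line annotation file. The sketch below assumes a CoNLL-style layout (one token per line, the tag in the last whitespace-separated column, `O` for untagged tokens, optional `B-`/`I-` prefixes, blank lines between sentences). That layout, the sample text and the function name `tag_shares` are assumptions for illustration. The slides do not say whether the table counts tagged tokens or entity mentions; this sketch counts tagged tokens.

```python
from collections import Counter

# Made-up sample in an assumed token/tag layout.
SAMPLE = """\
Bürgermeister O
Roß B-PER
sprach O
in O
Hamburg B-LOC
. O
"""


def tag_shares(conll_text):
    """Return (token count, tagged tokens per category, share of tokens per category)."""
    counts = Counter()
    n_tokens = 0
    for line in conll_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line, e.g. a sentence boundary
        n_tokens += 1
        tag = parts[-1]
        if tag != "O":
            counts[tag.split("-")[-1]] += 1  # "B-PER" and "I-PER" both count as "PER"
    shares = {cat: n / n_tokens for cat, n in counts.items()} if n_tokens else {}
    return n_tokens, dict(counts), shares


print(tag_shares(SAMPLE))
```

The same arithmetic applied to the table gives, for example, 7,914 / 96,735 ≈ 8.18% for German PER.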
Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
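A bag-of-words error rate ignores word order and only asks how many ground-truth words can be found in the OCR output. The sketch below shows one plausible way to compute it with multisets; the exact definition used by the project's evaluation tooling is not given in the slides, so treat the formula and the function name as assumptions. The example pair reuses the "ober" misreading from the Hamburger Nachrichten sample, assuming the intended word is "aber".

```python
from collections import Counter


def bag_of_words_error_rate(ground_truth, ocr_output):
    """Share of ground-truth words without a match in the OCR output (order ignored)."""
    gt = Counter(ground_truth.split())
    ocr = Counter(ocr_output.split())
    total = sum(gt.values())
    if total == 0:
        return 0.0
    # Multiset intersection: each OCR word can match at most one ground-truth word.
    matched = sum((gt & ocr).values())
    return 1.0 - matched / total


truth = "zu der aber zwei Fraktionen"
ocr = "zu der ober zwei Fraktionen"
print(round(bag_of_words_error_rate(truth, ocr), 3))
```

Because order is ignored, this measure is robust against reading-order mistakes, which is why layout quality is reported separately.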
Evaluation
Dutch French
Challenges
• https://github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
Lack of metadata
Issue
There is no associated metadata for the annotated
text (newspaper title, date, etc.)
Solution
Automatically match lines with newspaper pages
through keyword search
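One simple way to picture this matching step is an inverted index over the page texts, with each annotated line voting for the pages that share its longer words. This is only a sketch of the general idea, not the procedure actually used for the corpus; the page identifiers, the length threshold and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each lowercased token to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in set(text.lower().split()):
            index[token].add(page_id)
    return index


def best_page(line, index, min_len=4):
    """Return the page sharing the most distinct keywords with the line, or None."""
    votes = Counter()
    for token in set(line.lower().split()):
        if len(token) >= min_len:  # skip short, uninformative words
            for page_id in index.get(token, ()):
                votes[page_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


pages = {
    "hamburger_nachrichten_1932-12-31_p1":
        "Dieser Entwurf ist den Fraktionen zur Stellungnahme vorgelegt worden",
    "other_page": "Sonntags nur morgens",
}
index = build_index(pages)
print(best_page("den Fraktionen zur Stellungnahme vorgelegt", index))
```

With noisy OCR on both sides, exact token matches will miss some lines, so a real pipeline would likely need fuzzy matching on top of this.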
OCR errors vs. historical spelling
Issue
Text contains OCR errors but also
valid(!) historical spelling variants
Solution
Document language profiling to distinguish
OCR errors and spelling variants
Sentence splits
Issue
During data pre-processing, (parts of)
sentences have been erroneously cut
Solution
Reconstruct sentences through keyword
search and matching procedure
Hyphenation
Issue
Text contains line-break hyphenation that should be
removed, but hyphens also occur in regular text
Solution
Use a tokenizer to determine hyphens to be
removed
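To illustrate the decision that has to be made for every line-end hyphen, here is a small sketch that swaps in a vocabulary lookup rather than a tokenizer: a word split across a line break is joined without the hyphen only if the joined form is a known word. The vocabulary, the sample lines and the function name `dehyphenate` are made up for illustration.

```python
def dehyphenate(lines, vocabulary):
    """Join words split across line ends; keep the hyphen unless the joined form is known."""
    out = []
    carry = ""  # word fragment ending in '-' carried over from the previous line
    for line in lines:
        words = line.split()
        if carry and words:
            head = words[0]
            joined = carry[:-1] + head
            if joined.lower() in vocabulary:
                words[0] = joined        # line-break hyphen: drop it
            else:
                words[0] = carry + head  # genuine hyphen: keep it
            carry = ""
        if words and words[-1].endswith("-") and len(words[-1]) > 1:
            carry = words.pop()
        out.extend(words)
    if carry:
        out.append(carry)
    return " ".join(out)


lines = ["eine telephonische Er-", "klärung zur Nord-", "Süd-Frage"]
print(dehyphenate(lines, vocabulary={"erklärung"}))
```

For historical text the vocabulary would itself have to cover spelling variants, which ties this problem back to the OCR-error issue above.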
Missing tags
Issue
Human operators forgot to tag some entities
or tagged them with the wrong category
Solution
???
Punctuation
Issue
According to the CoNLL format, punctuation should be on
a separate line from the token, but
abbreviations…
Solution
???
https://altomator.github.io/EN-data_mining/
https://github.com/altomator/EN-data_mining
http://www.kbresearch.nl/dictionary/
https://github.com/jlonij/dictionary-viewer
http://www.kbresearch.nl/telraam/
https://gist.github.com/WillemJan/6ab02c48af576ba47b68
http://ngramviewer.kbresearch.nl/
https://bitbucket.org/ilps/pm-ngramviewers-kbkranten
http://www.digitalvictorianist.com/
https://twitter.com/VictorianHumour
https://github.com/BL-Labs/embellishments
http://networks.viraltexts.org/1836to1899/index.html
https://github.com/dasmiq/passim
EUROPEANA
TRANSCRIBATHON
CAMPUS BERLIN 2017
22-23 June 2017
Berlin State Library
http://pro.europeana.eu/event/europeana-transcribathon-campus-2017
Thank you for your attention!
Clemens Neudecker
Staatsbibliothek zu Berlin
17 May 2017
