IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 
An Experimental Workflow Development 
Platform for Historical Document 
Digitisation and Analysis 
Clemens Neudecker, KB National Library of the Netherlands 
International workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
Background 
 IMPACT – Improving Access to Text (2008 – 2011) 
Large-scale integrating research project, funded by the EC 
Main objectives: 
- Innovate OCR technology 
- Capacity building in mass-digitisation 
 From a technical perspective: 
More than 20 software toolkits for solving specific issues 
Prototyping new algorithms 
“One ring to rule them all…” 
 IMPACT Interoperability Framework (IIF)
Main requirements 
Behavioural: 
 Minimize integration effort 
 Minimize deployment effort 
 Maximize usability 
 Maximize scalability 
Functional: 
 Modular 
 Transparent 
 Expandable 
 Open source 
 Platform independent
Architecture 
 IMPACT Interoperability Framework: Technologies 
- Java 6 
- Generic Web Service Wrapper 
- Apache Ant/Maven 
- Apache Tomcat/httpd 
- Apache Axis2 
- Apache Synapse 
- Taverna Workflow Engine 
 IMPACT Interoperability Framework: Dataset 
- more than 500,000 images from digital libraries 
- more than 25,000 ground truth transcriptions
So how does it work? 
1. Digitisation/OCR challenges registered and tagged in database 
2. Database contains a 99.99% accurate reference result: “ground truth” 
3. Researcher develops new method to tackle a problem 
4. Research prototype is wrapped as a web service 
5. Web service is integrated as a workflow module 
6. Workflow module can be evaluated, combined, etc.
Framework integration 
 Easy-to-use generic command-line wrapper (open source)
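The idea of wrapping a command-line tool can be sketched in a few lines. This is a minimal illustration only, not the IMPACT wrapper itself: the `ToolDescription` class, the `{input}`/`{output}` placeholder convention and the `run_tool` and `demo` functions are all assumptions made for the example, and a real wrapper would expose `run_tool` behind a web service endpoint.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ToolDescription:
    """Hypothetical description of a command-line tool to be wrapped."""
    name: str
    command: list  # argument template; "{input}" and "{output}" are placeholders


def run_tool(tool: ToolDescription, input_path: str, output_path: str) -> int:
    """Fill in the placeholders, run the tool and return its exit code."""
    argv = [part.format(input=input_path, output=output_path) for part in tool.command]
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode


def demo() -> str:
    """Wrap a stand-in tool (a Python one-liner that upper-cases a file) and run it."""
    script = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
    tool = ToolDescription("uppercase", [sys.executable, "-c", script, "{input}", "{output}"])
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir) / "in.txt"
        target = Path(workdir) / "out.txt"
        source.write_text("abc")
        exit_code = run_tool(tool, str(source), str(target))
        if exit_code != 0:
            raise RuntimeError(f"{tool.name} failed with exit code {exit_code}")
        return target.read_text()
```

Describing the tool as data rather than code is what keeps the integration effort low in this sketch: adding a new tool means writing a new description, not a new wrapper.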
Workflow development 
 OCR workflow = 
data pipeline 
 Building blocks = 
processing steps (nodes) 
 Integration = 
interaction between nodes 
(mashup) 
 Collaboration with
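The view of an OCR workflow as a data pipeline of nodes can be sketched as follows. The node names (`binarise`, `recognise`), the dictionary used as the document and the `run_workflow` helper are hypothetical stand-ins chosen for illustration; in the framework itself the nodes are web services orchestrated by Taverna.

```python
from typing import Callable, Dict, List

# A node takes a document (here a plain dict) and returns an updated copy.
Document = Dict[str, str]
Node = Callable[[Document], Document]


def binarise(doc: Document) -> Document:
    """Stand-in for an image pre-processing step."""
    return {**doc, "image": doc["image"] + ":binarised"}


def recognise(doc: Document) -> Document:
    """Stand-in for an OCR step that reads the (pre-processed) image."""
    return {**doc, "text": "text from " + doc["image"]}


def run_workflow(nodes: List[Node], doc: Document) -> Document:
    """Pass the document through each node in order, like a pipeline."""
    for node in nodes:
        doc = node(doc)
    return doc
```

Because every node has the same signature, steps can be reordered, swapped or combined without changing the others, which is the property the slide calls integration.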
Workflow management 
 Web 2.0 style registry: myExperiment 
 Local client: Taverna Workbench 
 Web client: project website
Compute cluster 
 Enterprise Service Bus 
receives requests from 
users and distributes 
the load to the available 
worker nodes 
 Main effect: 
Process parallelization, 
Load distribution, 
Failover
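Load distribution with failover can be illustrated with a small round-robin dispatcher. This is a sketch under stated assumptions, not the Apache Synapse configuration used in the project: the `Dispatcher` class and the worker callables are hypothetical, and requests are handled one at a time rather than in parallel.

```python
from typing import Callable, List

# A worker is any callable that processes a request and may raise on failure.
Worker = Callable[[str], str]


class Dispatcher:
    """Hypothetical round-robin dispatcher that skips failing workers."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = workers
        self._next = 0  # index of the worker that receives the next request

    def submit(self, request: str) -> str:
        last_error = None
        # Try each worker at most once, starting with the next one in turn.
        for attempt in range(len(self._workers)):
            index = (self._next + attempt) % len(self._workers)
            try:
                result = self._workers[index](request)
            except Exception as error:  # failover: move on to the next worker
                last_error = error
                continue
            self._next = (index + 1) % len(self._workers)
            return result
        raise RuntimeError("all workers failed") from last_error
```

Rotating the starting index spreads requests across the workers, and the retry loop means a single broken node delays a request instead of losing it.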
Dataset 
 Access to a representative and annotated dataset of significant size, 
with metadata, ground truth and search facilities
Evaluation features 
 Text-based comparison of result with ground truth, 
using the Levenshtein distance 
 Layout-based comparison of result with ground truth, 
using the Page Analysis and Ground-truth Elements (PAGE) framework 
 Example:
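The text-based comparison can be sketched with a standard dynamic-programming implementation of the Levenshtein distance. The function names and the character error rate defined here (distance divided by ground truth length) are assumptions for illustration; the evaluation tools used in the project may normalise or weight errors differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

For instance, an OCR result of "tbe" against a ground truth of "the" has a distance of 1 (one substitution), giving an error rate of one third under this definition.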
Community 
 Web 2.0 style workflow registry 
 Community of experts 
 Sharing of resources 
 Knowledge exchange 
 A central meeting point 
for users and researchers
Summary 
Benefits: 
- Availability of resources (images, ground truth and tools) 
to the international research community 
- A common baseline for transparent evaluation and comparison 
- Sharing of results and know-how 
- Enabling new research through scalable computing 
- Consolidation of support and maintenance 
Thank you! 
Questions?
