IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Structural analysis of documents
Functional Extension Parser (FEP)

Günter Mühlberger
University Innsbruck Library (UIBK)




Agenda
      Introduction
      Features
        – What do we recognise with the structural analysis?
      Benefits
        – Why is structural analysis useful?
      Architecture
        – How does it work?
      Results
        – How good are we?
      Roadmap
        – When will it come into being?
      Business
        – Which offers will be available?




Introduction
      Document understanding platform
      Try to enhance and exploit the logical structure of documents for
        – Display
        – Navigation
        – Retrieval
      Enhance OCR output with structural metadata
        – Fully automated processing
        – Interactive correction




IMPACT EVA/MINERVA 12th Nov. 2008




Features
      General
        – We are able to recognise all structural elements which have some layout
          representation: e.g. region, size, typeface, distance to other elements, etc.
        – Focus in IMPACT: Basic features which are typical for all documents
        – The rules set can be extended or specialised for other datasets,
                      e.g. journals, dissertations, index cards, yearbooks, newspapers, etc.
        – The better the OCR, the better our structural analysis
      Basic features for books
        –     Page numbers
        –     Running titles (headers)
        –     Print space
        –     Footnotes
        –     Signature marks
        –     Headings (within the running text)
        –     Table of contents entries (in addition to headings)
        –     Front/Body/Back
        –     Paragraphs




      Print space
      Headings
      Footnotes








      Running title (header)
      Page number
      Signature mark








      Table of contents
        – (linked with headings in
          the running text or with
          page numbers)








Benefits (1)
      Display
        – A correct print space allows images to be displayed centred (no flipping
          between pages)
      Search & retrieval
        – Scoring of results
                      Could take into account structural data (headings, footnotes)
        – Noise reduction
                      Front, body and back are separated; text from the front matter is often misleading
                      Running titles repeat the same words
                      Footnotes can be included or excluded
        – Faceted search
                      Results can be displayed for running text, footnotes, headings
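
The noise reduction and faceting ideas above can be illustrated with a minimal Python sketch. The `Line` record, the label names in `NOISE_KINDS` and the sample text are assumptions made for illustration; they are not FEP's actual data model or search interface.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical structural labels; the labels used by FEP may differ.
NOISE_KINDS = {"running_title", "signature_mark", "page_number"}


@dataclass
class Line:
    page: int
    kind: str   # e.g. "running_text", "footnote", "heading"
    text: str


def search(lines, term, include_footnotes=True):
    """Return hit pages grouped by structural kind (the facets), skipping noise."""
    facets = defaultdict(list)
    for line in lines:
        if line.kind in NOISE_KINDS:
            continue  # running titles repeat the same words on every page
        if line.kind == "footnote" and not include_footnotes:
            continue
        if term.lower() in line.text.lower():
            facets[line.kind].append(line.page)
    return dict(facets)


sample = [
    Line(12, "running_title", "History of Tyrol"),
    Line(12, "heading", "Chapter 2: Tyrol in the 18th century"),
    Line(12, "running_text", "The county of Tyrol was governed from Innsbruck."),
    Line(12, "footnote", "See also the Tyrol archives."),
    Line(13, "running_title", "History of Tyrol"),
]

print(search(sample, "tyrol"))
print(search(sample, "tyrol", include_footnotes=False))
```

In this toy data the running title matches the query on every page, yet it never appears in the results, which is the kind of noise the structural labels are meant to remove.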







Benefits (2)
      Navigation
        – Page numbers allow usage of original table of contents
        – Original table of contents can be linked with headings/page numbers in
          the book
      Document editing
        –     Further mark up (e.g. TEI) is supported
        –     Manual preparation for Print-on-Demand is eased (print space)
        –     Selective OCR correction can be applied,
              e.g. only headings, running text or footnotes could be fed to CONCERT
      Document matching
        – Contributions or footnotes can be matched with existing bibliographical
          databases
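
Linking table of contents entries to headings can be sketched as follows. This is a minimal illustration, assuming TOC entries and headings are already available as simple tuples; the function names, the 0.8 threshold and the 0.1 page-number bonus are hypothetical choices, not values taken from FEP. String similarity comes from Python's standard `difflib`, which tolerates small OCR errors.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_toc(toc_entries, headings, threshold=0.8):
    """Link each (title, printed_page) TOC entry to the best-matching heading.

    headings: list of (text, printed_page, image_index) tuples.
    Returns {toc_title: image_index or None}.
    """
    links = {}
    for title, page in toc_entries:
        best, best_score = None, 0.0
        for text, h_page, image_index in headings:
            score = similarity(title, text)
            if h_page == page:
                score += 0.1  # small bonus when the printed page numbers agree
            if score > best_score:
                best, best_score = image_index, score
        links[title] = best if best_score >= threshold else None
    return links


toc = [("Introduction", 1), ("The early years", 15)]
heads = [("lntroduction", 1, 9), ("The early years.", 15, 23)]
print(link_toc(toc, heads))
```

An entry that finds no heading above the threshold maps to `None`, which corresponds to a TOC entry without a link.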






      Improved display on the
      Internet and in PDF








      Refinement of full-text
      search
      Facets for e.g.
        – Running text
        – Footnotes
        – Headings
      Less noise
        – Running titles,
          signature marks
          excluded from search








      Clickable table of contents
      entries
        – Google style
      Selective OCR correction
        – Correct only ToC,
          headings, footnotes, etc.








      Matching of documents
      with external sources
        – Match footnotes with
          library catalogues
          (bibliographies)
        – Match table of contents
          entries and headings with
          bibliographies








      Improved editing
        – Alternating print spaces
          for Print on Demand
        – Further processing for
          TEI editions etc.








Architecture
      Input
        – Results from OCR processing on word level (coordinates)
        – E.g. ALTO file, ABBYY XML file or Google HTML
      Output
        – Structural annotations for recognized text features, e.g. page numbers,
          running titles, headings, etc.
        – E.g. XML, ALTO, METS, TEI, etc.
      General workflow
        –     OCR result files are parsed (FEP general XML format)
        –     Rules set is applied to the dataset (rules are managed by rules engine)
        –     Results are stored in a database
        –     Export on various levels is provided
      Optional
        – Online or offline correction (GUI)
        – Adaptation of rules set
        – Quality assurance on basis of ground truth
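
The general workflow above (parse, apply rules, store, export) can be sketched in a few lines of Python. Everything here is a simplified assumption: the input is a namespace-free, ALTO-like fragment rather than a full ALTO file, a plain list stands in for the database, JSON stands in for the XML-based export formats, and the single rule is a toy example rather than part of FEP's rules set.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO-like fragment (an assumption for brevity).
ALTO = """
<Page WIDTH="1000" HEIGHT="1600">
  <TextLine>
    <String CONTENT="17" HPOS="480" VPOS="40" WIDTH="40" HEIGHT="30"/>
  </TextLine>
  <TextLine>
    <String CONTENT="Structural" HPOS="100" VPOS="120" WIDTH="200" HEIGHT="30"/>
    <String CONTENT="analysis" HPOS="320" VPOS="120" WIDTH="160" HEIGHT="30"/>
  </TextLine>
</Page>
"""


def parse(xml_text):
    """Step 1: read word-level OCR results (text plus coordinates)."""
    page = ET.fromstring(xml_text.strip())
    width = float(page.get("WIDTH"))
    lines = []
    for text_line in page.iter("TextLine"):
        words = [
            {"text": s.get("CONTENT"), "x": float(s.get("HPOS")),
             "y": float(s.get("VPOS")), "w": float(s.get("WIDTH"))}
            for s in text_line.iter("String")
        ]
        lines.append({"words": words, "label": "running_text"})
    return width, lines


def apply_rules(width, lines):
    """Step 2: one toy rule, a lone centred numeral in the first line."""
    first = lines[0]["words"] if lines else []
    if len(first) == 1 and first[0]["text"].isdecimal():
        centre = first[0]["x"] + first[0]["w"] / 2
        if abs(centre - width / 2) < 0.1 * width:
            lines[0]["label"] = "page_number"
    return lines


def export(lines):
    """Step 4: export the annotations (JSON as a stand-in format)."""
    return json.dumps(
        [{"label": ln["label"],
          "text": " ".join(w["text"] for w in ln["words"])}
         for ln in lines]
    )


width, lines = parse(ALTO)
store = apply_rules(width, lines)   # Step 3: the list stands in for the database
print(export(store))
```

Keeping parsing, rule application and export as separate steps mirrors the slide: other OCR input formats or export formats would only touch the first or last function.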





The FEP Core
      Based on an expert-system-like rule engine for Java (Jess)
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty

Typical rules:
      IF there is a numeral in the first line of the page AND this numeral is centred
      THEN this numeral may be the page number.
      IF there is a numeral in the first line of the page AND this numeral is at the
      right-hand side of the page AND this numeral is an odd number
      THEN this numeral may be the page number.
      IF there is a numeral in the first line of the page AND this numeral is at the
      left-hand side of the page AND this numeral is an even number
      THEN this numeral may be the page number.
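
The three page-number rules can be approximated in Python as a rough sketch. This is not Jess syntax and not FEP's actual rules set: the triangular membership function, the ideal positions (0.1, 0.5, 0.9 of the page width) and the 0.3 tolerance are assumptions chosen only to show how "may be the page number" can become a graded confidence instead of a yes/no decision.

```python
def position_membership(x_centre, page_width, target):
    """Fuzzy degree (0..1) to which a token sits at 'left', 'centre' or 'right'."""
    rel = x_centre / page_width          # 0.0 = left edge, 1.0 = right edge
    ideal = {"left": 0.1, "centre": 0.5, "right": 0.9}[target]
    # Triangular membership: 1 at the ideal position, 0 beyond 0.3 page widths.
    return max(0.0, 1.0 - abs(rel - ideal) / 0.3)


def page_number_confidence(first_line_tokens, page_width):
    """Apply the three rules to the first line; return (numeral, confidence).

    first_line_tokens: list of (text, x_centre) pairs for the first line.
    """
    best = (None, 0.0)
    for text, x_centre in first_line_tokens:
        if not text.isdecimal():
            continue                      # the rules only fire for numerals
        n = int(text)
        scores = [position_membership(x_centre, page_width, "centre")]
        if n % 2 == 1:                    # odd numbers are expected on the right
            scores.append(position_membership(x_centre, page_width, "right"))
        else:                             # even numbers are expected on the left
            scores.append(position_membership(x_centre, page_width, "left"))
        confidence = max(scores)          # fuzzy OR across the rules
        if confidence > best[1]:
            best = (n, confidence)
    return best


print(page_number_confidence([("CHAPTER", 300.0), ("17", 900.0)], 1000.0))
print(page_number_confidence([("18", 900.0)], 1000.0))
```

In the second call an even numeral sits on the right-hand side, so none of the three rules supports it and the confidence stays at zero.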





Results
      Basic rules set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book contains 10 heading lines. We find 12 lines; 8 of them are
          correct, 4 are false.
        – Recall                = 8 of 10                   = 0.8
        – Precision             = 8 of 12                   = 0.67
        – F-Measure             = 2*0.8*0.67/(0.8+0.67)     = 0.73
      More explanations
        – Important: We are counting lines, not structural items!
                      E.g. if a heading consists of two lines (often in different type sizes), we have
                      to find both to succeed
        – Differences between the training and evaluation sets are marginal
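
The line-based measures can be reproduced with a short Python sketch. The function name and the representation of lines as integer identifiers are assumptions for illustration; the numbers replay the heading example from this slide (10 true heading lines, 12 lines found, 8 of them correct).

```python
def line_metrics(true_lines, found_lines):
    """Recall, precision and F-measure over sets of line identifiers."""
    true_set, found_set = set(true_lines), set(found_lines)
    correct = len(true_set & found_set)
    recall = correct / len(true_set) if true_set else 0.0
    precision = correct / len(found_set) if found_set else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


truth = set(range(1, 11))                       # 10 heading lines in the book
found = set(range(3, 11)) | {50, 51, 52, 53}    # 8 correct lines + 4 false ones
r, p, f = line_metrics(truth, found)
print(f"recall={r:.2f} precision={p:.2f} f-measure={f:.2f}")
```

Computed without rounding the intermediate precision, the F-measure for this example comes out at about 0.73.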





  Results on Evaluation Set

                     Recall    Precision    F-measure
Running text          0.99       0.98         0.98
Footnotes             0.83       0.89         0.86
Page numbers          0.97       1.00         0.98
Running titles        0.97       1.00         0.98
Headings              0.85       0.80         0.82
Signature marks       0.68       0.89         0.77




Roadmap
      Summer 2011: Beta version
        – Integration into IMPACT Interoperability Platform
        – Basic rules set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation








Business offers
      Web service for processing single volumes and correction
        – Will be integrated into the eBooks-on-Demand (EOD) network
        – 30 libraries are already uploading their images to the OCR server in
          Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of rules set
        – For specific datasets much more can be detected than just the basic
          features
        – E.g. journals with a fixed structure over many years or parliamentary
          papers, dissertations, research papers, etc.
      On-site installations
        – Not our focus, but could be done for very large datasets or due to legal
          requirements (e.g. Google images)





Results: TOC

      25 TOC entries in total
      22 TOC entries are completely correct
      1 TOC entry was missed
      2 TOC entries are grouped incorrectly
      1 TOC entry has no link
      1 TOC entry has a wrong link








             Thank you for your attention!




Structural analysis of documents Functional Extension Parser (FEP). Günter Mühlberger

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Structural analysis of documents Functional Extension Parser (FEP) Günter Mühlberger University Innsbruck Library (UIBK)
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Agenda Introduction Features – What do we recognise with the structural analysis? Benefits – Why is structural analysis useful? Architecture – How does it work? Results – How good are we? Roadmap – When will it come into being? Business – Which offers will be available? 2
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Introduction Document understanding platform Try to enhance and exploit the logical structure of documents for – Display – Navigation – Retrieval Enhance OCR output with structural metadata – Fully automated processing – Interactive correction IMPACT EVA/MINERVA 12th Nov. 2008 3
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Features General – We are able to recognise all structural elements which have some layout representation: e.g. region, size, typeface, distance to other elements, etc. – Focus in IMPACT: Basic features which are typical for all documents – Rules set can be extended or specified according to other datasets E.g. journals, dissertations, index cards, yearbooks, newspapers, etc. – The better the OCR, the better our structural analysis Basic features for books – Page numbers – Running titles (headers) – Print space – Footnotes – Signature marks – Headings (within the running text) – Table of contents entries (additional to headings) – Front/Body/Back – Paragraphs 4
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Print space Headings Footnotes 5
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Running title (header) Page number Signature mark 6
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Table of contents – (linked with headings in the running text, respectively page numbers) 7
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Benefits (1) Display – Correct print space allows to display images centred (no flipping between pages) Search & retrieval – Scoring of results Could take into account structural data (headings, footnotes) – Noise reduction Front, body, back are separated, text from the front is often misleading Running titles repeat the same words Footnotes can be included or excluded – Facetted search Results can be displayed for running text, footnotes, headings 8
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Benefits (2) Navigation – Page numbers allow usage of original table of contents – Original table of contents can be linked with headings/page numbers in the book Document editing – Further mark up (e.g. TEI) is supported – Manual preparation for Print-on-Demand is eased (print space) – Selective OCR correction can be applied: – E.g. only headings, running text, footnotes could be fed to CONCERT Document matching – Contributions or footnotes can be matched with existing bibliographical databases 9
  • 10. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Improved display in the Internet and PDF 10
  • 11. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Refinement of full-text search Facets for e.g. – Running text – Footnotes – Headings Less noise – Running titles, signature marks excluded from search 11
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Clickable table of contents entries – Google style Selective OCR correction – Correct only ToC, headings, footnotes, etc. 12
  • 13. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Matching of documents with external sources – Match footnotes with library catalogues (bibliographies)Clickable table of content – Match table of contents entries and headings with bibliographies 13
  • 14. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Improved editing – Alternating print spaces for Print on Demand – Further processing for TEI editions etc. 14
  • 15. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Architecture Input – Results from OCR processing on word level (coordinates) – E.g. ALTO file, ABBYY XML file or Google HTML Output – Structural annotations for recognized text features, e.g. page numbers, running titles, headings, etc. – E.g. XML, ALTO, METS, TEI, etc. General workflow – OCR result files are parsed (FEP general XML format) – Rules set is applied to the dataset (rules are managed by rules engine) – Results are stored in a database – Export on various levels is provided Optional – Online or offline correction (GUI) – Adaptation of rules set – Quality assurance on basis of ground truth 15
  • 17. The FEP Core
      Based on Jess, an expert-system-like rule engine for Java
      Both manually crafted rules and rules obtained by machine learning
      Uses fuzzy logic to deal with uncertainty
      Typical rules:
        – IF there is a numeral in the first line of the page
          AND this numeral is centred
          THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page
          AND this numeral is at the right-hand side of the page
          AND this numeral is an odd number
          THEN this numeral may be the page number
        – IF there is a numeral in the first line of the page
          AND this numeral is at the left-hand side of the page
          AND this numeral is an even number
          THEN this numeral may be the page number
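The three typical page-number rules can be written down as a small sketch. It uses plain Python rather than Jess and is not the FEP rule set: the `Token` type, the division of the page into thirds and the rule labels are assumptions made for illustration, and the fuzzy weighting mentioned on the slide is left out, so each match only marks a candidate ("may be the page number").

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x_center: float  # horizontal centre, 0.0 = left page edge, 1.0 = right page edge
    line_index: int  # 0 = first line of the page


def horizontal_position(token):
    """Assumed thresholds: the page is split into three equal vertical bands."""
    if token.x_center < 1 / 3:
        return "left"
    if token.x_center > 2 / 3:
        return "right"
    return "centre"


def page_number_candidates(tokens):
    """Apply the three slide rules and return (text, rule label) pairs."""
    candidates = []
    for token in tokens:
        # Shared premise: a numeral in the first line of the page.
        if token.line_index != 0 or not token.text.isdecimal():
            continue
        number = int(token.text)
        position = horizontal_position(token)
        if position == "centre":
            candidates.append((token.text, "centred"))
        elif position == "right" and number % 2 == 1:
            candidates.append((token.text, "right-odd"))
        elif position == "left" and number % 2 == 0:
            candidates.append((token.text, "left-even"))
    return candidates
```

For example, `page_number_candidates([Token("17", 0.9, 0)])` yields a candidate under the right-hand/odd rule, while the same numeral on the left-hand side matches no rule.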
  • 18. Results
      Basic rule set
        – General features for books from 1700 to 2000
        – Dataset of 155 books, 30,673 pages (141 training set, 41 evaluation set)
        – All books were manually annotated (ground truth)
      Recall, Precision, F-Measure
        – E.g. a book contains 10 heading lines. We find 12 lines; 8 of them are correct, 4 are false.
        – Recall = 8 of 10 = 0.80
        – Precision = 8 of 12 = 0.67
        – F-Measure = 2 × 0.80 × 0.67 / (0.80 + 0.67) = 0.73
      More explanations
        – Important: We are counting lines, not structural items! E.g. if a heading consists of two lines (often in different type sizes), we have to find both to succeed
        – Differences between training and evaluation sets are marginal
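The heading example from the Results slide (10 heading lines in the book, 12 lines found, 8 of them correct) can be checked with a few lines of code. The function name `line_scores` and its parameters are chosen for this sketch only.

```python
def line_scores(ground_truth_lines, found_lines, correct_lines):
    """Line-based recall, precision and F-measure for one structural feature."""
    recall = correct_lines / ground_truth_lines if ground_truth_lines else 0.0
    precision = correct_lines / found_lines if found_lines else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Heading example: 10 heading lines exist, 12 were found, 8 of those are correct.
recall, precision, f_measure = line_scores(10, 12, 8)
```

With unrounded inputs the F-measure is 8/11, which rounds to 0.73.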
  • 19. Results on Evaluation Set
                           Recall   Precision   F-measure
      Running text          0.99      0.98        0.98
      Footnotes             0.83      0.89        0.86
      Page numbers          0.97      1.00        0.98
      Running titles        0.97      1.00        0.98
      Heading               0.85      0.80        0.82
      Signature marks       0.68      0.89        0.77
  • 20. Roadmap
      Summer 2011: Beta version
        – Integration into the IMPACT Interoperability Platform
        – Basic rule set: books from 1700 to 1900
      End of the year: Version 1.0
        – Full-featured version
        – Enhanced online correction interface
        – FEP as a service, not as a product for local installation
  • 21. Business offers
      Web service for processing and correcting single volumes
        – Will be integrated into the eBooks-on-Demand (EOD) network
        – 30 libraries are already uploading their images to the OCR server in Innsbruck
        – FEP will be an additional service for general material
        – Similar offers can be made to other libraries or networks as well
      Adaptation of the rule set
        – For specific datasets much more can be detected than just the basic features
        – E.g. journals with a fixed structure over many years, parliamentary papers, dissertations, research papers, etc.
      Onsite installations
        – Not our focus, but could be done for very large datasets or due to legal requirements (e.g. Google images)
  • 26. Results: TOC
        – 25 TOC entries in total
        – 22 TOC entries are completely correct
        – 1 TOC entry was missed
        – 2 TOC entries are grouped incorrectly
        – 1 TOC entry has no link
        – 1 TOC entry has a wrong link
  • 27. Thank you for your attention!