How to create
a corpus of
machine-readable texts:
challenges and solutions
What is OCR and how does it work?
Definition of OCR according to the Oxford
Dictionary of Computer Science, p. 379:
„OCR = optical character recognition; a
process in which a machine scans,
recognizes, and encodes information
printed or typed in alphanumerical
characters. (…) OCR software is now
readily available for many low-cost
scanners giving good recognition rates for
printed material using the Latin
alphabet. The more difficult problems
posed by other character sets and
handwriting are areas of ongoing
research.“
When was OCR software invented?
Ca. 1955: early OCR
devices recognised only a
limited set of characters in
a machine-optimised font
Mid-1970s: OCR A font and
OCR B font (similar to
normal letterpress
appearance)
Do we encounter OCR in everyday life?
High accuracy rates have popularised OCR in the following areas:
• banking (machines „reading“ paper cheques and transfer forms)
• public administration
• healthcare (e.g. machine-readable prescriptions)
NOTE:
In cases where absolute perfection is needed,
OCR A and OCR B fonts are still used.
If sensitive information is handled, OCR
technology can be combined with the so-called
MICR technology (magnetic-ink character
recognition) checking the legitimacy or
originality of paper documents.
Are humanities tools using OCR?
Google Books: full-text search +
highlighting of text results
HathiTrust full-text view
What are OCR problems in historical research?
„Hannoverisches Magazin”, 1776
Best historical OCR results:
texts in standardised formats (e.g. periodicals)
Improving results for minority languages and old fonts –
an ongoing challenge
Recent innovation: merging OCR and handwriting
recognition technologies (HWR/HTR)
“Handwriting recognition (HWR), also known as Handwritten Text
Recognition (HTR), is the ability of a computer to receive and interpret
intelligible handwritten input from sources such as paper documents,
photographs, touch-screens and other devices. The image of the written
text may be sensed "off line" from a piece of paper by optical scanning
(optical character recognition) or intelligent word recognition. Alternatively,
the movements of the pen tip may be sensed "on line", for example by a
pen-based computer screen surface, a generally easier task as there are
more clues available. A handwriting recognition system handles formatting,
performs correct segmentation into characters, and finds the most plausible
words.”
Wikipedia.org
The machine learning revolution in OCR
How does machine learning work?
Cf. Stanford OCR pipeline:
• text detection (layout recognition)
• character segmentation (using
„sliding window“ technique)
• character classification
• spell correction
(http://doremi2016.logdown.com/posts/2017/01/20/standford-machine-learning-photo-ocr-machine-learning-pipeline)
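The "sliding window" step above can be illustrated with a minimal sketch. It assumes a text line is already available as a small grid of 0/1 pixels, and it replaces the trained classifier of a real pipeline with a hypothetical rule (`is_gap`) that treats an all-blank window as a gap between characters. All names here are illustrative and not taken from the cited pipeline.

```python
def sliding_windows(line_image, window_width, step):
    """Yield (x_offset, window) pairs cut from a text-line image.

    line_image is a list of pixel rows (0 = background, 1 = ink).
    """
    line_width = len(line_image[0])
    for x in range(0, line_width - window_width + 1, step):
        yield x, [row[x:x + window_width] for row in line_image]


def is_gap(window):
    """Hypothetical stand-in for a trained classifier:
    a window with no ink at all counts as a gap between characters."""
    return all(pixel == 0 for row in window for pixel in row)


# Toy 3 x 9 "text line": two ink blobs separated by two blank columns.
line = [
    [1, 1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 1, 0],
]

# Slide a 2-pixel-wide window one pixel at a time and keep the gap positions.
gap_offsets = [x for x, window in sliding_windows(line, window_width=2, step=1)
               if is_gap(window)]
print(gap_offsets)
```

In a machine-learning pipeline the rule in `is_gap` would be a classifier trained on labelled examples of "split" and "no split" windows; the windowing loop itself stays the same.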
New OCR tools based on machine learning
E.g. OCR-D project
in Germany:
• improved visual
character
recognition
• context analysis of
n-grams
• trainer feedback to
exclude potential
mistakes
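To illustrate what "context analysis of n-grams" can mean in practice, here is a minimal sketch using character bigrams. It is not OCR-D's implementation. The tiny corpus, the function names and the scoring rule (share of a word's bigrams already seen in the corpus) are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of a text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus, n=2):
    """Count all character n-grams in a list of known-good text lines."""
    counts = Counter()
    for line in corpus:
        counts.update(char_ngrams(line, n))
    return counts


def plausibility(word, counts, n=2):
    """Share of the word's n-grams that occur in the training corpus."""
    grams = char_ngrams(word, n)
    return sum(1 for gram in grams if gram in counts) / len(grams)


# Illustrative training data; a real model would use a large corpus.
counts = train(["the weather is mild", "they gather there"])

# Two competing readings of the same word image.
candidates = ["tbe", "the"]
best = max(candidates, key=lambda word: plausibility(word, counts))
print(best)
```

The reading whose letter sequences look more like the training language wins, which is how context can overrule a visually plausible but linguistically unlikely character.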
Current range of OCR-tools for researchers
• Transkribus.eu (free of charge, cloud-based, each user contributes training data to the
community)
• OCR4all (free command-line OCR software for desktop-installation, difficult set-up,
does not run smoothly on Windows)
• KRAKEN (Python package for OCR, usage not monitored, data do not need to be
shared with developers or other users)
• ABBYY FineReader (one of the most popular proprietary OCR tools)
• Tesseract (originally developed as proprietary software at Hewlett Packard labs in
England and the US, released as open source in 2005, supported by Google since 2006,
available for Linux as well as Windows and Mac OS X, high pre-processing
requirements)
• PICCL/TICCL (free corpus building and corpus clean-up system performing spelling
correction and OCR post-correction, developed for Linux, requires a virtual machine
on Windows)
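"Pre-processing" in the list above typically includes steps such as binarisation, i.e. turning a grayscale scan into pure black and white before recognition. The following is a minimal sketch with a fixed global threshold, written in plain Python on a toy pixel grid. The function name and the threshold of 128 are assumptions; real tools usually choose the threshold adaptively.

```python
def binarise(gray_image, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (ink) or 255 (paper)."""
    return [[0 if pixel < threshold else 255 for pixel in row]
            for row in gray_image]


# Toy 3 x 4 scan: dark values are ink, light values are yellowed paper.
scan = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
    [245, 40, 238, 250],
]

clean = binarise(scan)
for row in clean:
    print(row)
```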
GBV-Verbund: Intranda OCR Service
And the development continues…
PROs and CONs of open-source OCR software:
CONs:
• takes up a lot of storage space
• difficult installation
• often limited performance on
Windows and Mac
• usually requires command-line
operation (no GUI)
• training your own models can be
time-consuming
• copyright issues if software
provider requires you to ingest
your (training) data into a public
pool
PROs:
• flexible integration of historical
texts in different digital formats
• adaptable to multiple languages
and new fonts / page layouts
Photo by Luca Bravo on Unsplash
How reliable are current OCR tools?
Results of an OCR test using a US
driver's licence, published on September 18,
2019:
https://mobidev.biz/blog/ocr-machine-learning-implementation
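One common way to express OCR reliability as a number is the character error rate (CER): the edit distance between a correct reference transcription and the OCR output, divided by the length of the reference. The sketch below is a minimal illustration. It is not necessarily the metric used in the test cited above, and the sample strings are invented.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Invented example: a long s in an old font misread as "f".
reference = "Hannoverisches Magazin"
ocr_output = "Hannoverifches Magazin"

cer = character_error_rate(reference, ocr_output)
print(f"CER: {cer:.3f}")  # 1 wrong character out of 22
```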
How can we integrate OCR into our own workflow?
Original manuscript or print:
• Export scan as PDF or image files to perform OCR!
• Export scan as PDF or image file to let humans extract metadata and transcribe the text!
• Use transcriptions to train OCR software and improve results for similar sources (e.g. issues of the same newspaper)!
• Analyse non-coded plain text with topic modelling or stylometry tools not requiring structured data!
• Code information (e.g. in XML/TEI or JSON) and use software to analyse networks between specific tagged entities or visualise geographic data!
• Perform quantitative analysis on more texts in less time and generate more reliable results!
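As a minimal sketch of the "code information" step, the example below tags one place name in a line of recognised text, once as a TEI-style XML paragraph and once as a JSON record. The sample sentence and the function name are invented for illustration. `placeName` is a TEI element, but the fragment is not a complete TEI document (it has no header, for instance).

```python
import json
import xml.etree.ElementTree as ET


def to_tei_paragraph(text, place):
    """Wrap text in <p> and tag the first occurrence of place with <placeName>."""
    before, found, after = text.partition(place)
    paragraph = ET.Element("p")
    paragraph.text = before
    if found:
        tag = ET.SubElement(paragraph, "placeName")
        tag.text = place
        tag.tail = after
    return paragraph


# Invented sample line standing in for OCR output.
text = "Printed in Hannover in 1776."
place = "Hannover"

# Variant 1: inline markup in XML.
xml_fragment = ET.tostring(to_tei_paragraph(text, place), encoding="unicode")
print(xml_fragment)

# Variant 2: stand-off annotation in JSON (entity plus character offset).
record = {
    "text": text,
    "entities": [{"type": "place", "value": place, "start": text.find(place)}],
}
print(json.dumps(record, ensure_ascii=False))
```

Once entities are tagged in either form, network or mapping software can read the places, persons or dates directly instead of searching the raw text.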
Testing a cloud-based OCR tool: transkribus.eu

