A-I-PoCoTo — Combining Automated and Interactive
OCR Postcorrection
Tobias Englmeier, Florian Fink and Klaus U. Schulz
9. May 2019
Florian Fink (CIS) A-I-PoCoTo 9. May 2019 1 / 16
Overview
Automatic post-correction (A-PoCoTo)
Evaluation results
Automatic interactive post-correction (A-I-PoCoTo)
Summary
A-PoCoTo
Automatic post-correction of OCR-results of historical documents using
supervised machine learning.
Multiple OCRs (OCR1, OCR2, . . . , OCRn) can be used
3 steps with two profiling rounds
3 classifiers for 1, 2, . . . , n OCRs
Classifiers are trained using logistic regression
Developed as a module of the OCR-D project [1]
[1] http://www.ocr-d.de/
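The slide only states that the classifiers are trained with logistic regression. As a rough illustration of what that means, the following is a minimal sketch in plain Python; the two features, the toy data, the learning rate and the number of epochs are all assumptions made for this example and are not taken from A-PoCoTo.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that feature vector x belongs to the class True."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

# Hypothetical toy data: (candidate weight, fraction of agreeing OCRs)
xs = [[0.9, 1.0], [0.8, 1.0], [0.7, 0.5], [0.2, 0.0], [0.1, 0.5], [0.3, 0.0]]
ys = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(xs, ys)
print(round(predict_proba(w, b, [0.9, 1.0]), 3))
```

The output of `predict_proba` is a probability in [0, 1]; a real system would typically use a library implementation rather than a hand-written training loop.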
PoCoTo
PoCoTo (Post-Correction Tool) is a tool for manual, interactive
post-correction of OCRed historical documents.
Initially a desktop application (2014) [2]
New version as a web application (2017)
Profiling is used for error detection and correction suggestions
Batch correction of (error) patterns
[2] Vobl, T., Gotscharek, A., Reffle, U., Ringlstetter, C., & Schulz, K. U. (2014, May).
PoCoTo-an open source system for efficient interactive postcorrection of OCRed
historical texts. In Proceedings of the First International Conference on Digital Access
to Textual Cultural Heritage (pp. 57-61). ACM.
Profile (global)
Given an OCRed historical text, profiling derives a ‘statistical picture’
(an estimate) of the language in the document using various background lexica [3]:
OCR errors and OCR error series
Historical patterns of the form mod → hist (t → th, ei → ey, ...)
Underlying modern words
The profile is used as a feature generator for the automatic
post-correction system
[3] Reffle, U., & Ringlstetter, C. (2013). Unsupervised profiling of OCRed historical
documents. Pattern Recognition, 46(5), 1346-1357.
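To make the mod → hist patterns concrete, here is a small sketch that enumerates the historical spellings reachable from a modern word by optionally applying each pattern at each position. The function name and the example word are assumptions for illustration; the actual profiler works with weighted patterns and lexica and is not reproduced here.

```python
def historical_variants(modern, patterns):
    """Return all spellings obtained by applying any subset of the
    matching pattern occurrences (mod -> hist) to the modern word."""
    results = set()

    def expand(pos, prefix):
        if pos == len(modern):
            results.add(prefix)
            return
        # Option 1: keep the current character unchanged.
        expand(pos + 1, prefix + modern[pos])
        # Option 2: rewrite a pattern's modern side that starts here.
        for mod, hist in patterns:
            if modern.startswith(mod, pos):
                expand(pos + len(mod), prefix + hist)

    expand(0, "")
    return results

# The two patterns named on the slide.
patterns = [("t", "th"), ("ei", "ey")]
variants = historical_variants("teil", patterns)
print(sorted(variants))
```

The unchanged word is part of the result, which corresponds to the case where no historical pattern is applied.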
Profile (local)
Profiling associates with each token $w_{ocr}$ of a document a set of
interpretations $w_{mod,cand} \xrightarrow{\alpha} w_{hist,cand} \xrightarrow{\beta} w_{ocr}$.
The α (historical patterns) and β (OCR errors) channels can be empty
Interpretations have a weight
Each $w_{ocr}$ has a ranked set of interpretations $w_{hist,cand}$
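One way to picture an interpretation is as a small record holding the modern candidate, the historical candidate, the OCR token, the two channels and a weight. The sketch below assumes this layout; the field names, the example token and all weights are hypothetical and only illustrate the ranking by weight.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interpretation:
    modern: str          # modern candidate (w_mod,cand)
    historical: str      # historical candidate (w_hist,cand)
    ocr: str             # observed OCR token (w_ocr)
    alpha: tuple = ()    # historical patterns applied; may be empty
    beta: tuple = ()     # OCR errors applied; may be empty
    weight: float = 0.0  # weight assigned by the profiler

def rank(interpretations):
    """Order interpretations from highest to lowest weight."""
    return sorted(interpretations, key=lambda i: i.weight, reverse=True)

# Hypothetical interpretations of the OCR token "tbeil".
interpretations = [
    Interpretation("beil", "beil", "tbeil", beta=("insert t",), weight=0.10),
    Interpretation("teil", "theil", "tbeil",
                   alpha=("t->th",), beta=("h->b",), weight=0.70),
    Interpretation("teil", "teil", "tbeil", beta=("insert b",), weight=0.20),
]
ranked = rank(interpretations)
print([i.historical for i in ranked])
```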
A-PoCoTo
[Figure: processing pipeline. Inputs OCR1, OCR2, ..., OCRn → Alignment → Profiling → Lexicon Extension → Profiling → Ranking → Decision; the diagram also carries the label A-I-PoCoTo.]
Multiple OCRs
One master-OCR
Additional support-OCRs (optional)
OCRs are token-wise aligned with the master-OCR
Each $w_{ocr}$ has n − 1 additional OCR tokens $w_{ocr_2}, w_{ocr_3}, \ldots, w_{ocr_n}$
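The slides do not say how the token-wise alignment is computed. As one possible illustration, the sketch below aligns a support OCR to the master OCR with `difflib.SequenceMatcher` from the Python standard library; this is a swapped-in technique chosen for brevity, not the alignment used in A-PoCoTo, and the example tokens are invented.

```python
from difflib import SequenceMatcher

def align_to_master(master, support):
    """Return one support token (or None) for every master token."""
    aligned = [None] * len(master)
    matcher = SequenceMatcher(a=master, b=support, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            # Pair tokens position by position; surplus tokens on
            # either side of the block stay unaligned.
            for k in range(min(i2 - i1, j2 - j1)):
                aligned[i1 + k] = support[j1 + k]
        # "delete": master tokens without counterpart keep None.
        # "insert": extra support tokens are dropped.
    return aligned

master = ["Die", "Grentzboten", "find", "da"]
support = ["Die", "Grenzboten", "sind", "da"]
print(align_to_master(master, support))
```

With n OCRs, running this once per support OCR yields the n − 1 additional tokens for each master token.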
A-PoCoTo — Lexicon Extension step
In the Lexicon Extension step, a classifier tries to find good $w_{ocr}$ with which to extend
the profiler’s back-end resources.
Classification starts after the first profiling round
Only $w_{ocr}$ with a non-empty α or β channel are considered
Set of features for each $w_{ocr}$ (token shape, candidate set, unigram
frequencies, agreeing OCRs, ...)
Each $w_{ocr}$ is classified as True or False
True tokens are put into the extended lexicon for the second profiling
round
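The Lexicon Extension step described above can be sketched as a filter, a feature function and a classifier call. Everything specific in this sketch is an assumption: the dictionary layout of a token, the concrete features, and the rule-based `stand_in_classifier`, which merely takes the place of the trained logistic-regression model.

```python
def features(tok):
    """Hypothetical features in the spirit of the slide's list."""
    word, support = tok["ocr"], tok["support"]
    agreement = sum(s == word for s in support) / len(support) if support else 0.0
    return {
        "is_alpha": float(word.isalpha()),          # token shape
        "n_candidates": float(len(tok["candidates"])),
        "unigram_freq": float(tok["freq"]),
        "ocr_agreement": agreement,                  # agreeing OCRs
    }

def stand_in_classifier(f):
    """Placeholder rule standing in for the trained model."""
    return f["is_alpha"] == 1.0 and f["ocr_agreement"] >= 0.5 and f["unigram_freq"] >= 3

def extend_lexicon(tokens, lexicon, classify):
    extended = set(lexicon)
    for tok in tokens:
        if not (tok["alpha"] or tok["beta"]):
            continue  # only tokens with a non-empty channel are considered
        if classify(features(tok)):
            extended.add(tok["ocr"])
    return extended

tokens = [
    {"ocr": "vnnd", "alpha": ["u->v", "n->nn"], "beta": [],
     "candidates": ["und"], "freq": 12, "support": ["vnnd"]},
    {"ocr": "tbeil", "alpha": ["t->th"], "beta": ["h->b"],
     "candidates": ["theil", "teil"], "freq": 1, "support": ["theil"]},
    {"ocr": "der", "alpha": [], "beta": [],
     "candidates": ["der"], "freq": 40, "support": ["der"]},
]
extended = extend_lexicon(tokens, {"der"}, stand_in_classifier)
print(sorted(extended))
```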
A-PoCoTo — Ranking step
In the Ranking step, the profiler’s candidates are re-ranked.
Classification starts after the second profiling round
All $w_{hist,cand}$ for each $w_{ocr}$ are considered
Set of features for each $w_{hist,cand}$ (token shape, candidate unigram
frequencies, agreeing OCRs, ...)
The classifier classifies each $w_{hist,cand}$ as True or False
Candidates are re-ranked using the classifier’s confidence values,
which lie in [−1, 1]
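The re-ranking can be illustrated with a few lines of Python. The slides do not say how the confidence in [−1, 1] is derived; the sketch assumes the simple mapping 2p − 1 from the classifier's probability p for the class True, and the candidate probabilities are invented.

```python
def to_confidence(p_true):
    """Assumed mapping from a probability in [0, 1] to [-1, 1]."""
    return 2.0 * p_true - 1.0

def rerank(candidates, p_true):
    """Sort candidates by classifier confidence, best first."""
    scored = [(cand, to_confidence(p_true[cand])) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical profiler order and classifier probabilities
# for the candidates of one OCR token.
profiler_order = ["teil", "theil", "beil"]
probabilities = {"teil": 0.35, "theil": 0.9, "beil": 0.05}
reranked = rerank(profiler_order, probabilities)
print([cand for cand, _ in reranked])
```

In this toy case the classifier moves "theil" ahead of the profiler's first choice "teil".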
A-PoCoTo — Decision step
In the Decision step, a classifier decides whether the best-ranked candidate for
a $w_{ocr}$ should be used as a correction for that $w_{ocr}$.
The re-ranked candidate set of each $w_{ocr}$ is considered
The confidence of the highest-ranked candidate and its distance to the next candidate are
the features
The classifier classifies the highest-ranked candidate as True or False
True candidates are used to correct the corresponding $w_{ocr}$
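The two features of the Decision step are easy to compute from a re-ranked candidate list. In the sketch below, the thresholds in `decide` are a hypothetical stand-in for the trained classifier, and treating a missing second candidate as having confidence −1 is likewise an assumption.

```python
def decision_features(reranked):
    """Return (confidence of top candidate, distance to next candidate)."""
    top_conf = reranked[0][1]
    # Assumption: with a single candidate, the "next" confidence is -1.
    next_conf = reranked[1][1] if len(reranked) > 1 else -1.0
    return top_conf, top_conf - next_conf

def decide(reranked, min_conf=0.5, min_gap=0.2):
    """Placeholder rule standing in for the trained decision classifier."""
    top_conf, gap = decision_features(reranked)
    return top_conf >= min_conf and gap >= min_gap

def correct(ocr_token, reranked):
    """Replace the OCR token only if the decision is True."""
    return reranked[0][0] if decide(reranked) else ocr_token

clear_case = [("theil", 0.8), ("teil", -0.3)]
close_case = [("theil", 0.4), ("teil", 0.35)]
print(correct("tbeil", clear_case), correct("tbeil", close_case))
```

Raising or lowering the thresholds in this toy rule mirrors the trade-off between wrong corrections and missed ones.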
A-PoCoTo — Evaluation results
Post-correction model trained on OCR-D ground truth [4]
Documents from the 16th to the 19th century
574 pages from 90 documents (3-6 pages per doc.)
A separate profile for each document
Two documents evaluated:
‘1557, Bodenstein, WieSichMeniglich’ (20 pages)
‘1841, Die Grenzboten’ (50 pages)
Four experiments:
1LE (only master OCR)
1noLE (only master OCR, LE step omitted)
2LE (one additional support OCR)
2noLE (one additional support OCR, LE step omitted)
[4] http://www.ocr-d.de/gt
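The four experiment names follow a simple scheme: the number of OCRs used, followed by whether the Lexicon Extension (LE) step is run. A short sketch of that grid, with hypothetical dictionary keys:

```python
from itertools import product

# Number of OCRs (master only, or master plus one support OCR)
# crossed with running or omitting the Lexicon Extension step.
experiments = [
    {
        "name": f"{n_ocrs}{'LE' if use_le else 'noLE'}",
        "n_ocrs": n_ocrs,
        "lexicon_extension": use_le,
    }
    for n_ocrs, use_le in product((1, 2), (True, False))
]
print([e["name"] for e in experiments])
```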
A-PoCoTo — Evaluation results
2noLE provided the best improvement in accuracy:
‘1557, Bodenstein, WieSichMeniglich’:
OCR word accuracy: 65.63% → 69.81%
‘1841, Die Grenzboten’:
OCR word accuracy: 77.57% → 80.63%
Lexicon Extension offers no benefit
The Ranking step helps to find the best correction candidate
Support OCRs offer improvements (if not combined with the LE step)
Too many lost chances in both documents → the Decision step is too
hesitant with corrections
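The reported accuracies can be put in perspective with a little arithmetic. The sketch below computes the gain in percentage points and the relative reduction of word errors from the figures on the slide; the relative error reduction is a derived quantity that the slides themselves do not report.

```python
def accuracy_gain(before, after):
    """Return (gain in percentage points, relative word-error reduction in %)."""
    gain = after - before
    error_reduction = gain / (100.0 - before) * 100.0
    return gain, error_reduction

# Word accuracies (in %) for the 2noLE experiment, as given on the slide.
results = {
    "1557, Bodenstein, WieSichMeniglich": (65.63, 69.81),
    "1841, Die Grenzboten": (77.57, 80.63),
}
for name, (before, after) in results.items():
    gain, reduction = accuracy_gain(before, after)
    print(f"{name}: +{gain:.2f} points, {reduction:.2f}% fewer word errors")
```

The gains are 4.18 and 3.06 percentage points, which corresponds to removing roughly 12% and 14% of the word errors, respectively.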
A-I-PoCoTo
Combine the automatic post-correction with the interactive
post-correction of PoCoTo (work in progress).
Users review and approve (or reject) the additional entries of the
extended lexicon.
Users can inspect all correction decisions carried out (or not carried
out) and revert (or apply) them.
The trained base models for the automatic post-correction can be
further improved with the manually corrected document.
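The interaction described above amounts to letting a user override each automatic decision. Since A-I-PoCoTo is described as work in progress, the following is only a hypothetical data model with invented names, not the tool's actual design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorrectionDecision:
    ocr: str                            # original OCR token
    candidate: str                      # best-ranked correction candidate
    auto_applied: bool                  # outcome of the automatic Decision step
    user_choice: Optional[bool] = None  # None means not yet reviewed

    @property
    def applied(self):
        """The user's choice, if any, overrides the automatic decision."""
        return self.auto_applied if self.user_choice is None else self.user_choice

    @property
    def text(self):
        return self.candidate if self.applied else self.ocr

decisions = [
    CorrectionDecision("tbeil", "theil", auto_applied=False),
    CorrectionDecision("vnnd", "und", auto_applied=True),
    CorrectionDecision("Hauß", "Haus", auto_applied=True),
]
decisions[0].user_choice = True   # user applies a correction that was held back
decisions[1].user_choice = False  # user reverts an automatic correction
print([d.text for d in decisions])
```

The reviewed decisions could then serve as additional training data, in line with the last point on the slide.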
Summary
Automatic post-correction can improve accuracy
Lexicon Extension step does not help → leave out or use only after
manual inspection
Feature-based re-ranking step improves the ranking of the profiler
Automatic post-correction is too cautious → change training of
Decision step to make it more courageous
Automatic post-correction can support the interactive post-correction
General problem: different alphabets between OCR-engines, Profiler
(and ground-truth)
