Session1 04.florian fink

A-I-PoCoTo — Combining Automated and Interactive
OCR Postcorrection
Tobias Englmeier, Florian Fink and Klaus U. Schulz
9. May 2019
Florian Fink (CIS), A-I-PoCoTo, 9. May 2019

Overview
Automatic post-correction (A-PoCoTo)
Evaluation results
Automatic interactive post-correction (A-I-PoCoTo)
Summary

A-PoCoTo
Automatic post-correction of OCR results of historical documents using
supervised machine learning.
Multiple OCRs (OCR1, OCR2, ..., OCRn) can be used
3 steps with two profiling rounds
3 classifiers for 1, 2, ..., n OCRs
Classifiers are trained using logistic regression
Developed as a module of the OCR-D project [1]
[1] http://www.ocr-d.de/
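
As a rough illustration of what training a classifier with logistic regression involves, the following sketch fits a binary model with plain gradient descent. The single feature (number of agreeing OCRs), the toy labels, the learning rate and all function names are assumptions made for this example; the slides do not describe the actual training setup.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Return True if the model assigns probability >= 0.5."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(score) >= 0.5


# Toy data (hypothetical): one feature, the number of agreeing OCRs.
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic_regression(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```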

PoCoTo
PoCoTo (Post-Correction Tool) is a tool for manual interactive
post-correction of OCRed historical documents.
Initially a desktop application (2014) [2]
New version as web application (2017)
Profiling used for error detection and correction suggestions
Batch correction of (error) patterns
[2] Vobl, T., Gotscharek, A., Reffle, U., Ringlstetter, C., & Schulz, K. U. (2014, May).
PoCoTo - an open source system for efficient interactive postcorrection of OCRed
historical texts. In Proceedings of the First International Conference on Digital Access
to Textual Cultural Heritage (pp. 57-61). ACM.

Profile (global)
Given an OCRed historical text, the profiling derives a ‘statistical picture’
(guess) of the language in the document using various background lexica [3]:
OCR errors and OCR error series
Historical patterns of the form mod → hist (t → th, ei → ey, ...)
Underlying modern words
The profile is used as a feature generator for the automatic
post-correction system
[3] Reffle, U., & Ringlstetter, C. (2013). Unsupervised profiling of OCRed historical
documents. Pattern Recognition, 46(5), 1346-1357.
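
To make the notion of historical patterns of the form mod → hist concrete, the sketch below expands a modern word into possible historical spellings. Only the two patterns t → th and ei → ey come from the slide; the function name and the simplification that a pattern is applied either to all occurrences or to none are assumptions, and the real profiler works differently in detail.

```python
from itertools import combinations

# The two example patterns from the slide, written as (modern, historical).
PATTERNS = [("t", "th"), ("ei", "ey")]


def historical_variants(modern, patterns=PATTERNS):
    """Return all spellings obtained by applying any subset of the patterns."""
    variants = set()
    for size in range(len(patterns) + 1):
        for subset in combinations(patterns, size):
            word = modern
            for mod, hist in subset:
                # Simplification: rewrite every occurrence of the pattern.
                word = word.replace(mod, hist)
            variants.add(word)
    return sorted(variants)


variants = historical_variants("teil")
```

For the modern word "teil" this yields the unchanged word plus the three spellings produced by t → th, ei → ey, and both patterns together.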

Profile (local)
Profiling associates each token w_ocr of a document with a set of
interpretations of the form w_mod,cand →α w_hist,cand →β w_ocr.
The α (historical patterns) and β (OCR errors) channels can be empty
Interpretations have a weight
Each w_ocr has a ranked set of interpretations w_hist,cand
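
One possible way to represent such interpretations in code is sketched below. The class name, field names, the example token and the weights are all hypothetical; the slide only states that an interpretation links a modern candidate, a historical candidate and the OCR token through two possibly empty channels and carries a weight.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    modern: str        # modern candidate word
    historical: str    # historical candidate spelling
    ocr: str           # token as produced by the OCR
    alpha: list = field(default_factory=list)  # historical patterns, may be empty
    beta: list = field(default_factory=list)   # OCR errors, may be empty
    weight: float = 0.0


def ranked(interpretations):
    """Order interpretations by descending weight."""
    return sorted(interpretations, key=lambda i: i.weight, reverse=True)


# Hypothetical interpretations of the OCR token "tbeil".
candidates = [
    Interpretation("heil", "heil", "tbeil",
                   beta=[("", "t"), ("h", "b")], weight=0.2),
    Interpretation("teil", "theil", "tbeil",
                   alpha=[("t", "th")], beta=[("h", "b")], weight=0.7),
]
best = ranked(candidates)[0]
```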

A-PoCoTo
[Figure: processing pipeline. OCR1, OCR2, ..., OCRn → Alignment → Profiling → Lexicon Extension → Profiling → Ranking → Decision → A-I-PoCoTo]

Multiple OCRs
One master-OCR
Additional support-OCRs (optional)
OCRs are token-wise aligned with the master-OCR
Each w_ocr has n − 1 additional OCR tokens w_ocr2, w_ocr3, ..., w_ocrn
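
The slide does not say how the token-wise alignment is computed. As a stand-in, the following sketch aligns one support OCR to the master OCR with Python's difflib.SequenceMatcher; the function name and the rule of pairing tokens only in equal-length replace blocks are assumptions of this example, not the authors' method.

```python
from difflib import SequenceMatcher


def align_tokens(master, support):
    """Return one support token (or None) for every master token."""
    aligned = [None] * len(master)
    matcher = SequenceMatcher(a=master, b=support, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Pair tokens that match exactly, or that differ inside a block
        # with the same number of tokens on both sides.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                aligned[i1 + offset] = support[j1 + offset]
    return aligned


master = ["Die", "Grenzboten", "erfcheinen", "heute"]
support = ["Die", "Grenzboten", "erscheinen", "heute"]
pairs = list(zip(master, align_tokens(master, support)))
```

Master tokens without a counterpart keep None, so every master token always has exactly one slot per support OCR.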

A-PoCoTo — Lexicon Extension step
In the Lexicon Extension step a classifier tries to find tokens w_ocr that
are good candidates for extending the profiler’s back-end resources.
Classification starts after the first profiling round
Only w_ocr with a non-empty α or β channel are considered
Set of features for each w_ocr (token shape, candidate set, unigram
frequencies, agreeing OCRs, ...)
Classify w_ocr as True or False
True tokens are put into the extended lexicon for the second profiling
round
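
The flow of this step can be sketched as follows. The concrete feature encodings, the weight vector and the dictionary keys are hypothetical placeholders; the slides name the feature groups but give neither their definition nor any trained parameters.

```python
import math

# Hypothetical weights: bias, token shape, candidate set, unigram frequency,
# agreeing OCRs. A real model would learn these from ground truth.
WEIGHTS = [-3.0, 1.0, 1.0, 0.5, 1.5]


def le_features(token):
    """Turn one token record into a numeric feature vector."""
    return [
        1.0,                                      # bias term
        1.0 if token["ocr"].isalpha() else 0.0,   # token shape
        1.0 / (1 + token["n_candidates"]),        # size of candidate set
        math.log1p(token["unigram_freq"]),        # unigram frequency
        float(token["n_agreeing_ocrs"]),          # agreeing support OCRs
    ]


def is_lexicon_entry(features, weights=WEIGHTS):
    """Logistic-regression decision with a 0.5 probability threshold."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score)) >= 0.5


def extend_lexicon(tokens):
    """Collect tokens classified as True for the second profiling round."""
    lexicon = set()
    for token in tokens:
        if not (token["alpha"] or token["beta"]):
            continue  # only tokens with a non-empty channel are considered
        if is_lexicon_entry(le_features(token)):
            lexicon.add(token["ocr"])
    return lexicon


tokens = [
    {"ocr": "und", "alpha": [], "beta": [],
     "n_candidates": 1, "unigram_freq": 50, "n_agreeing_ocrs": 1},
    {"ocr": "Meniglich", "alpha": [("ä", "e")], "beta": [],
     "n_candidates": 2, "unigram_freq": 12, "n_agreeing_ocrs": 1},
    {"ocr": "x7#", "alpha": [], "beta": [("", "#")],
     "n_candidates": 0, "unigram_freq": 0, "n_agreeing_ocrs": 0},
]
extended = extend_lexicon(tokens)
```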

A-PoCoTo — Ranking step
In the Ranking step the profiler’s candidates are re-ranked.
Classification starts after the second profiling round
All w_hist,cand for each w_ocr are considered
Set of features for each w_hist,cand (token shape, candidate unigram
frequencies, agreeing OCRs, ...)
Classifier classifies each w_hist,cand as True or False
Candidates are re-ranked using the classifier’s confidence values
(in the range [−1, 1])
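
A minimal sketch of the re-ranking follows. It assumes that the confidence is derived from the classifier's probability p as 2p − 1, which is one common way to obtain a value in [−1, 1]; the slides do not state how the confidence is actually computed, and the probabilities below are invented.

```python
def confidence(probability):
    """Map a probability in [0, 1] to a confidence in [-1, 1] (assumed mapping)."""
    return 2.0 * probability - 1.0


def rerank(candidates, classifier_probability):
    """Return (confidence, candidate) pairs, highest confidence first."""
    scored = [(confidence(classifier_probability(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical classifier output for three candidates of one OCR token.
probabilities = {"heil": 0.4, "teil": 0.6, "theil": 0.9}
reranked = rerank(["heil", "teil", "theil"], probabilities.get)
```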

A-PoCoTo — Decision step
In the Decision step a classifier decides whether the best-ranked candidate
for a w_ocr should be used as the correction for that w_ocr.
The re-ranked candidate set for each w_ocr is considered
The confidence of the highest-ranked candidate and its distance to the
next candidate are the features
Classifier classifies the highest-ranked candidate as True or False
True candidates are used to correct the corresponding w_ocr
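
The two features named above can be computed from a re-ranked candidate list as in the sketch below. The weights, the treatment of a single candidate (distance measured to the lower bound −1) and the function names are assumptions for illustration only.

```python
import math

# Hypothetical parameters: bias, weight for top confidence, weight for gap.
BIAS, W_TOP, W_GAP = -1.0, 1.5, 2.0


def decision_features(ranked):
    """ranked: list of (confidence, candidate) pairs, best first."""
    top = ranked[0][0]
    # Assumption: with no second candidate, use the distance to -1.
    runner_up = ranked[1][0] if len(ranked) > 1 else -1.0
    return top, top - runner_up


def correct(ocr_token, ranked):
    """Return the top candidate if the classifier says True, else the token."""
    if not ranked:
        return ocr_token
    top, gap = decision_features(ranked)
    probability = 1.0 / (1.0 + math.exp(-(BIAS + W_TOP * top + W_GAP * gap)))
    return ranked[0][1] if probability >= 0.5 else ocr_token


clear_case = correct("tbeil", [(0.8, "theil"), (0.2, "teil")])
unclear_case = correct("teii", [(0.3, "teil"), (0.25, "heil")])
```

In the first call the top candidate is both confident and well separated, so it replaces the token; in the second call the two candidates are almost tied and the OCR token is left unchanged.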

A-PoCoTo — Evaluation results
Post-correction model trained on OCR-D [4] ground truth
Documents from 16th to 19th century
574 pages from 90 documents (3-6 pages per doc.)
Profile for each document separately
Evaluated two documents:
‘1557, Bodenstein, WieSichMeniglich’ (20 pages)
‘1841, Die Grenzboten’ (50 pages)
Four experiments:
1LE (Only master OCR)
1noLE (Only master OCR, LE step omitted)
2LE (One additional support OCR)
2noLE (One additional support OCR, LE step omitted)
[4] http://www.ocr-d.de/gt

A-PoCoTo — Evaluation results
2noLE provided the best improvement in accuracy:
‘1557, Bodenstein, WieSichMeniglich’:
OCR word accuracy: 65.63% → 69.81%
‘1841, Die Grenzboten’:
OCR word accuracy: 77.57% → 80.63%
Lexicon Extension does not offer a benefit
The Ranking step helps to find the best correction candidate
Support OCRs offer improvements (if not combined with the LE step)
Too many lost chances in both documents → the Decision step is too
hesitant with corrections
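
To put the reported word accuracies into perspective, the short calculation below derives the gain in percentage points and the share of word errors removed. The relative error reduction is a derived figure that does not appear on the slides; the function name is made up for this example.

```python
def improvement(before, after):
    """Return (gain in percentage points, relative error reduction in %)."""
    gain = after - before
    error_reduction = 100.0 * gain / (100.0 - before)
    return round(gain, 2), round(error_reduction, 1)


bodenstein = improvement(65.63, 69.81)   # (4.18, 12.2)
grenzboten = improvement(77.57, 80.63)   # (3.06, 13.6)
```

In both documents the 2noLE setting removes roughly one in eight of the word errors left by the master OCR.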

A-I-PoCoTo
Combine the automatic post-correction with the interactive
post-correction of PoCoTo (work in progress).
Users review and approve or reject the additional lexicon entries of the
extended lexicon
Users can inspect all correction decisions that were carried out (or not
carried out) and revert (or apply) them
The trained base models for the automatic post-correction can be
further improved with the manually corrected document
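
One way to model an inspectable correction decision is sketched below. Since this part is described as work in progress, the class and its fields are purely hypothetical and only illustrate that every automatic decision stays reversible for the user.

```python
from dataclasses import dataclass


@dataclass
class CorrectionDecision:
    ocr_token: str    # token as produced by the OCR
    candidate: str    # best-ranked correction candidate
    applied: bool     # True if the automatic Decision step used the candidate

    def toggle(self):
        """Revert an applied correction or apply a skipped one."""
        self.applied = not self.applied

    def text(self):
        """Token that currently appears in the corrected document."""
        return self.candidate if self.applied else self.ocr_token


decision = CorrectionDecision("tbeil", "theil", applied=True)
before_review = decision.text()
decision.toggle()            # the user reverts the automatic correction
after_review = decision.text()
```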

Summary
Automatic post-correction can improve accuracy
The Lexicon Extension step does not help → leave it out or use it only
after manual inspection
The feature-based re-ranking step improves the ranking of the profiler
Automatic post-correction is too cautious → change the training of the
Decision step to make it more courageous
Automatic post-correction can support the interactive post-correction
General problem: the OCR engines, the profiler (and the ground truth)
use different alphabets

