Session2 02.christian reul

Automatic Semantic Text Tagging on Historical Lexica
by Combining OCR and Typography Classification
A Case Study on Daniel Sanders’ Wörterbuch der Deutschen Sprache
Christian Reul1, Sebastian Göttel2, Uwe Springmann3,
Christoph Wick1, Kay-Michael Würzner2, and Frank Puppe1
1Chair for Artificial Intelligence and Applied Computer Science; University of Würzburg
2Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)
3Center for Information and Language Processing (CIS); LMU Munich
09.05.2019

 Great progress in the area of historical OCR on various materials.
 But raw textual OCR sometimes not sufficient.
 Typography within a lexicon
 represents semantic meaning.
 encodes a complex structure within the text (lemmata, definitions, grammatical
information, references, possible word formations, …).
 Goal: Thorough indexing of a historical lexicon by combining textual
OCR and typography classification.
1
Motivation

 Treat the problem as two individual sequence classification tasks:
 Textual OCR.
 Typography classification.
 Perform GT production, training, and recognition separately.
 Combine the results afterwards.
 Assign a distinct label to each of the typography classes:
Image: Hello World
OCR: Hello World
Typo: nnnnn bbbbb
 Use this representation to train an open source OCR engine.
2
Basic Idea
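
The per-character label representation can be illustrated with a minimal sketch. The helper name typography_line is hypothetical, and the labels n and b are simply the ones used in the Hello World example above; this is not the authors' tooling.

```python
def typography_line(text: str, word_labels: list[str]) -> str:
    """Expand one typography label per word into one label per character.

    Whitespace is kept as-is, so the result has the same length as `text`
    and can serve as the "transcription" of the same line image.
    """
    words = text.split(" ")
    if len(words) != len(word_labels):
        raise ValueError("need exactly one label per word")
    return " ".join(label * len(word) for word, label in zip(words, word_labels))


print(typography_line("Hello World", ["n", "b"]))  # nnnnn bbbbb
```

Because the typography line has the same shape as an ordinary transcription, an OCR engine can be trained on it without modification.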

 Wörterbuch der deutschen
Sprache by famous German
lexicographer Daniel Sanders
(turning 200 this November).
 Cooperation with the Berlin-
Brandenburg Academy of
Sciences and Humanities (BBAW).
 Printed between 1859 and 1865.
 Three part-volumes comprising almost
3,000 pages and ca. 800,000 text lines.
 Excellent print and scan quality.
3
Material: Sanders’ Dictionary I

 Main lemmata always bold Fraktur
(assigned label of typographical class l).
 Followed by grammatical properties in
Antiqua (a).
 Definitions in Fraktur (f).
 Typeface of the quotations divided into
 the author's name in a different Fraktur type (n),
 the page number in Antiqua (a).
 Possible word formations set
with letter-spacing (F).
4
Material: Sanders’ Dictionary II

 Binarisation and deskewing (ocropus-nlbin).
 Column segmentation.
 Simple whitespace-based approach.
 https://github.com/wrznr/column-detect
 Deskew columns separately (ocropus-nlbin).
 Line segmentation (ocropus-gpageseg).
 Keep rotational angles and segment/line
coordinates for later use.
5
Preprocessing and Segmentation
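
A whitespace-based column split can be sketched with a vertical projection profile. This is an illustrative assumption about how such an approach may work, not the code of the column-detect tool linked above; the function names and the min_gap threshold are hypothetical.

```python
def find_column_gaps(rows, min_gap=20):
    """Return (start, end) x-ranges that contain no ink and are >= min_gap wide.

    rows: binarised page as a list of equally long rows, 1 = ink, 0 = background.
    """
    ink_per_x = [sum(column) for column in zip(*rows)]  # vertical projection
    gaps, start = [], None
    for x, ink in enumerate(ink_per_x):
        if ink == 0 and start is None:
            start = x                      # a whitespace run begins
        elif ink > 0 and start is not None:
            if x - start >= min_gap:       # keep only wide runs
                gaps.append((start, x))
            start = None
    if start is not None and len(ink_per_x) - start >= min_gap:
        gaps.append((start, len(ink_per_x)))
    return gaps


def split_columns(rows, min_gap=20):
    """Return the (start, end) x-ranges of the text columns between wide gaps."""
    edges = [0]
    for gap_start, gap_end in find_column_gaps(rows, min_gap):
        edges += [gap_start, gap_end]
    edges.append(len(rows[0]))
    return [(a, b) for a, b in zip(edges[0::2], edges[1::2]) if b > a]
```

On a real scan the profile is rarely exactly zero between columns, so a small ink tolerance instead of the strict zero test would likely be needed.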

 Open Source OCR engine Calamari.
 https://github.com/Calamari-OCR
 Great recognition capabilities (CNN-LSTM) and very fast (GPU support).
 Natively supports accuracy improving techniques (see below).
 Voting:
 Train model ensemble instead of a single model.
 Combine outputs via confidence voting.
 Better recognition results.
 Pretraining:
 Start training from an existing model instead of from scratch.
 Faster training and better recognition results.
6
OCR Basics
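
The idea behind confidence voting can be sketched for a single character position. The sketch assumes the outputs of the ensemble members are already aligned, which a real voter has to establish first; the function name and the data layout are hypothetical and do not reflect Calamari's internal API.

```python
from collections import defaultdict


def confidence_vote(predictions):
    """Pick the character with the highest summed confidence.

    predictions: one dict per model, mapping candidate character -> confidence.
    Returns (winning character, its confidence averaged over all models).
    """
    totals = defaultdict(float)
    for candidates in predictions:
        for char, conf in candidates.items():
            totals[char] += conf
    best = max(totals, key=totals.get)
    return best, totals[best] / len(predictions)


# Two models weakly prefer "c", one model is very sure about "e".
models = [{"c": 0.51, "e": 0.49}, {"c": 0.51, "e": 0.49}, {"e": 0.99, "c": 0.01}]
print(confidence_vote(models)[0])  # e
```

A plain majority vote would return "c" here; summing the confidences lets one very certain model outweigh two uncertain ones.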

 Manually transcribing the typography GT is
cumbersome and error-prone.
 Observation: The typography does not change
within a word.
 Idea: Use the OCR GT and label all characters
of a word at once.
 Example (to the right):
 Input (at the top): OCR GT and the line image.
 Transcription steps:
(1) The first word is highlighted and labelled at once.
(2-4) Repeating step 1 for the next words.
(5) All remaining words can be labelled in one go.
(6) Final OCR and typography GT result.
7
Ground Truth Production

 Voting ensemble consisting of five Calamari models.
 Highly performant mixed Fraktur model as a starting point:
 https://github.com/chreul/19th-century-fraktur-OCR
 Able to recognize 93 distinct characters (Sanders contains over 150).
 Calamari extended recognition output for each character:
 Voted probability for the most likely character and its top alternatives.
 Start and end positions.
8
Training and Recognition

 Alignment on word level:
Assign typography output to the words
based on the character positions.
 Typography voting:
Identify most likely label for each word
by confidence voting.
 Final output: JSON file containing:
 OCR and typography label for each word.
 Each word's minimum character confidence.
 Segment, line, and word bounding boxes
with respect to the original scan.
9
Combining the Outputs
Typography alignment for an example line. From top to bottom:
Line image with OCR whitespace positions (|). Textual OCR output.
Typography output with character positions ( ').
(Slightly flawed) textual typography output on character level.
Final combined output with typography classes assigned on word level.
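
The alignment and the typography voting can be sketched as follows. All field names (start, end, label, conf, char_confs) and the example values are hypothetical; the actual JSON schema of the workflow is not given on the slides.

```python
import json
from collections import defaultdict


def combine(ocr_words, typo_chars):
    """Assign one typography label to each OCR word.

    ocr_words:  dicts with text, start, end (x-range) and char_confs.
    typo_chars: dicts with label, start, end (x-range) and conf.
    A typography character belongs to the word that contains its centre;
    the label with the highest summed confidence wins.
    """
    result = []
    for word in ocr_words:
        votes = defaultdict(float)
        for ch in typo_chars:
            centre = (ch["start"] + ch["end"]) / 2
            if word["start"] <= centre < word["end"]:
                votes[ch["label"]] += ch["conf"]
        label = max(votes, key=votes.get) if votes else None
        result.append({
            "text": word["text"],
            "typography": label,
            "min_conf": min(word["char_confs"]),
        })
    return result


words = [
    {"text": "Haus", "start": 0, "end": 40, "char_confs": [0.99, 0.97, 0.62, 0.98]},
    {"text": "n.", "start": 50, "end": 70, "char_confs": [0.95, 0.90]},
]
typo = [
    {"label": "f", "start": 0, "end": 10, "conf": 0.9},
    {"label": "f", "start": 10, "end": 20, "conf": 0.8},
    {"label": "n", "start": 20, "end": 30, "conf": 0.4},  # isolated misclassification
    {"label": "f", "start": 30, "end": 40, "conf": 0.9},
    {"label": "a", "start": 50, "end": 60, "conf": 0.7},
    {"label": "a", "start": 60, "end": 70, "conf": 0.9},
]
print(json.dumps(combine(words, typo), indent=2))
```

The isolated n inside the first word is outvoted, so the word as a whole receives the label f.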

 Full set of training GT: 765 lines.
 Subsets (400, 200, 100, 50 lines) to examine the influence of the number of GT lines.
 Evaluation set: six columns comprising 630 lines.
 OCR: Character Error Rate (CER) calculated using Calamari’s eval script.
 Typography does not change within a word → Word Error Rate (WER) makes sense:
 Collapse each word in the voted output to a single character.
 Remove all whitespaces.
 Example: aaaa ffffffffff fff nnnnn ffff → affnf.
 Calculate CER using analogously preprocessed GT.
10
Experiments – Data and Performance Measures
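
The typography WER described above can be sketched in a few lines. The sketch uses a plain Levenshtein distance and normalises by the length of the collapsed GT, which is the usual CER convention and an assumption here; it is not Calamari's eval script.

```python
def collapse(line: str) -> str:
    """Reduce each word to a single label and drop all whitespace."""
    return "".join(word[0] for word in line.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def typography_wer(gt_line: str, pred_line: str) -> float:
    """CER of the collapsed lines, i.e. a word error rate for typography."""
    gt, pred = collapse(gt_line), collapse(pred_line)
    if not gt:
        raise ValueError("empty ground truth line")
    return edit_distance(gt, pred) / len(gt)


print(collapse("aaaa ffffffffff fff nnnnn ffff"))  # affnf
```

For example, if the fourth word were predicted as fffff instead of nnnnn, the collapsed prediction would be affff and the WER for this line would be 1/5.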

 More training lines → lower CER.
 Excellent CER of 0.35% when training
on all available lines.
 Most frequent errors: insertions and
deletions of whitespaces.
 Standard approaches cannot deal
with the peculiarities of the material.
11
Experiments – OCR
# Lines   Calamari   ABBYY
-         3.69%      10.28%
50        1.83%      -
100       1.05%      -
200       0.67%      -
400       0.43%      -
765       0.35%      -

 More training lines → lower WER.
 Correct typography label
assigned to over 98.5% of words.
 Data augmentation yields minor
improvements (1.38% WER).
 Most frequent errors: insertions
and deletions of words resulting
from misrecognized whitespaces.
 Short words especially
susceptible to errors.
12
Experiments – Typography
GT Pred. Count Perc.
f 15 10.0%
f 15 10.0%
a 10 6.7%
a f 6 4.0%
f F 4 2.7%
# Lines   WER
50        9.82%
100       4.08%
200       2.66%
400       1.72%
765       1.47%

 Typography recognition possible and very precise.
 Despite several very similar typography classes.
 Flexible approach using an open source OCR engine.
 Efficient GT production method.
 Main problem: insertion and deletion of whitespaces.
 Typography in Sanders’ dictionary ambiguous.
 Subsequent rule-based postprocessing step required
to produce TEI output.
 Enables complex search queries like: “show all
lemmata which include Goethe as a source”.
13
Discussion
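
A query such as the one above could be sketched on the word-level output as follows. This is a deliberately naive illustration: it assumes that every word labelled l starts a new entry, which the ambiguous typography does not guarantee, and the field names and the toy data are invented for the example rather than taken from the dictionary.

```python
def group_entries(words):
    """Start a new entry at every lemma word (label l); attach the rest to it."""
    entries = []
    for word in words:
        if word["typography"] == "l":
            entries.append({"lemma": word["text"], "words": []})
        elif entries:
            entries[-1]["words"].append(word)
    return entries


def lemmata_citing(entries, author):
    """Return the lemmata whose entry contains an author name (label n) matching `author`."""
    return [
        entry["lemma"]
        for entry in entries
        if any(w["typography"] == "n" and w["text"].startswith(author)
               for w in entry["words"])
    ]


# Invented toy data, not real dictionary content.
toy = [
    {"text": "Baum", "typography": "l"}, {"text": "m.", "typography": "a"},
    {"text": "Gewächs", "typography": "f"}, {"text": "Goethe", "typography": "n"},
    {"text": "12", "typography": "a"},
    {"text": "Bach", "typography": "l"}, {"text": "m.", "typography": "a"},
    {"text": "Schiller", "typography": "n"},
]
print(lemmata_citing(group_entries(toy), "Goethe"))  # ['Baum']
```

In practice the rule-based postprocessing mentioned above would replace the simple grouping rule before such queries become reliable.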

14
Example from the Online Dictionary (work in progress)

 Successful case study on a challenging
real-world dictionary.
 Hope / Aim: Generic workflow to
obtain complete electronic representations
of (historical) lexica.
 Further experiments needed (other lexica, different typographical attributes).
 Meta learner judging whitespaces proposed by the OCR and typography models.
 Type-specific OCR models to further increase the accuracy.
 Application on word instead of line level.
 Already promising results.
15
Conclusion and Future Work
Schweizerisches Idiotikon (https://www.idiotikon.ch)

Calamari: https://github.com/Calamari-OCR
 OCR4all: https://github.com/OCR4all
 GT Production: https://github.com/ChWick/ocrgtannotator
 Reul, Springmann, Wick, Puppe: Improving OCR Accuracy on Early Printed Books
by combining Pretraining, Voting, and Active Learning.
 Ul-Hasan, Afzal, Shafait, Liwicki, Breuel: A Sequence Learning Approach for
Multiple Script Identification.
 Wick, Reul, Puppe: Comparison of OCR Accuracy on Early Printed Books using
the Open Source Engines Calamari and OCRopus.
16
Thank you for your Attention!

18
Word and Type Statistics
Length # Perc.
1 711 5.8%
2 1,622 13.2%
3 2,883 23.6%
4 1,754 14.3%
5 1,254 10.2%
>5 4,018 32.8%
         a        f         F        l        n        All
Words    2,754    8,066     363      469      589      12,241
         22.5%    65.9%     3.0%     3.8%     4.8%     100.0%
Chars    8,365    40,682    2,636    3,416    3,424    58,523
         14.3%    69.5%     4.5%     5.8%     5.9%     100.0%
Length   3.04     5.04      7.26     7.28     5.81     4.78

19
OCR Errors per Type
Type   a            f             F            l            n
GT     2,580        21,936        747          1,768        333
50     76 (2.95%)   260 (1.19%)   36 (4.82%)   98 (5.54%)   22 (6.61%)
200    16 (0.62%)   64 (0.29%)    16 (2.14%)   51 (2.88%)   17 (5.11%)
765    4 (0.16%)    40 (0.18%)    17 (2.28%)   25 (1.41%)   8 (2.40%)

20
Typography – Error Analysis
 Postprocessing yields no noteworthy improvements on character level.
 Unexpected since: ffffnff → fffffff.
 Dominant errors: insertions and deletions.
 Missed whitespaces can introduce errors:
aaaaffff → aaaaaaaa (GT: aaaa ffff)
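
The character-level postprocessing and its failure mode can be sketched with a simple per-word majority vote. This is an assumption about what such a step may look like, not the authors' implementation; the function name is hypothetical.

```python
from collections import Counter


def smooth(line: str) -> str:
    """Replace every label within a word by the word's most frequent label."""
    smoothed = []
    for word in line.split(" "):
        if word:
            majority = Counter(word).most_common(1)[0][0]
            word = majority * len(word)
        smoothed.append(word)
    return " ".join(smoothed)


print(smooth("ffffnff"))    # fffffff
print(smooth("aaaa ffff"))  # aaaa ffff
print(smooth("aaaaafff"))   # aaaaaaaa
```

The last call shows the problem named above: once the whitespace between two words is missed, the vote runs over both words and overwrites the labels of the shorter one.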

 Using the OCRopus3 ocrodeg module:
 Data augmentation improves the results.
 The more augmentations the better –
but saturation quickly kicks in.
 The fewer real lines available, the bigger the effect.
21
Typography – Data Augmentation

22
OCR4all
 Goal: enable non-technical users to
independently capture historical
printings with high accuracy.
 Encapsulating a comprehensive OCR
workflow in a single Docker image.
 Platform-independent.
 Easy installation.
 Incorporating open source solutions
(OCRopus, Calamari, LAREX, …).
 Comfortable usage (Web-GUI).
 https://github.com/OCR4all
