Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics

Text Extraction from Infographics
Ansgar Scherp,
Kiel University and ZBW – Leibniz Information Centre for Economics, Germany
Falk Böschen,
Kiel University, Germany
LWA 2015 (KDML), Trier, Germany

Infographics Challenges
• Text with different font sizes
• Text with varying emphasis
• Text in different colors
• Text on different background colors
• Text rotated at different angles
• Text occluded by graphic elements
Slide [ 01 / 18 ]Falk Böschen and Ansgar Scherp
Initial presentation at [DocEng’15] → Now: Improve comparability and extensibility

Abstract Pipeline Idea
Input: Information graphic
1. RE: Extract regions from graphic
2. RC: Cluster regions into text and non-text elements
3. LC: Computation of text lines for orientation estimation
4. PRE: Preprocessing of text elements for OCR
5. OCR: Optical Character Recognition
6. POST: Post-correction of OCR result
Output: Text
[Figure: pipeline diagram. Region Extraction (RE) → Region Clustering (RC) → Text Line Computation (LC) → Preprocessing (PRE) → OCR → Postprocessing (POST)]
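The abstract pipeline can be read as a chain of exchangeable stages, which is what makes different approaches comparable. The following is only a minimal sketch of that idea: `run_pipeline` and `make_stub` are hypothetical names, and the stub stages merely record that they ran instead of processing an image.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; the function maps the previous output to the next input.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(graphic: Any, stages: List[Stage]) -> Any:
    """Apply each stage to the output of the previous one, in order."""
    data = graphic
    for _name, step in stages:
        data = step(data)
    return data


def make_stub(name: str) -> Stage:
    """Hypothetical placeholder stage: appends its own name to a trace list."""
    return (name, lambda trace: trace + [name])


STAGE_NAMES = ["RE", "RC", "LC", "PRE", "OCR", "POST"]
stages = [make_stub(name) for name in STAGE_NAMES]
trace = run_pipeline([], stages)
print(trace)
```

Replacing one stub with a real implementation leaves the other stages untouched, so two configurations that differ in a single step can be evaluated side by side.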

Excerpt of Related Work
Authors | Title | RE RC LC Pre OCR Post
Chiang & Knoblock | Recognizing text in raster maps | ✍ ✍ ✔ ✔ ✔ ✔
Jayant et al. | Automated tactile graphics translation: in the field | ✍ ✍ ✍ ✔ ✔
Sas & Zolnierek | Three-Stage Method of Text Region Extraction from Diagram Raster Images | ✔ ✔ ✔ ✔
Huang et al. | Associating Text and Graphics for Scientific Chart Understanding | ✔ ✍ ✔ ✔ ✔ ✍
Lu et al. | Automated analysis of images in documents for intelligent document search | ✔ ✔ ✔
Xu & Krauthammer | A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images | ✔ ✔
Chen et al. | DiagramFlyer: A Search Engine for Data-Driven Diagrams | ? ? ? ? ? ?
Böschen & Scherp | Multi-oriented Text Extraction from Information Graphics | ✔ ✔ ✔ ✔ ✔
Gllavata et al. | Adaptive Fuzzy Text Segmentation in Images with Complex Backgrounds Using Color and Texture | ✔ ✔
Fraz et al. | Exploiting colour information for better scene text detection and recognition | ✔ ✔ ✔ ✔ ✔
Liu & Samarabandu | Multiscale Edge-Based Text Extraction from Complex Images | ✔ ✔
Olszewska | Active contour based optical character recognition for automated scene understanding | ✔ ✔ ✔
Lu et al. | Scene text extraction based on edges and support vector regression | ✔ ✔ ✍

Example: Adaptive Binarization and Labeling
• Binarization based on Otsu’s method
• Extended by hierarchical computation using edge images for the split decision
• Connected Component Labeling with 8-neighbors
• Noise removal by region size thresholding
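The labeling and noise-removal steps can be sketched as follows. This is an illustrative pure-Python flood fill, not the authors' implementation; the function name `label_components`, the 0/1 image encoding, and the `min_size` parameter are assumptions made for the example.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]


def label_components(binary: List[List[int]], min_size: int = 1) -> List[List[Pixel]]:
    """Return the 8-connected foreground regions that have at least min_size pixels."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions: List[List[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting at an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            region: List[Pixel] = []
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                # Visit all 8 neighbors (the center offset is skipped because it is already seen).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and binary[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Noise removal: drop regions below the size threshold.
            if len(region) >= min_size:
                regions.append(region)
    return regions


image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
all_regions = label_components(image)
large_regions = label_components(image, min_size=2)
print(len(all_regions), len(large_regions))
```

The two diagonal pixels in the top-left corner form one region only because diagonal neighbors count under 8-connectivity; with 4-connectivity they would be two single-pixel regions.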

Example: Grouping Regions
• Number of clusters unknown
• Text is “dense” → DBSCAN
• DBSCAN does not necessarily produce text lines, which are required for reliable orientation estimation
Feature vector: $f = (x, y, w, h, r)^\top$
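A compact DBSCAN over such feature vectors might look like the sketch below. It is a textbook version of the algorithm under assumptions that the slide does not state: plain Euclidean distance on unscaled features, and illustrative values for `eps` and `min_pts`. The example vectors are made up.

```python
import math
from typing import List, Sequence

NOISE = -1
UNVISITED = -2


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE (-1) marks points in no cluster."""
    n = len(points)
    labels = [UNVISITED] * n

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
                continue
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Made-up feature vectors (x, y, w, h, r): two dense groups and one isolated region.
features = [
    (0, 0, 10, 12, 0.5), (12, 0, 10, 12, 0.5), (24, 1, 10, 12, 0.5),
    (200, 200, 10, 12, 0.5), (212, 200, 10, 12, 0.5), (224, 199, 10, 12, 0.5),
    (500, 50, 10, 12, 0.5),
]
labels = dbscan(features, eps=15.0, min_pts=2)
print(labels)
```

Because the number of clusters is not an input, DBSCAN fits the "number of clusters unknown" constraint; in practice the features would likely need scaling so that no single component dominates the distance.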

Example: Computing Text Lines
• Compute a Minimum Spanning Tree (MST) for each DBSCAN cluster using a reduced feature vector
• Split each MST (if necessary) by using the edge orientations
Reduced feature vector: $f' = (x, y)^\top$
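The sketch below builds an MST over the (x, y) centers of one cluster with Prim's algorithm and then splits it by edge orientation. The slide does not give the split criterion, so the rule used here (keep only edges within a `tolerance` of the most common edge orientation) is a hypothetical stand-in, as are all function names.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def minimum_spanning_tree(points: List[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges (i, j)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    if n:
        best_dist[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the cheapest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def edge_orientation(points: List[Point], edge: Edge) -> float:
    """Angle of an undirected edge in degrees, in [0, 180)."""
    (x1, y1), (x2, y2) = points[edge[0]], points[edge[1]]
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0


def split_by_orientation(points: List[Point], edges: List[Edge],
                         tolerance: float = 20.0) -> List[List[int]]:
    """Hypothetical split rule: drop edges that deviate from the dominant orientation."""
    def diff(a: float, b: float) -> float:
        d = abs(a - b) % 180.0
        return min(d, 180.0 - d)

    angles = [edge_orientation(points, e) for e in edges]
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    if angles:
        # Dominant orientation: the edge angle that most other edges agree with.
        ref = max(angles, key=lambda a: sum(diff(a, b) <= tolerance for b in angles))
        for (i, j), angle in zip(edges, angles):
            if diff(angle, ref) <= tolerance:
                parent[find(i)] = find(j)
    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two stacked horizontal lines that ended up in one cluster.
centers = [(0, 0), (10, 0), (20, 0), (30, 0), (0, 40), (10, 40), (20, 40), (30, 40)]
mst = minimum_spanning_tree(centers)
lines = split_by_orientation(centers, mst)
print(lines)
```

In this example the single vertical MST edge that joins the two rows is removed, so each row becomes its own text line candidate.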

Example: Estimating the Orientation of Text Lines
• Transform the center of mass coordinates of each element of every
cluster into a discretized Hough space (one for each cluster)
→ a line/curve for each center of mass in Hough space
• Hough space discretized to 180 degrees in 1-degree steps
• Find the maximal value to obtain the orientation of the cluster
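The voting scheme described above can be sketched like this. The normal form rho = x·cos(theta) + y·sin(theta), the `rho_step` bin width, and the function name are assumptions for the example; the 180 one-degree steps follow the slide.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def estimate_orientation(centers: Iterable[Tuple[float, float]], rho_step: float = 1.0) -> int:
    """Return the dominant line orientation of one cluster in degrees (0..179)."""
    points = list(centers)
    votes: Counter = Counter()
    for theta in range(180):  # 1-degree steps
        t = math.radians(theta)
        cos_t, sin_t = math.cos(t), math.sin(t)
        for x, y in points:
            # Each center of mass traces a sinusoid in (theta, rho) space.
            rho = x * cos_t + y * sin_t
            votes[(theta, round(rho / rho_step))] += 1
    (theta_max, _rho_bin), _count = max(votes.items(), key=lambda item: item[1])
    # theta is the angle of the line's normal; the line itself is perpendicular to it.
    return (theta_max - 90) % 180


horizontal = [(float(x), 5.0) for x in range(0, 500, 50)]
diagonal = [(float(v), float(v)) for v in range(0, 500, 50)]
print(estimate_orientation(horizontal), estimate_orientation(diagonal))
```

Angles are expressed in the coordinate system of the input points; with image coordinates (y pointing down) the sign of the visual rotation is mirrored. Closely spaced points can produce ties between neighboring angles, so a real implementation may want to smooth the accumulator before taking the maximum.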

Example: Rotating Text Lines and Applying OCR
• Cut each text element out of the original image
• Rotate it according to the estimated angle
• Send it to an OCR engine for recognition
• Reasonable OCR engine: Tesseract (also used in Google Books)

Ground Truth Generation

Evaluation Setup
Text | "item 1" | "Item 1"
Unigrams | {e, i, m, t, 1} | {e, m, t, I, 1}
Bigrams | {em, it, te} | {em, te, It}
Trigrams | {ite, tem} | {tem, Ite}
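The example on this slide can be reproduced with a few lines of code. The sketch assumes that n-grams are built per whitespace-separated token and compared as sets, which matches the sets shown above; treating "item 1" as the ground truth and "Item 1" as the extracted text is an assumption made only for illustration.

```python
from typing import Set, Tuple


def char_ngrams(text: str, n: int) -> Set[str]:
    """Set of character n-grams, computed separately for each whitespace-separated token."""
    grams: Set[str] = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def precision_recall_f1(extracted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1-measure."""
    hits = len(extracted & truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth_text = "item 1"      # assumed ground truth
extracted_text = "Item 1"  # assumed OCR output

for n, name in ((1, "Unigrams"), (2, "Bigrams"), (3, "Trigrams")):
    p, r, f = precision_recall_f1(char_ngrams(extracted_text, n), char_ngrams(truth_text, n))
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

A single wrong character ("I" instead of "i") costs one unigram, one bigram, and one trigram, so the penalty grows in relative terms with n because longer n-grams are fewer.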

Preliminary Evaluation Setup: Baselines
Baseline #1:
• OCR engine Tesseract with layout analysis
• Single execution on the whole infographic
Baseline #2:
• OCR engine Tesseract with layout analysis
• Multiple executions on the whole infographic at various angles
• Merging the results of the different executions

Evaluation Set: 121 Infographics (Domain Economics)

Dataset/Result set Characteristics
System | # 1-grams | # 2-grams | # 3-grams | # Words | Word Length
TX Pipeline | AVG: 177.20, SD: 128.20 | AVG: 127.34, SD: 100.51 | AVG: 89.34, SD: 79.35 | AVG: 50.07, SD: 31.95 | AVG: 3.63, SD: 2.69
Baseline #1 | AVG: 106.30, SD: 87.71 | AVG: 80.17, SD: 69.12 | AVG: 60.79, SD: 54.54 | AVG: 25.21, SD: 22.12 | AVG: 4.15, SD: 2.25
Baseline #2 | AVG: [missing], SD: 125.56 | AVG: 100.20, SD: 98.20 | AVG: 75.08, SD: 78.10 | AVG: 35.25, SD: 33.94 | AVG: 4.08, SD: 1.95
Ground Truth | AVG: 150.65, SD: 122.28 | AVG: 115.93, SD: 103.09 | AVG: 84.95, SD: 85.61 | AVG: 35.46, SD: 22.24 | AVG: 4.22, SD: 1.48
• Our pipeline extracts more characters and words than are present in the ground truth
→ Increased chance of recognizing all the textual information
• The baselines extract fewer characters and words than are present in the ground truth
→ They miss some text components
• The standard deviation is high in general
→ Infographics are very heterogeneous

Preliminary Evaluation Results
System | n-gram | Precision | Recall | F1-measure
TX Pipeline | 1 | AVG: 0.50, SD: 0.41 | AVG: 0.68, SD: 0.36 | AVG: 0.47, SD: 0.39
TX Pipeline | 2 | AVG: 0.58, SD: 0.39 | AVG: 0.54, SD: 0.38 | AVG: 0.54, SD: 0.34
TX Pipeline | 3 | AVG: 0.52, SD: 0.39 | AVG: 0.48, SD: 0.37 | AVG: 0.49, SD: 0.37
Baseline #1 | 1 | AVG: 0.37, SD: 0.36 | AVG: 0.48, SD: 0.36 | AVG: 0.36, SD: 0.35
Baseline #1 | 2 | AVG: 0.42, SD: 0.33 | AVG: 0.42, SD: 0.34 | AVG: 0.42, SD: 0.33
Baseline #1 | 3 | AVG: 0.42, SD: 0.31 | AVG: 0.42, SD: 0.31 | AVG: 0.36, SD: 0.33
Relative Improvement | 1 | 35.14 % | 41.67 % | 30.06 %
Relative Improvement | 2 | 38.10 % | 28.57 % | 28.57 %
Relative Improvement | 3 | 23.81 % | 14.29 % | 36.11 %
System | n-gram | Precision | Recall | F1-measure
TX Pipeline | 1 | AVG: 0.50, SD: 0.41 | AVG: 0.68, SD: 0.36 | AVG: 0.47, SD: 0.39
TX Pipeline | 2 | AVG: 0.58, SD: 0.39 | AVG: 0.54, SD: 0.38 | AVG: 0.54, SD: 0.34
TX Pipeline | 3 | AVG: 0.52, SD: 0.39 | AVG: 0.48, SD: 0.37 | AVG: 0.49, SD: 0.37
Baseline #2 | 1 | AVG: 0.37, SD: 0.37 | AVG: 0.51, SD: 0.38 | AVG: 0.36, SD: 0.36
Baseline #2 | 2 | AVG: 0.42, SD: 0.34 | AVG: 0.42, SD: 0.35 | AVG: 0.42, SD: 0.34
Baseline #2 | 3 | AVG: 0.42, SD: 0.32 | AVG: 0.42, SD: 0.32 | AVG: 0.42, SD: 0.32
Relative Improvement | 1 | 35.14 % | 33.33 % | 30.06 %
Relative Improvement | 2 | 38.10 % | 28.57 % | 28.57 %
Relative Improvement | 3 | 23.81 % | 14.29 % | 16.67 %
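The relative improvements appear to be computed as the difference between the pipeline average and the baseline average, divided by the baseline average. The slides do not state the formula, so the helper below is an assumption; it reproduces, for example, the 1-gram precision entry from the rounded averages. Entries derived from unrounded averages may differ slightly.

```python
def relative_improvement(pipeline: float, baseline: float) -> float:
    """Relative improvement of the pipeline over the baseline, in percent."""
    return (pipeline - baseline) / baseline * 100.0


# Precision for 1-grams: TX Pipeline 0.50 vs. baseline 0.37
print(f"{relative_improvement(0.50, 0.37):.2f} %")
# Recall for 1-grams: TX Pipeline 0.68 vs. Baseline #2 0.51
print(f"{relative_improvement(0.68, 0.51):.2f} %")
```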

Preliminary Evaluation: Orientation Distributions
Here, “horizontal” means within ±15°, based on Tesseract’s rotation tolerance

Preliminary Evaluation: Levenshtein Distance
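For reference, the plain Levenshtein (edit) distance between two strings can be computed as sketched below. This is the standard dynamic-programming formulation; how the distance is normalized or aggregated for the values reported on these slides is not stated, so the sketch makes no attempt to reproduce them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("item 1", "Item 1"))
```

Unlike the set-based n-gram measures, the edit distance is sensitive to character order and repetition, which makes it a useful complementary view on OCR quality.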

Extreme Examples
Best Result
P/R/F | TX | BL1 | BL2
Unigram | 0.95/0.95/0.95 | 0.02/0.26/0.02 | 0.02/0.26/0.02
Bigram | 0.92/0.92/0.92 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Trigram | 0.92/0.92/0.92 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Levenshtein | 0.14 | 3.69 | 3.21
Worst Result
P/R/F | TX | BL1 | BL2
Unigram | 0.02/0.45/0.02 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Bigram | 0.00/0.00/0.00 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Trigram | 0.00/0.00/0.00 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Levenshtein | 3.47 | 0.14 | 0.14

Conclusion and Future Work
Conclusion
• Automated pipeline for text extraction from infographics
• Independent of infographic type (no special knowledge required)
Future Work
• Improvements necessary for individual/broken characters, occlusion, dotted lines, shading, super-/subscripts, …
• Make different approaches comparable (implementations)
• Improved evaluation framework for different configurations
• Test of alternative OCR engines
• Expanding the ground truth set for extensive evaluation

Questions?
Ansgar Scherp
ZBW – Leibniz Information
Centre for Economics
and Kiel University
Germany
asc@informatik.uni-kiel.de
Falk Böschen
Kiel University
Germany
fboe@informatik.uni-kiel.de
http://www.kd.informatik.uni-kiel.de/en

The Road Ahead …

Structure of our Text Extraction Pipeline
Phase 1: Text Line Localization
Phase 2: Text Extraction and Evaluation
Steps: Adaptive Binarization and Labeling → Grouping Regions into Text Elements → Computing of Text Lines → Estimating the Orientation of Text Lines → Rotation of Text Lines and Applying OCR → Evaluation

Otsu’s Method
[Figure: input image and output image. Source: https://en.wikipedia.org/wiki/Otsu's_method]
• Assumes two classes of pixels following a bimodal histogram (foreground pixels and background pixels)
• Calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal or, equivalently, their inter-class variance is maximal
• Extensions of the original method to multi-level thresholding exist
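A minimal sketch of the basic two-class method is shown below. It searches for the threshold that maximizes the between-class variance over a gray-level histogram; the function name, the flat pixel list, and the 256 gray levels are assumptions for the example. The hierarchical, edge-based extension used in the pipeline is not covered here.

```python
from typing import Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    weight0 = 0  # number of pixels in the lower class
    sum0 = 0     # sum of gray values in the lower class
    for t in range(levels):
        weight0 += hist[t]
        sum0 += t * hist[t]
        if weight0 == 0:
            continue  # lower class still empty
        weight1 = total - weight0
        if weight1 == 0:
            break     # upper class empty from here on
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        # Between-class variance up to the constant factor 1 / total**2.
        between = weight0 * weight1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Bimodal toy data: dark values around 10 and bright values around 200.
gray = [10] * 50 + [12] * 50 + [200] * 40 + [210] * 60
threshold = otsu_threshold(gray)
binary = [1 if p > threshold else 0 for p in gray]
print(threshold, sum(binary))
```

Maximizing the between-class variance is equivalent to minimizing the intra-class variance because the two always add up to the fixed total variance. Which side of the threshold counts as foreground depends on whether the text is dark on light or light on dark.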
