Text Extraction from Infographics
Ansgar Scherp,
Kiel University and ZBW – Leibniz Information Centre for Economics, Germany
Falk Böschen,
Kiel University, Germany
LWA 2015 (KDML), Trier, Germany
Infographics Challenges
• Text with different font sizes
• Text with varying emphasis
• Text in different colors
• Text on different background colors
• Text rotated at different angles
• Text occluded by graphic elements
Slide [ 01 / 18 ]Falk Böschen and Ansgar Scherp
Initial presentation at [DocEng’15] → Now: Improve comparability and extensibility
Abstract Pipeline Idea
Input: Information Graphic
1. RE: Extract regions from graphic
2. RC: Cluster regions into text and non-text elements
3. LC: Computation of text lines for orientation estimation
4. PRE: Preprocessing of text elements for OCR
5. OCR: Optical Character Recognition
6. POST: Post-correction of OCR result
Output: Text
[Diagram: Region Extraction (RE) → Region Clustering (RC) → Text Line Computation (LC) → Preprocessing (PRE) → OCR → Postprocessing (POST)]
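The six stages can be read as functions chained one after another, which is what makes individual stages exchangeable. The Python sketch below illustrates only that composition. `run_pipeline`, `make_stub`, and the stub stages are hypothetical names introduced here; the stubs merely record the stage order and do not process an image.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; the function maps the previous
# stage's output to the next stage's input.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], graphic: Any) -> Any:
    """Feed the output of each stage into the next one, in order."""
    data = graphic
    for _name, stage in stages:
        data = stage(data)
    return data


def make_stub(name: str) -> Callable[[list], list]:
    """Hypothetical stand-in for a real stage: appends its name to a trace."""
    return lambda trace: trace + [name]


STAGE_NAMES = ["RE", "RC", "LC", "PRE", "OCR", "POST"]
stages = [(name, make_stub(name)) for name in STAGE_NAMES]

print(run_pipeline(stages, []))
```

Swapping one entry of `stages` for another implementation leaves the rest of the chain untouched, which is one way to compare different configurations of the same pipeline.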
Excerpt of Related Work
Authors | Title | RE RC LC Pre OCR Post
Chiang & Knoblock | Recognizing text in raster maps | ✍ ✍ ✔ ✔ ✔ ✔
Jayant et al. | Automated tactile graphics translation: in the field | ✍ ✍ ✍ ✔ ✔
Sas & Zolnierek | Three-Stage Method of Text Region Extraction from Diagram Raster Images | ✔ ✔ ✔ ✔
Huang et al. | Associating Text and Graphics for Scientific Chart Understanding | ✔ ✍ ✔ ✔ ✔ ✍
Lu et al. | Automated analysis of images in documents for intelligent document search | ✔ ✔ ✔
Xu & Krauthammer | A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images | ✔ ✔
Chen et al. | DiagramFlyer: A Search Engine for Data-Driven Diagrams | ? ? ? ? ? ?
Böschen & Scherp | Multi-oriented Text Extraction from Information Graphics | ✔ ✔ ✔ ✔ ✔
Gllavata et al. | Adaptive Fuzzy Text Segmentation in Images with Complex Backgrounds Using Color and Texture | ✔ ✔
Fraz et al. | Exploiting colour information for better scene text detection and recognition | ✔ ✔ ✔ ✔ ✔
Liu & Samarabandu | Multiscale Edge-Based Text Extraction from Complex Images | ✔ ✔
Olszewska | Active contour based optical character recognition for automated scene understanding | ✔ ✔ ✔
Lu et al. | Scene text extraction based on edges and support vector regression | ✔ ✔ ✍
Example: Adaptive Binarization and Labeling
• Binarization based on Otsu's method
• Extended by hierarchical computation using edge images for split-decision
• Connected Component Labeling with 8-neighbors
• Noise removal by region size thresholding
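The labeling and noise-removal steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the image is already binarized into a list of rows with 1 for foreground, and `label_components` and `min_size` are hypothetical names chosen here. The hierarchical, edge-based extension of the binarization is not covered.

```python
from collections import deque


def label_components(binary, min_size=1):
    """Group foreground pixels (value 1) into 8-connected regions.

    Returns a list of regions, each a list of (row, col) pixels.
    Regions with fewer than min_size pixels are dropped as noise.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over the 8-neighborhood.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) >= min_size:
                regions.append(region)
    return regions


image = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
# The two diagonal pixels at the top left count as one region under
# 8-connectivity; the isolated pixel at (1, 4) is removed as noise.
print(len(label_components(image, min_size=2)))
```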
Example: Grouping Regions
• Number of clusters unknown
• Text is “dense” → DBSCAN
• DBSCAN does not necessarily produce text lines, which are required for reliable orientation estimation
Feature vector: $f = (x, y, w, h, r)^\top$
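To make the clustering step concrete, here is a minimal, textbook DBSCAN in plain Python applied to five-component feature vectors. It is a sketch under assumptions: the slides do not define the components of $f$ or any scaling between them, so the sample vectors, `eps`, and `min_pts` below are made up for illustration, and the unweighted Euclidean distance is a simplification.

```python
import math


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical regions as (x, y, w, h, r) tuples: two groups and one outlier.
features = [
    (10, 10, 5, 8, 1.0), (16, 10, 5, 8, 1.0), (22, 10, 5, 8, 1.0),
    (100, 50, 5, 8, 1.0), (106, 50, 5, 8, 1.0),
    (300, 300, 40, 40, 1.0),
]
print(dbscan(features, eps=8.0, min_pts=2))
```

No cluster count is passed in, which matches the first bullet: DBSCAN derives the number of clusters from the density parameters.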
Example: Computing Text Lines
• Compute a Minimum Spanning Tree for each DBSCAN Cluster using a reduced feature vector
• Split each MST (if necessary) by using the edge orientations
Reduced feature vector: $f' = (x, y)^\top$
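The two steps can be sketched as follows. The slides do not state the exact split rule, so this example makes an explicit assumption: an MST edge is cut when its orientation deviates from a given `reference` orientation by more than `tolerance` degrees. All function names and both parameters are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        )
        in_tree.add(j)
        edges.append((i, j))
    return edges


def edge_orientation(p, q):
    """Undirected orientation of segment pq in degrees, in [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def angle_gap(a, b):
    """Smallest difference between two orientations (period 180 degrees)."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def split_by_orientation(points, edges, reference, tolerance):
    """Cut edges that deviate from reference; return the remaining groups."""
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in edges:
        gap = angle_gap(edge_orientation(points[i], points[j]), reference)
        if gap <= tolerance:
            parent[find(i)] = find(j)  # keep the edge: merge both ends
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Two horizontal rows of region centers (x, y), stacked 12 units apart.
centers = [(0, 0), (10, 0), (20, 0), (0, 12), (10, 12), (20, 12)]
mst = minimum_spanning_tree(centers)
print(split_by_orientation(centers, mst, reference=0.0, tolerance=15.0))
```

In this toy input the MST needs exactly one vertical edge to connect the two rows; cutting it leaves one group per text line.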
Example: Estimating the Orientation of Text Lines
• Transform the center-of-mass coordinates of each element of every cluster into a discretized Hough space (one for each cluster)
→ a line/curve for each center of mass in Hough space
• Hough space discretized to 180 degrees in 1-degree steps
• Find the maximal value to obtain the orientation of the cluster
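A minimal sketch of the voting scheme, assuming the usual normal form $\rho = x\cos\theta + y\sin\theta$ with $\theta$ in 1-degree steps from 0 to 179. The discretization of $\rho$ (`rho_step`) and the function name are assumptions made here; the slides only specify the angular resolution.

```python
import math


def estimate_orientation(centers, rho_step=1.0):
    """Vote in a (theta, rho) accumulator and return the winning line angle.

    Each center (x, y) traces a sinusoid rho = x*cos(theta) + y*sin(theta);
    the cell hit by the most sinusoids corresponds to the line on which
    most centers lie.
    """
    accumulator = {}
    for x, y in centers:
        for theta in range(180):
            t = math.radians(theta)
            rho = x * math.cos(t) + y * math.sin(t)
            cell = (theta, round(rho / rho_step))
            accumulator[cell] = accumulator.get(cell, 0) + 1
    (theta, _rho), votes = max(accumulator.items(), key=lambda kv: kv[1])
    # theta is the angle of the line's normal, so the line itself is
    # rotated by 90 degrees relative to it.
    return (theta + 90) % 180, votes


horizontal = [(0, 5), (10, 5), (20, 5), (30, 5)]
vertical = [(5, 0), (5, 10), (5, 20), (5, 30)]
print(estimate_orientation(horizontal))
print(estimate_orientation(vertical))
```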
Example: Rotating Text Lines and Applying OCR
• Cut each text element out of the original image
• Rotate it according to the estimated angle
• Send it to an OCR engine for recognition
• Reasonable OCR engine: Tesseract (also used in Google Books)
Ground Truth Generation
Evaluation Setup
         | item 1          | Item 1
Unigrams | {e, i, m, t, 1} | {e, m, t, I, 1}
Bigrams  | {em, it, te}    | {em, te, It}
Trigrams | {ite, tem}      | {tem, Ite}
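The example above can be reproduced with a short Python sketch. It rests on assumptions read off the slide rather than stated by it: n-grams are built per whitespace-separated token, are treated as sets, and are case-sensitive. Which of the two strings is ground truth is not given; here "item 1" plays that role and "Item 1" stands in for an OCR result.

```python
def char_ngrams(text, n):
    """Set of character n-grams, built per whitespace-separated token."""
    grams = set()
    for token in text.split():
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def precision_recall_f1(extracted, truth):
    """Set-based precision, recall, and F1 of extracted n-grams."""
    hits = len(extracted & truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth_text, ocr_text = "item 1", "Item 1"
for n in (1, 2, 3):
    p, r, f = precision_recall_f1(char_ngrams(ocr_text, n),
                                  char_ngrams(truth_text, n))
    print(n, round(p, 2), round(r, 2), round(f, 2))
```

A single wrong character lowers the scores more as n grows, because it affects every n-gram that contains it.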
Preliminary Evaluation Setup: Baselines
Baseline #1:
• OCR engine Tesseract with layout analysis
• Single execution on the whole infographic
Baseline #2:
• OCR engine Tesseract with layout analysis
• Multiple executions on the whole infographic at various angles
• Merging of the results of the different executions
Evaluation Set: 121 Infographics (Domain Economics)
Dataset/Result set Characteristics
All values are AVG (SD).
Result set   | # 1-grams       | # 2-grams       | # 3-grams     | # Words       | Word Length
TX Pipeline  | 177.20 (128.20) | 127.34 (100.51) | 89.34 (79.35) | 50.07 (31.95) | 3.63 (2.69)
Baseline #1  | 106.30 (87.71)  | 80.17 (69.12)   | 60.79 (54.54) | 25.21 (22.12) | 4.15 (2.25)
Baseline #2  | 135.08 (125.56) | 100.20 (98.20)  | 75.08 (78.10) | 35.25 (33.94) | 4.08 (1.95)
Ground Truth | 150.65 (122.28) | 115.93 (103.09) | 84.95 (85.61) | 35.46 (22.24) | 4.22 (1.48)
• Our pipeline extracts more characters and words than are present in the data
→ Increased chance to recognize all the textual information
• The baselines extract fewer characters and words than are present in the data
→ They evidently miss some text components
• There is a high standard deviation in general
→ Infographics are very heterogeneous
Preliminary Evaluation Results
TX Pipeline vs. Baseline #1; scores are AVG (SD).
System               | n-gram | Precision   | Recall      | F1-measure
TX Pipeline          | 1      | 0.50 (0.41) | 0.68 (0.36) | 0.47 (0.39)
TX Pipeline          | 2      | 0.58 (0.39) | 0.54 (0.38) | 0.54 (0.34)
TX Pipeline          | 3      | 0.52 (0.39) | 0.48 (0.37) | 0.49 (0.37)
Baseline #1          | 1      | 0.37 (0.36) | 0.48 (0.36) | 0.36 (0.35)
Baseline #1          | 2      | 0.42 (0.33) | 0.42 (0.34) | 0.42 (0.33)
Baseline #1          | 3      | 0.42 (0.31) | 0.42 (0.31) | 0.36 (0.33)
Relative Improvement | 1      | 35.14 %     | 41.67 %     | 30.06 %
Relative Improvement | 2      | 38.10 %     | 28.57 %     | 28.57 %
Relative Improvement | 3      | 23.81 %     | 14.29 %     | 36.11 %
TX Pipeline vs. Baseline #2; scores are AVG (SD).
System               | n-gram | Precision   | Recall      | F1-measure
TX Pipeline          | 1      | 0.50 (0.41) | 0.68 (0.36) | 0.47 (0.39)
TX Pipeline          | 2      | 0.58 (0.39) | 0.54 (0.38) | 0.54 (0.34)
TX Pipeline          | 3      | 0.52 (0.39) | 0.48 (0.37) | 0.49 (0.37)
Baseline #2          | 1      | 0.37 (0.37) | 0.51 (0.38) | 0.36 (0.36)
Baseline #2          | 2      | 0.42 (0.34) | 0.42 (0.35) | 0.42 (0.34)
Baseline #2          | 3      | 0.42 (0.32) | 0.42 (0.32) | 0.42 (0.32)
Relative Improvement | 1      | 35.14 %     | 33.33 %     | 30.06 %
Relative Improvement | 2      | 38.10 %     | 28.57 %     | 28.57 %
Relative Improvement | 3      | 23.81 %     | 14.29 %     | 16.67 %
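The relative improvement figures appear to be the pipeline's gain expressed as a percentage of the baseline score. The slides do not spell out the formula, so the following is a worked check under that assumption, using the rounded averages from the tables; `relative_improvement` is a name introduced here.

```python
def relative_improvement(pipeline, baseline):
    """Gain of the pipeline over the baseline, in percent of the baseline."""
    return (pipeline - baseline) / baseline * 100.0


# Unigram precision, TX Pipeline vs. either baseline (0.50 vs. 0.37).
print(f"{relative_improvement(0.50, 0.37):.2f} %")  # prints "35.14 %"
# Unigram recall, TX Pipeline vs. Baseline #2 (0.68 vs. 0.51).
print(f"{relative_improvement(0.68, 0.51):.2f} %")  # prints "33.33 %"
```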
Preliminary Evaluation: Orientation Distributions
Here, "horizontal" means within ±15°, based on Tesseract's rotation tolerance
Preliminary Evaluation: Levenshtein Distance
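For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the standard dynamic-programming formulation. How the slides aggregate or normalize the distance (the reported values are not integers) is not stated, so no normalization is attempted here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit cost per operation)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("item 1", "Item 1"))
```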
Extreme Examples
Best Result
P/R/F       | TX             | BL1            | BL2
Unigram     | 0.95/0.95/0.95 | 0.02/0.26/0.02 | 0.02/0.26/0.02
Bigram      | 0.92/0.92/0.92 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Trigram     | 0.92/0.92/0.92 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Levenshtein | 0.14           | 3.69           | 3.21
Worst Result
P/R/F       | TX             | BL1            | BL2
Unigram     | 0.02/0.45/0.02 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Bigram      | 0.00/0.00/0.00 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Trigram     | 0.00/0.00/0.00 | 0.00/0.00/0.00 | 0.00/0.00/0.00
Levenshtein | 3.47           | 0.14           | 0.14
Conclusion and Future Work
Conclusion
• Automated pipeline for text extraction from infographics
• Independent of infographic type (no special knowledge required)
Future Work
• Improvements necessary for individual/broken characters, occlusion, dotted lines, shading, super-/subscripts, …
• Make different approaches comparable (implementations)
• Improved evaluation framework for different configurations
• Test of alternative OCR engines
• Expanding the ground truth set for extensive evaluation
Questions?
Ansgar Scherp
ZBW – Leibniz Information
Centre for Economics
and Kiel University
Germany
asc@informatik.uni-kiel.de
Falk Böschen
Kiel University
Germany
fboe@informatik.uni-kiel.de
http://www.kd.informatik.uni-kiel.de/en
The Road Ahead …
Structure of our Text Extraction Pipeline
Phase 1: Text Line Localization
1. Adaptive Binarization and Labeling
2. Grouping Regions into Text Elements
3. Computing of Text Lines
4. Estimating the Orientation of Text Lines
Phase 2: Text Extraction and Evaluation
5. Rotation of Text Lines and Applying OCR
6. Evaluation
Otsu's Method
[Figure: input image and output image. Source: https://en.wikipedia.org/wiki/Otsu's_method]
• Assumes two classes of pixels following a bi-modal histogram (foreground pixels and background pixels)
• Calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal or, equivalently, their inter-class variance is maximal
• Extensions of the original method to multi-level thresholding exist
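The threshold search described above can be sketched directly on a gray-level histogram. This is the basic single-threshold method in plain Python, maximizing the inter-class variance; it does not include the hierarchical extension used in the pipeline. The function name and the 8-level sample histogram are illustrative assumptions.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the inter-class variance.

    Pixels with value <= t form one class, the rest the other class.
    histogram[i] is the number of pixels with gray level i.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_t, best_between = 0, -1.0
    w0 = 0    # pixel count of the lower class
    sum0 = 0  # sum of gray levels in the lower class
    for t, count in enumerate(histogram):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty: no valid split at this level
        mu0 = sum0 / w0
        mu1 = (weighted_total - sum0) / w1
        # Inter-class variance up to the constant factor 1 / total**2.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Bi-modal toy histogram with 8 gray levels: a dark and a bright mode.
histogram = [5, 8, 2, 0, 0, 3, 9, 4]
print(otsu_threshold(histogram))
```

Maximizing the inter-class variance selects the same threshold as minimizing the intra-class variance, which is why the loop needs only the running class weights and means.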

Formalization and Preliminary Evaluation of a Pipeline for Text Extraction From Infographics


Editor's Notes

  • #5: No uniform use of terms: Biomedical Image Topographic/Geographic/Raster Map Scientific Chart Chart Image Chart Diagram Diagram [Raster] Image Information Graphic Infographic Mathematical/Scholarly Figure Flow/Pie/Bar/Column Chart Column/Bar/Line Graph 2D Plot Scatterplot. No (automated) complete pipeline from infographic to text is described. The technical description is in many cases insufficient for reproduction. Comparison is difficult due to missing formalization.
  • #6: In computer vision and image processing, Otsu's method, named after Nobuyuki Otsu (大津展之, Ōtsu Nobuyuki), is used to automatically perform clustering-based image thresholding, that is, the reduction of a gray-level image to a binary image. The algorithm assumes that the image contains two classes of pixels following a bi-modal histogram (foreground pixels and background pixels). It then calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal or, equivalently (because the sum of pairwise squared distances is constant), so that their inter-class variance is maximal. Consequently, Otsu's method is roughly a one-dimensional, discrete analog of Fisher's Discriminant Analysis. The extension of the original method to multi-level thresholding is referred to as the Multi Otsu method. Source: https://en.wikipedia.org/wiki/Otsu's_method