Handwritten and Machine Printed Text Separation in Document Images using the Bag of Visual Words Paradigm

Konstantinos Zagoris1,2, Ioannis Pratikakis2, Apostolos Antonacopoulos1, Basilis Gatos3, Nikos Papamarkos2

1Pattern Recognition and Image Analysis (PRImA) Research Lab
School of Computing, Science and Engineering,
University of Salford, Greater Manchester, UK

2Department of Electrical and Computer Engineering
Democritus University of Thrace, Xanthi, Greece

3Institute of Informatics and Telecommunications,
National Center for Scientific Research “Demokritos”, Athens, Greece

Current state-of-the-art
Three (3) main approaches
• Text Line Level
• Word Level
• Character Level

Disadvantages
• Different Page Segmentation Algorithms
• Incompatible Feature Set

Bag of Visual Words Model (BoVWM)
• Inspired by Information Retrieval theory
• The content of an image is described by a set of “visual words”
• A “visual word” is expressed by a group of local features
• The best-known local feature is the Scale-Invariant Feature Transform (SIFT)
• Codebook Creation
  ◦ A codebook is defined by the set of clusters
  ◦ A “visual word” is the vector that represents the center of each cluster
  ◦ The codebook is analogous to a dictionary

Bag of Visual Words Model (BoVWM)
• Each visual entity is described by a BoVWM descriptor
• Each SIFT point belongs to a “visual word”
• That “visual word” is the one whose cluster center is closest to the point under a distance function (Euclidean, Manhattan)
• The descriptor reflects the frequency with which each visual word appears in the image
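
The assignment and counting steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses short 2-D tuples in place of real 128-dimensional SIFT vectors, the Euclidean distance, and a histogram normalised to sum to 1. The function names and the normalisation are assumptions made for the example.

```python
import math

def nearest_word(feature, codebook):
    """Index of the codebook center closest to `feature` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))

def bovw_descriptor(features, codebook):
    """Histogram of visual-word occurrences, normalised to sum to 1 (assumed)."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_word(feature, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy data: three visual words and four local features (hypothetical values).
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
features = [(1.0, 1.0), (9.0, 9.5), (0.5, 0.2), (1.0, 9.0)]
descriptor = bovw_descriptor(features, codebook)
print(descriptor)  # [0.5, 0.25, 0.25]
```

Swapping `math.dist` for a Manhattan distance changes only `nearest_word`; the descriptor length always equals the codebook size.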

Proposed Method

Original Image → Page Segmentation → Block Descriptor Extraction (BoVW model) → Classification → Final Result

Page Segmentation

Original Image → Locally Adaptive Binarisation Method [1] → Adaptive Run Length Smoothing Algorithm [2] → Final Result

1. B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39(3):317–327, 2006.
2. N. Nikolaou, M. Makridis, B. Gatos, N. Stamatopoulos, and N. Papamarkos. Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths. Image and Vision Computing, 28(4):590–604, 2010.
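
To give a feel for run length smoothing, here is a sketch of the classical, fixed-threshold horizontal variant. It is not the adaptive algorithm of [2], which chooses its thresholds locally. The sketch assumes 1 marks a foreground pixel and fills only background gaps that lie between two foreground pixels; `rlsa_row`, `rlsa_horizontal` and `max_gap` are names chosen for the example.

```python
def rlsa_row(row, max_gap):
    """Fill background runs of length <= max_gap that sit between foreground pixels."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_fg is not None and 0 < i - last_fg - 1 <= max_gap:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out

def rlsa_horizontal(image, max_gap):
    """Apply the row-wise smoothing to every row of a binary image."""
    return [rlsa_row(row, max_gap) for row in image]

row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
smoothed = rlsa_row(row, max_gap=2)
print(smoothed)  # [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

The short gap between the first two foreground pixels is merged into one block, while the longer gap is left open, which is how nearby characters are grouped into candidate text blocks.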

Block Descriptor Extraction
• This step creates the block descriptor using the BoVW model
• Codebook Properties
  ◦ It must be small enough to ensure a low computational cost
  ◦ It must be large enough to provide sufficiently high discrimination performance
• For the clustering stage, the k-means algorithm is employed due to its simplicity and speed
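
A codebook built with k-means could look roughly like the sketch below. It is a plain Lloyd-style loop over toy 2-D points rather than real SIFT descriptors, and it seeds the centers with evenly spaced samples so that the run is deterministic. The seeding, the iteration cap and the function names are assumptions for illustration; the slides do not specify them.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))

def kmeans(features, k, iterations=20):
    """Return k cluster centers; each center is one 'visual word'."""
    n = len(features)
    # Deterministic seeding (assumed): evenly spaced samples from the data.
    centers = [tuple(features[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda i: math.dist(f, centers[i]))
            clusters[nearest].append(f)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
codebook = kmeans(features, k=2)
print(codebook)  # [(0.5, 0.5), (10.5, 10.5)]
```

Here `k` is the codebook size, so it is the parameter that trades computational cost against discrimination performance.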

Block Descriptor Extraction
[Figure: an example text block, shown with its initial SIFT keypoints and its final SIFT keypoints]
• The SIFT keypoints are calculated on the greyscale version of the image
• Keypoints whose position in the binary image does not fall on a foreground pixel are rejected
• Each of the remaining local features is assigned a Visual Word from the Codebook
• A Visual Word Descriptor is formed based on the occurrences of each Visual Word of the Codebook in this particular block
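
The keypoint rejection step can be sketched as a simple mask lookup. The example assumes keypoints are given as sub-pixel (x, y) positions, that the binary image is a list of rows in which 1 marks foreground, and that positions are rounded to the nearest pixel. These conventions and the function name are assumptions for illustration.

```python
def keep_foreground_keypoints(keypoints, binary_image):
    """Keep only keypoints whose rounded position lands on a foreground pixel."""
    height = len(binary_image)
    width = len(binary_image[0]) if height else 0
    kept = []
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if inside and binary_image[row][col] == 1:
            kept.append((x, y))
    return kept

binary_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
keypoints = [(1.2, 1.0), (0.0, 0.0), (2.4, 2.1), (3.0, 1.0), (5.0, 5.0)]
kept = keep_foreground_keypoints(keypoints, binary_image)
print(kept)  # [(1.2, 1.0), (2.4, 2.1)]
```

Only the surviving keypoints would then be assigned visual words and counted in the block descriptor.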

Decision System
• A classifier decides whether the block contains handwritten text, machine-printed text, or neither (noise)
• Based on Support Vector Machines (SVMs)
• Conventional multi-class approaches: one against one, one against others
• Here, two SVMs are trained with the Radial Basis Function (RBF) kernel
  ◦ The first (SVM1) separates handwritten text from everything else
  ◦ The second (SVM2) separates machine-printed text from everything else

Decision System Algorithm

[Figure: two decision diagrams, SVM1 (Handwritten Text) and SVM2 (Machine-printed Text), each showing a sample and a support vector, with the distances labelled D1 and D2]
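
One plausible way to combine the two classifiers is sketched below. The slides do not spell out the rule, so this is an assumption: `d1` and `d2` stand for signed decision values of SVM1 and SVM2 (positive meaning "belongs to my class"), and when both SVMs claim the block the larger value wins. The function name and the tie-breaking are hypothetical.

```python
def classify_block(d1, d2):
    """Combine two one-against-others SVM outputs into a three-way decision.

    d1: signed decision value of SVM1 (handwritten vs. everything else)
    d2: signed decision value of SVM2 (machine-printed vs. everything else)
    """
    if d1 > 0 and d2 > 0:
        # Both SVMs claim the block: assumed rule, prefer the larger value.
        return "handwritten" if d1 >= d2 else "machine-printed"
    if d1 > 0:
        return "handwritten"
    if d2 > 0:
        return "machine-printed"
    return "noise"  # neither SVM accepts the block

print(classify_block(1.3, -0.4))   # handwritten
print(classify_block(-0.2, 0.9))   # machine-printed
print(classify_block(0.5, 0.8))    # machine-printed
print(classify_block(-1.0, -0.3))  # noise
```

The "noise" outcome needs no third classifier under this scheme: it is simply the case in which both one-against-others SVMs reject the block.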

Examples

[Figure: an original image alongside the output of the proposed method]

Evaluation Datasets
• 103 modified document images from the IAM Handwriting Database
• 100 representative images selected from the index cards of the UK Natural History Museum’s card archive (PRImA-NHM)
• The ground truth files adhere to the Page Analysis and Ground-truth Elements (PAGE) format framework
  ◦ http://datasets.primaresearch.org

Evaluation
The F-measure of each method.

Method                                                    IAM      PRImA-NHM
Upper Bound (Proposed Segmentation)                       0.9887   0.7985
Proposed Method (Proposed Segmentation and BoVW)          0.9886   0.7689
Gabor Filters (Proposed Segmentation and Gabor Filters)   0.7921   0.5702
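
For reference, the F-measure is the harmonic mean of precision and recall. The sketch below shows the standard formula only; the slides do not state how precision and recall were measured for these datasets, and the input values are made up for the example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure, F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical inputs, not taken from the evaluation above.
score = f_measure(0.5, 1.0)
print(round(score, 4))  # 0.6667
```

Because it is a harmonic mean, the score stays low unless both precision and recall are high.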

Page Segmentation Problems
• Binarization Failures
• Noise and Text Overlapping
• Handwritten and Machine-printed Text Overlapping

Thank You!
