A simple text-line segmentation method for handwritten documents
M. Ravi Kumar, R. Pradeep, B. S. Puneeth Kumar, and Prasad Babu
Dept. of P.G. Studies in Computer Science, Kuvempu University
Jnana Sahyadri, Shankaraghatta-577451
E-mail: ravi2142@yahoo.co.in, punithashp@gmail.com, kamitkar.prasad9@gmail.com


Abstract:
Text line segmentation is an important step because inaccurately segmented text lines cause errors in the recognition stage. The nature of handwriting makes text line segmentation very challenging. Text characteristics can vary in font, size, orientation, alignment, color, contrast, and background information. These variations make word detection complex and difficult, since handwritten text can vary greatly depending on the writer's skill, disposition and cultural background. In this work we propose a method that uses different intensity values to extract the text-lines.

Keywords:
Handwritten document, segmentation, text-lines

1. Introduction:
Segmentation of a document image into its basic entities, namely text lines and words, is considered a non-trivial problem in the field of handwritten document recognition. The difficulties that arise in handwritten documents make the segmentation procedure a challenging task. Different types of difficulties are encountered in text line segmentation and in word segmentation. In text line segmentation, the major difficulties include differences in skew angle between lines on the page or even along the same text line, overlapping words, and touching of adjacent text lines. Furthermore, the frequent appearance of accents in many languages (e.g. French, Greek) makes text line segmentation a challenging task. In word segmentation, the difficulties include skew and slant in the text line, punctuation marks along the text line, and the non-uniform spacing of words, which is a common residual in handwritten documents [1].

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis lies mainly in complex layout structure and degraded image quality, handwritten document analysis is difficult mainly because of the irregularity of layout and character shapes that originates from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition remain unsolved, although enormous efforts have been devoted to them and great advances have been made. Text line segmentation of handwritten documents is much more difficult than that of printed documents. Unlike printed documents, which have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text line detection techniques, such as projection analysis and K-nearest neighbor connected components (CCs) grouping, are not able to segment handwritten text lines successfully [2].

On the other hand, piece-wise projections are sensitive to variation in character size within text lines and to significant gaps between successive words. These occurrences also influence the effectiveness of smearing methods. In such cases, the results of two adjacent zones may be ambiguous, affecting the drawing of text-line separators along the document width. To deal with these problems, Papavassiliou et al. [3] introduce a smooth version of the projection profiles to over-segment each zone into candidate text and gap regions. These regions are then reclassified by applying an HMM formulation that enhances statistics from the whole document page. Starting from the left and moving to the right, separators of consecutive zones are combined by considering their proximity and the local foreground density [3].

These piece-wise projection based methods have a few shortcomings: (a) they generate too many potential separating lines, (b) the stripe width parameter is predefined, (c) text-lines should not have significant skew, and (d) if there is no potential piece-wise line in the first and last stripes, drawing a complete separating line becomes impossible in any of these algorithms. These shortcomings were observed in an experiment conducted on a number of text-pages. Some authors have also used skew information for text-line separation. In an unconstrained handwritten text-page, it is very difficult to detect the orientation of each line on the basis of the skew calculated for the entire page. Therefore, these methods may not work properly [4].
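The piece-wise projection idea discussed above can be illustrated with a minimal sketch. It assumes a binary page stored as a list of rows in which 1 marks an ink pixel; the function names, the number of stripes and the moving-average window are illustrative assumptions and are not taken from [3] or [4]. The sketch only shows how per-stripe profiles and their zero-valued gaps arise, not the HMM reclassification of [3].

```python
def stripe_profiles(image, num_stripes):
    """Split a binary image (rows of 0/1, 1 = ink) into vertical stripes and
    return one horizontal projection profile (ink count per row) per stripe."""
    width = len(image[0])
    stripe_width = max(1, width // num_stripes)
    profiles = []
    for s in range(num_stripes):
        left = s * stripe_width
        # The last stripe absorbs any leftover columns.
        right = width if s == num_stripes - 1 else left + stripe_width
        profiles.append([sum(row[left:right]) for row in image])
    return profiles


def smooth(profile, window=3):
    """Moving-average smoothing of a profile (window is assumed to be odd)."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        chunk = profile[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def gap_rows(profile):
    """Return (start, end) row ranges, end exclusive, where the profile is
    zero, i.e. candidate gaps between text lines inside one stripe."""
    gaps, start = [], None
    for y, value in enumerate(profile):
        if value == 0 and start is None:
            start = y
        elif value != 0 and start is not None:
            gaps.append((start, y))
            start = None
    if start is not None:
        gaps.append((start, len(profile)))
    return gaps
```

Because the stripe width is fixed in advance and each stripe is analysed on its own, the sketch also makes shortcomings (a) and (b) listed above easy to reproduce on skewed input.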



The Hough transform is employed in the field of document analysis for many purposes, such as skew detection, line detection, slant detection and text-line segmentation. It has been used for text-line segmentation in different scripts. A block-based Hough transform has been presented as a modification of the conventional Hough transform methodology. The algorithm partitions the connected component domain into three spatial sub-domains and applies a block-based Hough transform to detect the potential text lines [5].

Many efforts have been devoted to the difficult problem of handwritten text line segmentation. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters under the assumption of straight lines. Bottom-up methods group small units of the image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways. All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, on the other hand, are computationally complicated, and the design of a robust combination scheme is non-trivial [2].

A thinning operation has also been used by other researchers for text-line segmentation of Japanese and Indian text documents. Thinning algorithms followed by post-processing operations are applied to the entire background region of an input text image to detect the separating borderlines. Recently, some techniques have used level set, active contour and variational Bayes methods for text-line segmentation. In the approach based on density estimation and the level set method (LSM), a probability map is estimated from the input document image, where each element represents the probability of the original pixel belonging to a text line. The LSM is then used to determine the boundary evolution of neighboring text lines. In the active contour approach, a matched filter bank is first used to smooth the input text image. The central line of text-line components is then computed using ridges over the smoothed image. Finally, active contours (snakes) over the ridges are adapted to obtain the text-line segmentation result [4].

2. Related Work:

In this section, we give a brief review of recent work on text line and word segmentation in handwritten document images. As far as we know, the following techniques either achieved the best results on the corresponding test datasets or are elements of integrated systems for specific tasks. One of the most accurate methods uses piece-wise projection profiles to obtain an initial set of candidate lines and bivariate Gaussian densities to assign overlapping CCs to text lines [7]. Experimental results on a collection of 720 documents (English, Arabic and children's handwriting) show that 97.31% of text lines were segmented correctly. The writers mention that “a more intelligent approach to cut an overlapping component is the goal of future work”. A recent approach [8] uses a block-based Hough transform to detect lines and merging methods to correct false alarms. Although the algorithm achieves a 93.1% detection rate and a 96% recognition rate, it cannot flexibly follow variations of skew angle along the same text line and is not very precise in the assignment of accents to text lines. Li et al. [9] treat the text-line detection task as an image segmentation problem. They use a Gaussian window to convert a binary image into a smooth gray-scale image. They then adopt the level set method to evolve text-line boundaries, and finally geometrical constraints are imposed to group CCs or segments into text lines. They report pixel-level hit rates varying from 92% to 98% on different scripts and mention that “the major failures happen because two neighboring text lines touch each other significantly”. A similar method [10] evaluates eight different spatial measures between pairs of CCs to locate words in handwritten postal addresses. The best metric proved to be the one that combines the result of the minimum run-length method and the vertical overlap of two successive CCs. Additionally, this metric is adjusted using the results of a punctuation detection algorithm (periods and commas). Then, a suitable threshold is computed by an iterative procedure. The algorithm was tested on 1000 address images and yielded an error rate of about 10%.

Manmatha and Rothfeder [11] propose a scale space approach that is effective for noisy historical documents. The line image is filtered with an anisotropic Laplacian at several scales in order to produce blobs which correspond to portions of characters at small scales and to words at larger scales. The optimum scale is estimated by three different techniques (line height, page averaging and free search), of which line height showed the best results. Line segmentation in historical documents is a much more challenging task because of the large amount of noise. Feldbach and Tonnies [12] have proposed a bottom-up method for historical church documents that requires parameters to be set according to the type of handwriting. They report a 90% correct segmentation rate for constant parameter values, which rises to 97% for adjusted ones.

Another integrated system for such documents [13] creates a foreground/background transition count map to find probable locations of text lines and applies a min-cut/max-flow algorithm to separate initially connected text lines. The method achieves high accuracy (over 98%) on 20 images of George Washington's manuscripts.

3. Segmentation challenges:

In this section we discuss the challenges involved in the segmentation of the text-lines. When dealing with handwritten text, line
segmentation has to solve some obstacles that are uncommon in modern printed text [6]. Among the most predominant are:

3.1 Skewed lines: Lines of text in general are not straight. These lines are not parallel to each other.

Figure 3.1: Skewed Lines

3.2 Fluctuating lines: Lines of text are partially or fully connected to other text-lines.

Figure 3.2: Fluctuating Lines

3.3 Line proximity: Small gaps between neighboring text lines cause touching and overlapping of components, usually words or letters, between lines, and irregularity in geometrical properties of the line, such as line width, height, distance between words and lines, leftmost position etc.

Figure 3.3: Line proximity
Source: [6]

4. Motivation:
The above challenges motivated us to take up this work.

5. Proposed Method:

In the proposed method our objective is to identify the boundaries of the text-lines, which consists of two steps: (i) generating partial boundary lines and (ii) generating complete boundary lines. With the partial lines alone it is very difficult to differentiate between two text-lines, because the partial lines are broken and contain gaps. We therefore construct complete lines, which act as differentiators for identifying the text-lines. To construct a complete line we take the most frequent y co-ordinate value in each partial boundary line; these y co-ordinate values are the complete boundaries of the text-lines. In the last step we segment the text-lines by representing each text-line with a different color.

5.1 Generating partial text-line boundary:

The partial text-line boundary is generated by blocking the text-lines using morphological operations such as erosion. Blocking fills the holes and the gaps between words, which helps in drawing the partial line. The lines are drawn at the edge of every blocked text-line. However, these partial lines are not sufficient to differentiate the text-lines. This is shown in figure 5.1.

5.2 Generating complete text-line boundary:

The partial boundary lines obtained in the previous step are broken and therefore not sufficient for segmenting the text-lines, so we need to generate complete lines that are continuous. Using the partial lines, we draw the complete boundary lines with the help of the most frequent vertical points. These y co-ordinate values are used to differentiate between text-lines, and the complete boundary lines help in segmentation. This is shown in figure 5.2.

5.3 Text-line segmentation:

Once the complete boundary lines are drawn, it is easy to segment the text-lines by assigning a different value to the characters between each pair of lines. In this step we give a distinct intensity value to the characters of a text-line lying between two drawn lines. Through the different intensity values of the characters, the method distinguishes the different text-lines. The segmentation results for different languages are shown in figure 5.3.
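The labelling step of Section 5.3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the complete boundary lines are horizontal and given as a sorted list of y co-ordinates, that the image is a list of rows with 1 for ink, and that consecutive integers serve as the distinct intensity values. The name `label_text_lines` is hypothetical, and labelling is done per row rather than per character.

```python
from bisect import bisect_right


def label_text_lines(image, boundaries):
    """Give every ink pixel (value 1) an intensity that identifies its text-line.

    image      -- binary image as a list of rows, 1 = ink, 0 = background
    boundaries -- sorted y co-ordinates of the complete boundary lines
    Ink in rows above the first boundary gets label 1, ink between the first
    and second boundary gets label 2, and so on; background stays 0.
    """
    labelled = []
    for y, row in enumerate(image):
        # The number of boundaries at or above this row selects the band.
        label = bisect_right(boundaries, y) + 1
        labelled.append([label if pixel else 0 for pixel in row])
    return labelled
```

Mapping each label to a color then gives a picture in the spirit of figure 5.3. A per-character variant would label whole connected components instead of rows, so that ascenders and descenders crossing a boundary keep the value of their own line.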




Figure 5.1: Illustration of partial text-line boundary detection. (a) A handwritten document. (b) Blocked text-lines. (c) Partial boundary lines.
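The blocking and partial boundary steps of Section 5.1, illustrated in Figure 5.1, might look roughly like the sketch below. It is written under stated assumptions: the image is a list of rows with 1 for ink, blocking is approximated by a purely horizontal dilation of the ink (on a dark-on-white page this corresponds to eroding the bright background with a horizontal structuring element), and the boundary is taken at the lower edge of each blocked region. The function names and the `reach` parameter are hypothetical.

```python
def block_text(image, reach):
    """Horizontally dilate ink (1) by `reach` pixels on each side so that the
    gaps between neighbouring words are filled."""
    height, width = len(image), len(image[0])
    blocked = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                for nx in range(max(0, x - reach), min(width, x + reach + 1)):
                    blocked[y][nx] = 1
    return blocked


def partial_boundary(blocked):
    """Return (x, y) points on the lower edge of each blocked region: ink
    pixels whose neighbour directly below is background or the image border."""
    height, width = len(blocked), len(blocked[0])
    points = []
    for y in range(height):
        for x in range(width):
            below_is_empty = y + 1 == height or not blocked[y + 1][x]
            if blocked[y][x] and below_is_empty:
                points.append((x, y))
    return points
```

The resulting points follow the uneven lower contour of the blocks, which is why they form broken lines rather than a single separator per text-line.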




Figure 5.2: Illustration of complete text-line boundary detection. (a) Complete boundary lines with partial
boundary lines. (b) Complete boundary lines without partial boundary lines
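The construction of complete boundary lines from the most frequent y co-ordinate (Section 5.2, Figure 5.2) can be sketched as below. The paper does not say how partial boundary points are grouped into lines, so the grouping rule here (points whose y values lie within `max_gap` of the previous one belong to the same partial line) is an assumption made only for illustration, as are the function name and the default value.

```python
from collections import Counter


def complete_boundaries(points, max_gap=2):
    """Turn partial boundary points into one horizontal line per text-line.

    points  -- (x, y) points of the partial boundary lines
    max_gap -- a y value at most this far from the previous one is treated as
               part of the same partial line (assumed grouping rule)
    Returns the most frequent y value of each group, top to bottom.
    """
    ys = sorted(y for _, y in points)
    groups, current = [], []
    for y in ys:
        if current and y - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(y)
    if current:
        groups.append(current)
    # Counter.most_common(1) yields the y value with the highest count.
    return [Counter(group).most_common(1)[0][0] for group in groups]
```

Each returned y value stands for one continuous horizontal line across the page width, which matches the stated restriction of the method to text-lines without skew.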




Figure 5.3: Illustration of segmented text-lines. (a) Segmented English document. (b) Segmented Kannada document. (c) Segmented Hindi document. (d) Segmented English document.

6. Block diagram of proposed method:




7. Results and discussion:
We conducted experiments on 500 documents in Kannada, English, Hindi and Arabic (Kannada: 150, English: 200, Hindi: 100, Arabic: 50) and obtained accurate results.

8. Conclusion and future work:

In this work we concentrated mainly on extracting the required text-lines from a given document and obtained the specified text-lines accurately. The proposed method works well for segmenting the text-lines of handwritten documents. Our method works only for fixed word length and text-lines without skew; in future work we will improve the results by segmenting text-lines that have the limitations described above.

9. References:

[1] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, Text line and word segmentation of handwritten documents, Pattern Recognition (2008) pp. 3169–3183.
[2] Fei Yin, Cheng-Lin Liu, Handwritten Chinese text line segmentation by clustering with distance metric learning, Pattern Recognition (2009) pp. 3146–3157.
[3] Vassilis Papavassiliou, Themos Stafylakis, et al., Handwritten document image segmentation into text lines and words, Pattern Recognition (2010) pp. 369–377.
[4] Alireza Alaei, Umapada Pal, et al., A new scheme for unconstrained handwritten text-line segmentation, Pattern Recognition (2011) pp. 917–928.
[5] G. Louloudis, B. Gatos, I. Pratikakis, K. Halatsis, A block-based Hough transform mapping for text line detection in handwritten documents.
[6] Ashu Kumar, Simpel Rani Jindal, Galaxy Singla, Line segmentation using contour tracing, 2012, Vol. 3.
[7] Z. Razak, K. Zulkiflee, et al., Off-line handwriting text line segmentation: a review, International Journal of Computer Science and Network Security 8 (7) (2008) 12–20.
[8] G. Louloudis, B. Gatos, C. Halatsis, Text line detection in unconstrained handwritten documents using a block-based Hough transform approach, in: Proceedings of International Conference on Document Analysis and Recognition, 2007, pp. 599–603.
[9] Y. Li, Y. Zheng, D. Doermann, S. Jaeger, Script-independent text line segmentation in freestyle handwritten documents, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1313–1329.
[10] G. Seni, E. Cohen, External word segmentation of off-line handwritten text lines, Pattern Recognition 27 (1994) 41–52.
[11] R. Manmatha, J.L. Rothfeder, A scale space approach for automatically segmenting words from historical handwritten documents, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1212–1225.
[12] M. Feldbach, K.D. Tonnies, Line detection and segmentation in historical church registers, in: Proceedings of International Conference on Document Analysis and Recognition, 2001, pp. 743–747.
[13] D.J. Kennard, W.A. Barrett, Separating lines of text in free-form handwritten historical documents, in: Proceedings of International Workshop on Document Image Analysis for Libraries, 2006, pp. 12–23.




                                                                                                                                 6

More Related Content

PDF
BAG OF VISUAL WORDS FOR WORD SPOTTING IN HANDWRITTEN DOCUMENTS BASED ON CURVA...
PDF
IRJET- A Survey on MSER Based Scene Text Detection
PDF
Segmentation of Handwritten Chinese Character Strings Based on improved Algor...
PDF
The effect of training set size in authorship attribution: application on sho...
PDF
IRJET- A Survey Paper on Text Summarization Methods
PDF
Cc35451454
PDF
Persian arabic document segmentation based on hybrid approach
PDF
A MODEL TO CONVERT WAVE–FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...
BAG OF VISUAL WORDS FOR WORD SPOTTING IN HANDWRITTEN DOCUMENTS BASED ON CURVA...
IRJET- A Survey on MSER Based Scene Text Detection
Segmentation of Handwritten Chinese Character Strings Based on improved Algor...
The effect of training set size in authorship attribution: application on sho...
IRJET- A Survey Paper on Text Summarization Methods
Cc35451454
Persian arabic document segmentation based on hybrid approach
A MODEL TO CONVERT WAVE–FORM-TEXT TO LINEAR-FORM-TEXT FOR BETTER READABILITY ...

Viewers also liked (16)

PPT
Spotting Customers in Trouble
PDF
A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
DOCX
Word segmentation method for handwritten documents based on structured learning
PDF
Holistic Approach for Arabic Word Recognition
PPTX
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
PDF
CHARACTER RECOGNITION USING NEURAL NETWORK WITHOUT FEATURE EXTRACTION FOR KAN...
PPTX
Segmentation - based Historical Handwritten Word Spotting using document-spec...
PPT
Online handwritten script recognition
PDF
Performance of Statistics Based Line Segmentation System for Unconstrained H...
PPT
Arabic Handwritten Script Recognition Towards Generalization: A Survey
PPT
Devanagari Character Recognition
PPTX
Artificial Neural Network / Hand written character Recognition
PPTX
Text extraction From Digital image
PPT
optical character recognition system
PPTX
Optical Character Recognition( OCR )
DOCX
Hand Written Character Recognition Using Neural Networks
Spotting Customers in Trouble
A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
Word segmentation method for handwritten documents based on structured learning
Holistic Approach for Arabic Word Recognition
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
CHARACTER RECOGNITION USING NEURAL NETWORK WITHOUT FEATURE EXTRACTION FOR KAN...
Segmentation - based Historical Handwritten Word Spotting using document-spec...
Online handwritten script recognition
Performance of Statistics Based Line Segmentation System for Unconstrained H...
Arabic Handwritten Script Recognition Towards Generalization: A Survey
Devanagari Character Recognition
Artificial Neural Network / Hand written character Recognition
Text extraction From Digital image
optical character recognition system
Optical Character Recognition( OCR )
Hand Written Character Recognition Using Neural Networks
Ad

Similar to A simple text ine segmentation method for handwritten documents1 (20)

PDF
50120130406021
PDF
F045053236
PDF
2014_IJCCC_Handwritten Documents Text Line Segmentation based on Information ...
PDF
Text content dependent writer identification
PDF
An effective approach to offline arabic handwriting recognition
PDF
En31919926
PDF
Java Abs Online Handwritten Script Recognition
PDF
Text-Image Separation in Document Images Using Boundary/Perimeter Detection
PDF
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
PDF
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
PDF
OFF-LINE ARABIC HANDWRITTEN WORDS SEGMENTATION USING MORPHOLOGICAL OPERATORS
PDF
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
PDF
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
PDF
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
PDF
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
PDF
Online Hand Written Character Recognition
PDF
Text Extraction System by Eliminating Non-Text Regions
PDF
Handwritten character recognition in
PDF
Design and Implementation Recognition System for Handwritten Hindi/Marathi Do...
PPTX
Automatic handwriting recognition
50120130406021
F045053236
2014_IJCCC_Handwritten Documents Text Line Segmentation based on Information ...
Text content dependent writer identification
An effective approach to offline arabic handwriting recognition
En31919926
Java Abs Online Handwritten Script Recognition
Text-Image Separation in Document Images Using Boundary/Perimeter Detection
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
COHESIVE MULTI-ORIENTED TEXT DETECTION AND RECOGNITION STRUCTURE IN NATURAL S...
OFF-LINE ARABIC HANDWRITTEN WORDS SEGMENTATION USING MORPHOLOGICAL OPERATORS
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators
Online Hand Written Character Recognition
Text Extraction System by Eliminating Non-Text Regions
Handwritten character recognition in
Design and Implementation Recognition System for Handwritten Hindi/Marathi Do...
Automatic handwriting recognition
Ad

Recently uploaded (20)

PDF
Complications of Minimal Access Surgery at WLH
PPTX
GDM (1) (1).pptx small presentation for students
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Basic Mud Logging Guide for educational purpose
PDF
Pre independence Education in Inndia.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Sports Quiz easy sports quiz sports quiz
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
01-Introduction-to-Information-Management.pdf
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Classroom Observation Tools for Teachers
PPTX
Cell Types and Its function , kingdom of life
PDF
Microbial disease of the cardiovascular and lymphatic systems
Complications of Minimal Access Surgery at WLH
GDM (1) (1).pptx small presentation for students
Anesthesia in Laparoscopic Surgery in India
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
Renaissance Architecture: A Journey from Faith to Humanism
Basic Mud Logging Guide for educational purpose
Pre independence Education in Inndia.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Sports Quiz easy sports quiz sports quiz
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Final Presentation General Medicine 03-08-2024.pptx
01-Introduction-to-Information-Management.pdf
Supply Chain Operations Speaking Notes -ICLT Program
2.FourierTransform-ShortQuestionswithAnswers.pdf
O7-L3 Supply Chain Operations - ICLT Program
Classroom Observation Tools for Teachers
Cell Types and Its function , kingdom of life
Microbial disease of the cardiovascular and lymphatic systems

A simple text ine segmentation method for handwritten documents1

M. Ravi Kumar1, R. Pradeep2, B.S. Puneeth Kumar3, and Prasad Babu4
Dept. of P.G. Studies in Computer Science, Kuvempu University, Jnana Sahyadri, Shankaraghatta-577451
E-mail: ravi2142@yahoo.co.in, punithashp@gmail.com, kamitkar.prasad9@gmail.com

Abstract:

Text line segmentation is an important step because inaccurately segmented text lines will cause errors in the recognition stage. The nature of handwriting makes the process of text line segmentation very challenging. Text characteristics can vary in font, size, orientation, alignment, color, contrast, and background information. These variations make word detection complex and difficult, since handwritten text can vary greatly depending on the writer's skills, disposition and cultural background. In this work we propose a method that uses different intensity values to extract the text-lines.

Keywords:

Handwritten document, segmentation, text-lines

1. Introduction:

Segmentation of a document image into its basic entities, namely text lines and words, is considered a non-trivial problem in the field of handwritten document recognition. The difficulties that arise in handwritten documents make the segmentation procedure a challenging task. Different types of difficulties are encountered in text line segmentation and in word segmentation. In text line segmentation, the major difficulties include differences in skew angle between lines on the page or even along the same text line, overlapping words, and adjacent text lines touching. Furthermore, the frequent appearance of accents in many languages (e.g. French, Greek) makes text line segmentation a challenging task. In word segmentation, the difficulties include the appearance of skew and slant in the text line, the existence of punctuation marks along the text line, and the non-uniform spacing of words, which is common in handwritten documents [1].

Text line segmentation from document images is one of the major problems in document image analysis. It provides crucial information for the tasks of text block segmentation, character segmentation and recognition, and text string recognition. Whereas the difficulty of machine-printed document analysis mainly lies in the complex layout structure and degraded image quality, handwritten document analysis is difficult mainly due to the irregularity of layout and character shapes originating from the variability of writing styles. For unconstrained handwritten documents, text line segmentation and character segmentation-recognition are not solved, though enormous efforts have been devoted to them and great advances have been made. Text line segmentation of handwritten documents is much more difficult than that of printed documents. Whereas printed documents have approximately straight and parallel text lines, the lines in handwritten documents are often non-uniformly skewed and curved. Moreover, the spaces between handwritten text lines are often not obvious compared to the spaces between within-line characters, and some text lines may interfere with each other. Therefore, many text line detection techniques, such as projection analysis and K-nearest neighbor connected components (CCs) grouping, are not able to segment handwritten text lines successfully [2].

On the other hand, piece-wise projections are sensitive to character size variation within text lines and to significant gaps between successive words. These occurrences influence the effectiveness of smearing methods too. In such cases, the results of two adjacent zones may be ambiguous, affecting the drawing of text-line separators along the document width. To deal with these problems, the approach in [3] introduces a smooth version of the projection profiles to over-segment each zone into candidate text and gap regions. These regions are then reclassified by applying an HMM formulation that enhances statistics from the whole document page. Starting from the left and moving to the right, separators of consecutive zones are combined considering their proximity and the local foreground density [3].

These piece-wise projection based methods have a few shortcomings: (a) they generate too many potential separating lines, (b) the stripe-width parameter is predefined, (c) text-lines should not have significant skew, and (d) if there is no potential piece-wise line in the first and last stripes, drawing a complete separating line becomes impossible in any of these algorithms. These shortcomings were observed in an experiment conducted on a number of text-pages. Some authors also used skew information for text-line separation. In an unconstrained handwritten text-page, it is very difficult to detect the orientation of each line on the basis of the skew calculated for the entire page. Therefore, these methods may not work properly [4].
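The smoothed projection profile mentioned above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = ink, 0 = background) and uses a plain moving average and a fixed threshold. The function names, the `radius` and `threshold` parameters, and the toy image are hypothetical choices for illustration; the zone-wise processing and HMM reclassification described in [3] are not reproduced here.

```python
def projection_profile(img):
    """Count ink pixels in every row of a binary image (1 = ink)."""
    return [sum(row) for row in img]


def smooth(profile, radius=1):
    """Moving average over a window of up to 2*radius + 1 rows."""
    n = len(profile)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out


def label_runs(profile, threshold):
    """Split rows into ('text' | 'gap', start, end) runs; end is exclusive."""
    runs = []
    for y, value in enumerate(profile):
        kind = "text" if value > threshold else "gap"
        if runs and runs[-1][0] == kind:
            # Extend the current run by one row.
            runs[-1] = (kind, runs[-1][1], y + 1)
        else:
            runs.append((kind, y, y + 1))
    return runs


# Toy page (assumed data): two text lines, each three rows thick.
rows = (["........"] * 2 + ["######.."] * 3 + ["........"] * 3
        + ["#####..."] * 3 + ["........"] * 2)
img = [[1 if c == "#" else 0 for c in row] for row in rows]

profile = projection_profile(img)
runs = label_runs(smooth(profile, radius=1), threshold=1.0)
print(runs)
```

Smoothing widens each text run by roughly `radius` rows on either side, so short dips inside a line are less likely to be reported as gaps, at the cost of narrower gap regions.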
The concept of the Hough transform is employed in the field of document analysis for many purposes such as skew detection, line detection, slant detection and text-line segmentation. The Hough transform has been employed for text-line segmentation in different scripts. A block-based Hough transform, which is a modification of the conventional Hough transform methodology, has been presented. The algorithm partitions the connected component domain into three spatial sub-domains and applies a block-based Hough transform to detect the potential text lines [5].

Many efforts have been devoted to the difficult problem of handwritten text line segmentation. The methods can be roughly categorized into three classes: top-down, bottom-up, and hybrid. Top-down methods partition the document image recursively into text regions, text lines, and words/characters with the assumption of straight lines. Bottom-up methods group small units of the image (pixels, CCs, characters, words, etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a clustering process, which aggregates image components according to proximity and does not rely on the assumption of straight lines. Hybrid methods combine bottom-up grouping and top-down partitioning in different ways. All three approaches have their disadvantages. Top-down methods do not perform well on curved and overlapping text lines. The performance of bottom-up grouping relies on heuristic rules or artificial parameters, such as the between-component distance metric for clustering. Hybrid methods, on the other hand, are computationally complicated, and the design of a robust combination scheme is non-trivial [2].

A thinning operation has also been used by other researchers for text-line segmentation of Japanese and Indian text documents. Thinning algorithms followed by post-processing operations are applied to the entire background region of an input text image to detect the separating borderlines. Recently, some techniques have used level set, active contour and variational Bayes methods for text-line segmentation. Density estimation and the level set method (LSM) were utilized for text-line segmentation: a probability map is estimated from an input document image, where each element represents the probability of the original pixel belonging to a text line, and the LSM is used to determine the boundary evolution of neighboring text lines. In the active contour approach, a matched filter bank is first used to smooth the input text image. The central line of text-line components is then computed using ridges over the smoothed image. Finally, the active contours (snakes) over the ridges are adapted to obtain the text-line segmentation result [4].

2. Related Work:

In this section, we give a brief review of recent work on text line and word segmentation in handwritten document images. As far as we know, the following techniques either achieved the best results on the corresponding test datasets, or are elements of integrated systems for specific tasks. One of the most accurate methods uses piece-wise projection profiles to obtain an initial set of candidate lines and bivariate Gaussian densities to assign overlapping CCs to text lines [7]. Experimental results on a collection of 720 documents (English, Arabic and children's handwriting) show that 97.31% of text lines were segmented correctly. The writers mention that “a more intelligent approach to cut an overlapping component is the goal of future work”. A recent approach [8] uses a block-based Hough transform to detect lines and merging methods to correct false alarms. Although the algorithm achieves a 93.1% detection rate and a 96% recognition rate, it is not flexible enough to follow variation of the skew angle along the same text line and is not very precise in the assignment of accents to text lines. Li et al. [9] treat the text-line detection task as an image segmentation problem. They use a Gaussian window to convert a binary image into a smooth gray-scale image. They then adopt the level set method to evolve text-line boundaries, and finally geometrical constraints are imposed to group CCs or segments as text lines. They report pixel-level hit rates varying from 92% to 98% on different scripts and mention that “the major failures happen because two neighboring text lines touch each other significantly”. A similar method [10] evaluates eight different spatial measures between pairs of CCs to locate words in handwritten postal addresses. The best metric proved to be the one which combines the result of the minimum run-length method and the vertical overlapping of two successive CCs. Additionally, this metric is adjusted by utilizing the results of a punctuation detection algorithm (periods and commas). Then, a suitable threshold is computed by an iterative procedure. The algorithm was tested on 1000 address images and had an error rate of about 10%.

Manmatha and Rothfeder [11] propose a scale space approach that is effective for noisy historical documents. The line image is filtered with an anisotropic Laplacian at several scales in order to produce blobs which correspond to portions of characters at small scales and to words at larger scales. The optimum scale is estimated by three different techniques (line height, page averaging and free search), of which line height showed the best results. Line segmentation in historical documents is a much more challenging task due to a great deal of noise. Feldbach and Tonnies [12] have proposed a bottom-up method for historical church documents that requires parameters to be set according to the type of handwriting. They report a 90% correct segmentation rate for constant parameter values, which rises to 97% for adjusted ones.

Another integrated system for such documents [13] creates a foreground/background transition count map to find probable locations of text lines and applies a min-cut/max-flow algorithm to separate initially connected text lines. The method achieves high accuracy (over 98%) on 20 images of George Washington's manuscript.

3. Segmentation challenges:

In this section we describe the challenges involved in the segmentation of the text-lines. When dealing with handwritten text, line
segmentation has to solve some obstacles that are uncommon in modern printed text [6]. Among the most predominant are:

3.1 Skewed lines: Lines of text in general are not straight. These lines are not parallel to each other.

Figure 3.1: Skewed Lines

3.2 Fluctuating lines: Lines of text are partially or fully connected to other text-lines.

Figure 3.2: Fluctuating Lines

3.3 Line proximity: Small gaps between neighboring text lines will cause touching and overlapping of components, usually words or letters, between lines, and irregularity in geometrical properties of the line, such as line width, height, distance between words and lines, leftmost position, etc.

Figure 3.3: Line proximity
Source: [6]

4. Motivation:

The above challenges motivated us to take up this work.

5. Proposed Method:

In the proposed method our objective is to identify the boundaries of the text-lines, which consists of two steps: (i) generating partial boundary lines and (ii) generating complete boundary lines. Using the partial lines alone, it is very difficult to differentiate between two text-lines, because the partial lines have gaps and are broken. Thus we need to construct a complete line, which acts as a differentiator for identifying the text-lines. In constructing the complete line we consider the most frequent y co-ordinate value in each partial boundary line; these y co-ordinate values are the complete boundaries of the text-lines. In the last step we segment the text-lines by representing each text-line with a different color.

5.1 Generating partial text-line boundary:

The partial text-line boundary is generated by blocking the text-lines using morphological operations such as erosion. Blocking fills the holes and gaps between the words, which helps in drawing the partial line. The lines are drawn at the edge of every blocked text-line. However, these partial lines are not sufficient to differentiate the text-lines. This is shown in figure 5.1.

5.2 Generating complete text-line boundary:

The partial boundary lines obtained in the previous step are broken and therefore not sufficient for segmenting the text-lines, so we need to generate a complete line which is continuous. Using the partial lines, we draw the complete boundary lines with the help of the most frequent vertical points. These y co-ordinate values are used to differentiate between text-lines, and the complete boundary lines help in segmentation. This is shown in figure 5.2.

5.3 Text-line segmentation:

Once the complete boundary lines are drawn, it is easy to segment the text-lines by assigning different values to the characters between the lines. We give a different intensity value to the characters of each text-line lying between two drawn lines. Through the different intensity values of the characters in a text-line, the method recognizes the different lines. The segmentation for different languages is shown in figure 5.3.
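The three steps of Section 5 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the page is a binary list of rows (1 = ink), blocking is approximated by horizontal run-length smearing instead of a true morphological operator, partial boundaries are taken as the bottom edge of each blocked region, and partial boundary points are grouped into lines by a simple tolerance `tol`. All function names, the `gap` and `tol` parameters, and the toy image are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def block_lines(img, gap):
    """Step 5.1 (blocking): fill background runs of at most `gap` pixels
    that lie between two ink pixels in the same row."""
    out = []
    for row in img:
        new = row[:]
        last_ink = None
        for x, v in enumerate(row):
            if v:
                if last_ink is not None and x - last_ink - 1 <= gap:
                    for k in range(last_ink + 1, x):
                        new[k] = 1
                last_ink = x
        out.append(new)
    return out


def partial_boundaries(blocked):
    """Step 5.1 (partial lines): y of every ink pixel with background below."""
    h, w = len(blocked), len(blocked[0])
    ys = []
    for x in range(w):
        for y in range(h):
            below = blocked[y + 1][x] if y + 1 < h else 0
            if blocked[y][x] and not below:
                ys.append(y)
    return ys


def complete_boundaries(ys, tol=1):
    """Step 5.2: group nearby y values and keep the most frequent one per group."""
    counts = Counter(ys)
    boundaries, cluster = [], []
    for y in sorted(counts):
        if cluster and y - cluster[-1] > tol:
            boundaries.append(max(cluster, key=counts.get))
            cluster = []
        cluster.append(y)
    if cluster:
        boundaries.append(max(cluster, key=counts.get))
    return boundaries


def label_lines(img, boundaries):
    """Step 5.3: give ink pixels a line value (1, 2, ...) by their position
    relative to the complete boundaries; background stays 0."""
    return [[bisect_left(boundaries, y) + 1 if v else 0 for v in row]
            for y, row in enumerate(img)]


# Toy page (assumed data): two text lines, each with a two-pixel word gap.
rows = [
    "..........",
    ".##..###..",
    ".##..###..",
    ".##..##...",
    "..........",
    "..........",
    "###..##...",
    "###..##...",
    "..........",
]
img = [[1 if c == "#" else 0 for c in row] for row in rows]

blocked = block_lines(img, gap=2)
boundaries = complete_boundaries(partial_boundaries(blocked))
labels = label_lines(img, boundaries)
print(boundaries)  # [3, 7]
```

Because each complete boundary is a single horizontal y value, any stroke that extends below the most frequent bottom edge would be assigned to the next line, which is consistent with the limitation to text-lines without skew stated in the conclusion.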
Figure 5.1: Illustration of partial text-line boundary detection. (a) A handwritten document. (b) Blocked text-lines. (c) Partial boundary lines.

Figure 5.2: Illustration of complete text-line boundary detection. (a) Complete boundary lines with partial boundary lines. (b) Complete boundary lines without partial boundary lines.
Figure 5.3: Illustration of segmented text-lines. (a) Segmented English document. (b) Segmented Kannada document. (c) Segmented Hindi document. (d) Segmented English document.

6. Block diagram of proposed method:
7. Results and discussion:

We conducted experiments on 500 handwritten documents in four languages (Kannada: 150, English: 200, Hindi: 100, Arabic: 50) and obtained accurate results.

8. Conclusion and future work:

In this work we concentrated mainly on extracting the required text-lines from a given document, and the specified text-lines were obtained accurately. The proposed method works well for segmenting the text-lines of handwritten documents. The method currently works only for fixed word lengths and text-lines without skew; in future work we will improve the results by addressing these limitations.

9. References:

[1] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, Text line and word segmentation of handwritten documents, Pattern Recognition (2008) 3169–3183.
[2] Fei Yin, Cheng-Lin Liu, Handwritten Chinese text line segmentation by clustering with distance metric learning, Pattern Recognition (2009) 3146–3157.
[3] Vassilis Papavassiliou, Themos Stafylakis, et al., Handwritten document image segmentation into textlines and words, Pattern Recognition (2010) 369–377.
[4] Alireza Alaei, Umapada Pal, et al., A new scheme for unconstrained handwritten text-line segmentation, Pattern Recognition (2011) 917–928.
[5] G. Louloudis, B. Gatos, I. Pratikakis, K. Halatsis, A Block-Based Hough Transform Mapping for Text Line Detection in Handwritten Documents.
[6] Ashu Kumar, Simpel Rani Jindal, Galaxy Singla, Line segmentation using contour tracing, 2012, Vol 3.
[7] Z. Razak, K. Zulkiflee, et al., Off-line handwriting text line segmentation: a review, International Journal of Computer Science and Network Security 8 (7) (2008) 12–20.
[8] G. Louloudis, B. Gatos, C. Halatsis, Text line detection in unconstrained handwritten documents using a block-based Hough transform approach, in: Proceedings of International Conference on Document Analysis and Recognition, 2007, pp. 599–603.
[9] Y. Li, Y. Zheng, D. Doermann, S. Jaeger, Script-independent text line segmentation in freestyle handwritten documents, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1313–1329.
[10] G. Seni, E. Cohen, External word segmentation of off-line handwritten text lines, Pattern Recognition 27 (1994) 41–52.
[11] R. Manmatha, J.L. Rothfeder, A scale space approach for automatically segmenting words from historical handwritten documents, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1212–1225.
[12] M. Feldbach, K.D. Tonnies, Line detection and segmentation in historical church registers, in: Proceedings of International Conference on Document Analysis and Recognition, 2001, pp. 743–747.
[13] D.J. Kennard, W.A. Barrett, Separating lines of text in free-form handwritten historical documents, in: Proceedings of International Workshop on Document Image Analysis for Libraries, 2006, pp. 12–23.