Tesseract OCR Engine 
• Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. 
• Since HP had independently developed page layout analysis technology that was used in its products (and was therefore not released with the open-source code), Tesseract never needed its own page layout analysis. Tesseract therefore assumes that 
its input is a binary image with optional polygonal text regions defined. 
• Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and 
possibly remain so even now. The first step is a connected component analysis in which outlines of the 
components are stored. This was a computationally expensive design decision at the time, but had a significant 
advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to 
detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine 
able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into 
Blobs. 
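
The nesting inspection can be illustrated with OpenCV's contour hierarchy. This is a sketch of the idea only, not Tesseract's own connected component code: even nesting depth marks an outer (black-on-white) outline, odd depth marks a hole, and a hole that itself contains many letter-sized child outlines is a strong hint of inverse text.

```python
# Illustrative sketch using OpenCV's RETR_TREE hierarchy (not Tesseract's code).
import cv2
import numpy as np

def outline_depths(binary_img: np.ndarray) -> list[int]:
    """Return the nesting depth of every outline in a binarized image."""
    contours, hierarchy = cv2.findContours(
        binary_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    depths = []
    for i in range(len(contours)):
        depth, parent = 0, hierarchy[0][i][3]  # entry 3 is the parent link
        while parent != -1:                    # walk up to the outermost outline
            depth += 1
            parent = hierarchy[0][parent][3]
        depths.append(depth)
    return depths
```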
• Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells. Proportional text is broken into words using definite spaces and fuzzy spaces. 
• Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page. 
• Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run 
over the page, in which words that were not recognized well enough are recognized again. 
• A final phase resolves fuzzy spaces, and checks alternative hypotheses for the x-height to locate small-cap text. 
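
A minimal sketch of the two-pass flow described above. The classifier objects here are hypothetical stand-ins with classify(word) -> (text, confidence) and train(word, text) methods; they are not Tesseract's actual interfaces.

```python
def recognize_page(words, static_clf, adaptive_clf, good_enough=0.9):
    """Two-pass recognition: adapt on pass 1, revisit weak words on pass 2."""
    results = []
    # Pass 1: recognize words in reading order, feeding each satisfactory
    # result to the adaptive classifier as training data.
    for word in words:
        text, conf = static_clf.classify(word)
        a_text, a_conf = adaptive_clf.classify(word)
        if a_conf > conf:
            text, conf = a_text, a_conf
        if conf >= good_enough:
            adaptive_clf.train(word, text)
        results.append([text, conf])
    # Pass 2: adaptation may have come too late to help near the top of the
    # page, so re-run any word that is still not good enough.
    for i, word in enumerate(words):
        if results[i][1] < good_enough:
            text, conf = adaptive_clf.classify(word)
            if conf > results[i][1]:
                results[i] = [text, conf]
    return [text for text, _ in results]
```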
• Fixed Pitch Detection and Chopping 
• Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed-pitch text, Tesseract chops the words into 
characters using the pitch, and disables the chopper and associator on these words for the word recognition step. 
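
A sketch of the fixed-pitch case, assuming the pitch (character cell width in pixels) and the line's left edge have already been estimated; the word is simply cut on the character-cell grid, so no chopper/associator search is needed.

```python
def chop_fixed_pitch(word_left, word_right, line_left, pitch):
    """Return (left, right) pixel bounds for each character cell in a word."""
    # Snap the word's start to the character grid implied by the pitch.
    first_cell = round((word_left - line_left) / pitch)
    cells = []
    left = line_left + first_cell * pitch
    while left < word_right:
        cells.append((int(left), int(min(left + pitch, word_right))))
        left += pitch
    return cells

# chop_fixed_pitch(100, 160, line_left=10, pitch=12.0)
# -> [(106, 118), (118, 130), (130, 142), (142, 154), (154, 160)]
```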
• Proportional Word Finding 
• Non-fixed-pitch or proportional text spacing is a highly non-trivial task. Fig. 3 illustrates some typical problems. The gap between the tens and 
units of ‘11.9%’ is a similar size to the general space, and is certainly larger than the kerned space between ‘erated’ and ‘junk’. There is no 
horizontal gap at all between the bounding boxes of ‘of’ and ‘financial’. Tesseract solves most of these problems by measuring gaps in a limited 
vertical range between the baseline and mean line. Spaces that are close to the threshold at this stage are made fuzzy, so that a final decision 
can be made after word recognition. 
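
The gap classification might be sketched as follows. The threshold and fuzzy margin are illustrative values rather than Tesseract's tuned ones, and the x-extents are assumed to be measured only in the baseline-to-mean-line band, as described above; note that a kerned pair can produce a zero or negative gap.

```python
def classify_gaps(blob_boxes, space_thresh=6.0, fuzzy_margin=2.0):
    """blob_boxes: (left, right) x-extents in reading order, measured in the
    baseline-to-mean-line band. Returns one label per inter-blob gap."""
    labels = []
    for (_, right), (next_left, _) in zip(blob_boxes, blob_boxes[1:]):
        gap = next_left - right              # may be <= 0 for kerned pairs
        if gap > space_thresh + fuzzy_margin:
            labels.append("space")
        elif gap < space_thresh - fuzzy_margin:
            labels.append("no-space")
        else:
            labels.append("fuzzy")           # decided after word recognition
    return labels
```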
• Word Recognition 
• Part of the recognition process for any character recognition engine is to identify how a word should be segmented into characters. The 
initial segmentation output from line finding is classified first. The rest of the word recognition step applies only to non-fixed-pitch text. 
• Line and Word Finding 
• The line finding algorithm is one of the few parts of Tesseract that had previously been published [3]. It is designed so that a skewed page can be 
recognized without having to de-skew, thus avoiding loss of image quality. The key parts of the process are blob filtering and line construction. 
• Assuming that page layout analysis has already provided text regions of a roughly uniform text size, a simple percentile height filter removes drop-caps and vertically 
touching characters. The median height approximates the text size in the region, so it is safe to filter out blobs that are smaller than some fraction of the median height, 
these being most likely punctuation, diacritical marks and noise. 
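
As a sketch, with illustrative fractions (the slide does not give the exact percentile or fractions): blobs far from the median height are set aside rather than discarded, since they are fitted back into the lines later.

```python
import statistics

def filter_blobs(blobs, small_frac=0.5, large_frac=2.0):
    """blobs: list of (blob, height). Returns (kept, set_aside)."""
    median_h = statistics.median(h for _, h in blobs)
    keep = lambda h: small_frac * median_h <= h <= large_frac * median_h
    kept = [b for b in blobs if keep(b[1])]
    set_aside = [b for b in blobs if not keep(b[1])]  # refitted later
    return kept, set_aside
```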
• The filtered blobs are more likely to fit a model of non-overlapping, parallel, but sloping lines. Sorting and processing the blobs by x-coordinate makes it possible to assign 
blobs to a unique text line, while tracking the slope across the page, with greatly reduced danger of assigning to an incorrect text line in the presence of skew. Once the 
filtered blobs have been assigned to lines, a least median of squares fit [4] is used to estimate the baselines, and the filtered-out blobs are fitted back into the appropriate 
lines. The final step of the line creation process merges blobs that overlap by at least half horizontally, putting diacritical marks together with the correct base and correctly 
associating parts of some broken characters. 
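
A simplified sketch of the line-construction idea: process blobs in x order, extrapolate each line's running baseline using its tracked slope, and assign each blob to the closest line, opening a new line when nothing is close enough. The distance threshold and slope smoothing here are illustrative, and the least median of squares refit is omitted.

```python
def assign_to_lines(blobs, max_dist=10.0):
    """blobs: (x, y_bottom) points for filtered blobs. Returns per-line lists."""
    lines = []  # each line: [slope, last_x, last_y, members]
    for x, y in sorted(blobs):                 # left-to-right across the page
        best, best_d = None, max_dist
        for line in lines:
            slope, lx, ly, _ = line
            predicted = ly + slope * (x - lx)  # extrapolate the baseline
            if abs(y - predicted) < best_d:
                best, best_d = line, abs(y - predicted)
        if best is None:
            lines.append([0.0, x, y, [(x, y)]])  # open a new text line
        else:
            slope, lx, ly, members = best
            if x > lx:  # smooth the slope estimate as the line grows
                best[0] = 0.8 * slope + 0.2 * (y - ly) / (x - lx)
            best[1], best[2] = x, y
            members.append((x, y))
    return [line[3] for line in lines]
```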
• Baseline Fitting 
• Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and enabled Tesseract to 
handle pages with curved baselines [5], which are a common artifact in scanning, and not just at book bindings. 
• The baselines are fitted by partitioning the blobs into groups with a reasonably continuous displacement from the original straight baseline. A quadratic spline is fitted to the 
most populous partition (assumed to be the baseline) by a least squares fit. The quadratic spline has the advantage that this calculation is reasonably stable, but the 
disadvantage that discontinuities can arise when multiple spline segments are required. A more traditional cubic spline [6] might work better. 
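
For a single partition, the least squares quadratic fit is a one-liner with numpy; a full spline would fit one such quadratic per segment and join the segments, which is where the discontinuities mentioned above can appear.

```python
import numpy as np

def fit_baseline(partition):
    """partition: (x, y_bottom) points for the most populous blob group.
    Returns coefficients (a, b, c) of the baseline y = a*x**2 + b*x + c."""
    xs = np.array([x for x, _ in partition], dtype=float)
    ys = np.array([y for _, y in partition], dtype=float)
    return np.polyfit(xs, ys, deg=2)  # least-squares quadratic fit
```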
• Fig. 1 shows an example of a line of text with a fitted baseline, descender line, mean line and ascender line. All these lines are “parallel” (the y separation is a constant over 
the entire length) and slightly curved. The ascender line is cyan (prints as light gray) and the black line above it is actually straight. Close inspection shows that the cyan/gray 
line is curved relative to the straight black line above it.
• Chopping Joined Characters 
• While the result from a word (see Word Recognition above) is unsatisfactory, Tesseract attempts to improve the result by chopping 
the blob with the worst confidence from the character classifier. Candidate chop points are found from concave vertices of a polygonal approximation 
[2] of the outline, and may have either another concave vertex opposite, or a line segment. It may take up to 3 pairs of chop points to successfully 
separate joined characters from the ASCII set. 
• Chops are executed in priority order. Any chop that fails to improve the confidence of the result is undone, but not completely discarded, so that 
the chop can be re-used later by the associator if needed. 
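
Finding the candidate chop points reduces to a convexity test on the polygonal outline. The sketch below keeps only the concave vertices; pairing them up and ordering the chops by priority is omitted.

```python
def concave_vertices(polygon):
    """polygon: (x, y) outline vertices in counter-clockwise order.
    Returns the concave (reflex) vertices, the candidate chop points."""
    concave = []
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i - 1]
        bx, by = polygon[i]
        cx, cy = polygon[(i + 1) % n]
        # z-component of the cross product of edges (a->b) and (b->c):
        # negative means a clockwise turn, i.e. a concavity in a CCW outline.
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross < 0:
            concave.append((bx, by))
    return concave
```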
• Associating Broken Characters 
• When the potential chops have been exhausted, if the word is still not good enough, it is given to the associator. The associator makes an A* (best 
first) search of the segmentation graph of possible combinations of the maximally chopped blobs into candidate characters. It does this without 
actually building the segmentation graph, but instead maintains a hash table of visited states. The A* search proceeds by pulling candidate new 
states from a priority queue and evaluating them by classifying unclassified combinations of fragments. 
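
A sketch of the search, with classify() as a hypothetical stand-in returning (text, cost) for a run of adjacent fragments (the cost could be a negative log confidence). The heuristic is omitted for brevity, so this degenerates to uniform-cost best-first search, but the state handling (a priority queue plus a visited table instead of an explicit segmentation graph) matches the description above.

```python
import heapq

def best_segmentation(fragments, classify, max_combine=3):
    """Best-first search over ways of grouping fragments into characters."""
    n = len(fragments)
    queue = [(0.0, 0, "")]   # (cost so far, fragments consumed, text so far)
    best_cost = {}           # visited table: position -> cheapest cost seen
    while queue:
        cost, pos, text = heapq.heappop(queue)
        if pos == n:
            return text, cost            # first goal popped is the cheapest
        if best_cost.get(pos, float("inf")) <= cost:
            continue                     # already reached this state cheaper
        best_cost[pos] = cost
        for k in range(1, min(max_combine, n - pos) + 1):
            char, char_cost = classify(fragments[pos:pos + k])
            heapq.heappush(queue, (cost + char_cost, pos + k, text + char))
    return None, float("inf")
```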
• It may be argued that this fully-chop-then-associate approach is at best inefficient, and at worst liable to miss important chops, and that may well be 
the case. The advantage is that the chop-then-associate scheme simplifies the data structures that would be required to maintain the full 
segmentation graph. 
• When the A* segmentation search was first implemented in about 1989, Tesseract’s accuracy on broken characters was well ahead of the commercial 
engines of the day. Fig. 5 is a typical example. An essential part of that success was the character classifier that could easily recognize broken 
characters.
