Tesseract OCR Engine 
• Tesseract is an open-source OCR engine that was developed at HP between 1984 and 1994. 
• Since HP had independently developed page layout analysis technology that was used in its products (and was therefore not released with the open-source code), Tesseract never needed its own page layout analysis. Tesseract therefore assumes that 
its input is a binary image with optional polygonal text regions defined. 
• Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and 
possibly remain so even now. The first step is a connected component analysis in which outlines of the 
components are stored. This was a computationally expensive design decision at the time, but had a significant 
advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to 
detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine 
able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into 
Blobs. 
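
The nesting inspection can be illustrated with OpenCV's contour hierarchy. This is a sketch of the idea only, not Tesseract's own connected component code: even nesting depth marks an outer (black-on-white) outline, odd depth marks a hole, and a hole that itself contains many letter-sized child outlines is a strong hint of inverse text.

```python
# Illustrative sketch using OpenCV's RETR_TREE hierarchy (not Tesseract's code).
import cv2
import numpy as np

def outline_depths(binary_img: np.ndarray) -> list[int]:
    """Return the nesting depth of every outline in a binarized image."""
    contours, hierarchy = cv2.findContours(
        binary_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    depths = []
    for i in range(len(contours)):
        depth, parent = 0, hierarchy[0][i][3]  # entry 3 is the parent link
        while parent != -1:                    # walk up to the outermost outline
            depth += 1
            parent = hierarchy[0][parent][3]
        depths.append(depth)
    return depths
```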
• Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells. Proportional text is broken into words using definite spaces and fuzzy spaces. 
• Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page. 
• Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run 
over the page, in which words that were not recognized well enough are recognized again. 
• A final phase resolves fuzzy spaces, and checks alternative hypotheses for the x-height to locate small-cap text. 
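
A minimal sketch of the two-pass flow described above. The classifier objects here are hypothetical stand-ins with classify(word) -> (text, confidence) and train(word, text) methods; they are not Tesseract's actual interfaces.

```python
def recognize_page(words, static_clf, adaptive_clf, good_enough=0.9):
    """Two-pass recognition: adapt on pass 1, revisit weak words on pass 2."""
    results = []
    # Pass 1: recognize words in reading order, feeding each satisfactory
    # result to the adaptive classifier as training data.
    for word in words:
        text, conf = static_clf.classify(word)
        a_text, a_conf = adaptive_clf.classify(word)
        if a_conf > conf:
            text, conf = a_text, a_conf
        if conf >= good_enough:
            adaptive_clf.train(word, text)
        results.append([text, conf])
    # Pass 2: adaptation may have come too late to help near the top of the
    # page, so re-run any word that is still not good enough.
    for i, word in enumerate(words):
        if results[i][1] < good_enough:
            text, conf = adaptive_clf.classify(word)
            if conf > results[i][1]:
                results[i] = [text, conf]
    return [text for text, _ in results]
```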
• Fixed Pitch Detection and Chopping 
• Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed-pitch text, Tesseract chops the words into 
characters using the pitch, and disables the chopper and associator on these words for the word recognition step. 
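
A sketch of the fixed-pitch case, assuming the pitch (character cell width in pixels) and the line's left edge have already been estimated; the word is simply cut on the character-cell grid, so no chopper/associator search is needed.

```python
def chop_fixed_pitch(word_left, word_right, line_left, pitch):
    """Return (left, right) pixel bounds for each character cell in a word."""
    # Snap the word's start to the character grid implied by the pitch.
    first_cell = round((word_left - line_left) / pitch)
    cells = []
    left = line_left + first_cell * pitch
    while left < word_right:
        cells.append((int(left), int(min(left + pitch, word_right))))
        left += pitch
    return cells

# chop_fixed_pitch(100, 160, line_left=10, pitch=12.0)
# -> [(106, 118), (118, 130), (130, 142), (142, 154), (154, 160)]
```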
• Proportional Word Finding 
• Non-fixed-pitch or proportional text spacing is a highly non-trivial task. Fig. 3 illustrates some typical problems. The gap between the tens and 
units of ‘11.9%’ is a similar size to the general space, and is certainly larger than the kerned space between ‘erated’ and ‘junk’. There is no 
horizontal gap at all between the bounding boxes of ‘of’ and ‘financial’. Tesseract solves most of these problems by measuring gaps in a limited 
vertical range between the baseline and mean line. Spaces that are close to the threshold at this stage are made fuzzy, so that a final decision 
can be made after word recognition. 
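
The gap classification might be sketched as follows. The threshold and fuzzy margin are illustrative values rather than Tesseract's tuned ones, and the x-extents are assumed to be measured only in the baseline-to-mean-line band, as described above; note that a kerned pair can produce a zero or negative gap.

```python
def classify_gaps(blob_boxes, space_thresh=6.0, fuzzy_margin=2.0):
    """blob_boxes: (left, right) x-extents in reading order, measured in the
    baseline-to-mean-line band. Returns one label per inter-blob gap."""
    labels = []
    for (_, right), (next_left, _) in zip(blob_boxes, blob_boxes[1:]):
        gap = next_left - right              # may be <= 0 for kerned pairs
        if gap > space_thresh + fuzzy_margin:
            labels.append("space")
        elif gap < space_thresh - fuzzy_margin:
            labels.append("no-space")
        else:
            labels.append("fuzzy")           # decided after word recognition
    return labels
```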
• Word Recognition 
• Part of the recognition process for any character recognition engine is to identify how a word should be segmented into characters. The 
initial segmentation output from line finding is classified first. The rest of the word recognition step applies only to non-fixed-pitch text. 
• Line and Word Finding 
• The line finding algorithm is one of the few parts of Tesseract that had previously been published [3]. It is designed so that a skewed page can be 
recognized without having to de-skew, thus avoiding loss of image quality. The key parts of the process are blob filtering and line construction. 
• Assuming that page layout analysis has already provided text regions of a roughly uniform text size, a simple percentile height filter removes drop-caps and vertically 
touching characters. The median height approximates the text size in the region, so it is safe to filter out blobs that are smaller than some fraction of the median height, 
these being most likely punctuation, diacritical marks and noise. 
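
As a sketch, with illustrative fractions (the slide does not give the exact percentile or fractions): blobs far from the median height are set aside rather than discarded, since they are fitted back into the lines later.

```python
import statistics

def filter_blobs(blobs, small_frac=0.5, large_frac=2.0):
    """blobs: list of (blob, height). Returns (kept, set_aside)."""
    median_h = statistics.median(h for _, h in blobs)
    keep = lambda h: small_frac * median_h <= h <= large_frac * median_h
    kept = [b for b in blobs if keep(b[1])]
    set_aside = [b for b in blobs if not keep(b[1])]  # refitted later
    return kept, set_aside
```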
• The filtered blobs are more likely to fit a model of non-overlapping, parallel, but sloping lines. Sorting and processing the blobs by x-coordinate makes it possible to assign 
blobs to a unique text line, while tracking the slope across the page, with greatly reduced danger of assigning to an incorrect text line in the presence of skew. Once the 
filtered blobs have been assigned to lines, a least median of squares fit [4] is used to estimate the baselines, and the filtered-out blobs are fitted back into the appropriate 
lines. The final step of the line creation process merges blobs that overlap by at least half horizontally, putting diacritical marks together with the correct base and correctly 
associating parts of some broken characters. 
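
A simplified sketch of the line-construction idea: process blobs in x order, extrapolate each line's running baseline using its tracked slope, and assign each blob to the closest line, opening a new line when nothing is close enough. The distance threshold and slope smoothing here are illustrative, and the least median of squares refit is omitted.

```python
def assign_to_lines(blobs, max_dist=10.0):
    """blobs: (x, y_bottom) points for filtered blobs. Returns per-line lists."""
    lines = []  # each line: [slope, last_x, last_y, members]
    for x, y in sorted(blobs):                 # left-to-right across the page
        best, best_d = None, max_dist
        for line in lines:
            slope, lx, ly, _ = line
            predicted = ly + slope * (x - lx)  # extrapolate the baseline
            if abs(y - predicted) < best_d:
                best, best_d = line, abs(y - predicted)
        if best is None:
            lines.append([0.0, x, y, [(x, y)]])  # open a new text line
        else:
            slope, lx, ly, members = best
            if x > lx:  # smooth the slope estimate as the line grows
                best[0] = 0.8 * slope + 0.2 * (y - ly) / (x - lx)
            best[1], best[2] = x, y
            members.append((x, y))
    return [line[3] for line in lines]
```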
• Baseline Fitting 
• Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and enabled Tesseract to 
handle pages with curved baselines [5], which are a common artifact in scanning, and not just at book bindings. 
• The baselines are fitted by partitioning the blobs into groups with a reasonably continuous displacement from the original straight baseline. A quadratic spline is fitted to the 
most populous partition (assumed to be the baseline) by a least squares fit. The quadratic spline has the advantage that this calculation is reasonably stable, but the 
disadvantage that discontinuities can arise when multiple spline segments are required. A more traditional cubic spline [6] might work better. 
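
For a single partition, the least squares quadratic fit is a one-liner with numpy; a full spline would fit one such quadratic per segment and join the segments, which is where the discontinuities mentioned above can appear.

```python
import numpy as np

def fit_baseline(partition):
    """partition: (x, y_bottom) points for the most populous blob group.
    Returns coefficients (a, b, c) of the baseline y = a*x**2 + b*x + c."""
    xs = np.array([x for x, _ in partition], dtype=float)
    ys = np.array([y for _, y in partition], dtype=float)
    return np.polyfit(xs, ys, deg=2)  # least-squares quadratic fit
```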
• Fig. 1 shows an example of a line of text with a fitted baseline, descender line, mean line and ascender line. All these lines are “parallel” (the y separation is a constant over 
the entire length) and slightly curved. The ascender line is cyan (prints as light gray) and the black line above it is actually straight. Close inspection shows that the cyan/gray 
line is curved relative to the straight black line above it.
• Chopping Joined Characters 
• While the result from a word (see Word Recognition above) is unsatisfactory, Tesseract attempts to improve the result by chopping 
the blob with the worst confidence from the character classifier. Candidate chop points are found from concave vertices of a polygonal approximation 
[2] of the outline, and may have either another concave vertex opposite, or a line segment. It may take up to 3 pairs of chop points to successfully 
separate joined characters from the ASCII set. 
• Chops are executed in priority order. Any chop that fails to improve the confidence of the result is undone, but not completely discarded, so that 
the chop can be re-used later by the associator if needed. 
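
Finding the candidate chop points reduces to a convexity test on the polygonal outline. The sketch below keeps only the concave vertices; pairing them up and ordering the chops by priority is omitted.

```python
def concave_vertices(polygon):
    """polygon: (x, y) outline vertices in counter-clockwise order.
    Returns the concave (reflex) vertices, the candidate chop points."""
    concave = []
    n = len(polygon)
    for i in range(n):
        ax, ay = polygon[i - 1]
        bx, by = polygon[i]
        cx, cy = polygon[(i + 1) % n]
        # z-component of the cross product of edges (a->b) and (b->c):
        # negative means a clockwise turn, i.e. a concavity in a CCW outline.
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross < 0:
            concave.append((bx, by))
    return concave
```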
• Associating Broken Characters 
• When the potential chops have been exhausted, if the word is still not good enough, it is given to the associator. The associator makes an A* (best 
first) search of the segmentation graph of possible combinations of the maximally chopped blobs into candidate characters. It does this without 
actually building the segmentation graph, but instead maintains a hash table of visited states. The A* search proceeds by pulling candidate new 
states from a priority queue and evaluating them by classifying unclassified combinations of fragments. 
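
A sketch of the search, with classify() as a hypothetical stand-in returning (text, cost) for a run of adjacent fragments (the cost could be a negative log confidence). The heuristic is omitted for brevity, so this degenerates to uniform-cost best-first search, but the state handling (a priority queue plus a visited table instead of an explicit segmentation graph) matches the description above.

```python
import heapq

def best_segmentation(fragments, classify, max_combine=3):
    """Best-first search over ways of grouping fragments into characters."""
    n = len(fragments)
    queue = [(0.0, 0, "")]   # (cost so far, fragments consumed, text so far)
    best_cost = {}           # visited table: position -> cheapest cost seen
    while queue:
        cost, pos, text = heapq.heappop(queue)
        if pos == n:
            return text, cost            # first goal popped is the cheapest
        if best_cost.get(pos, float("inf")) <= cost:
            continue                     # already reached this state cheaper
        best_cost[pos] = cost
        for k in range(1, min(max_combine, n - pos) + 1):
            char, char_cost = classify(fragments[pos:pos + k])
            heapq.heappush(queue, (cost + char_cost, pos + k, text + char))
    return None, float("inf")
```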
• It may be argued that this fully-chop-then-associate approach is at best inefficient, and at worst liable to miss important chops, and that may well be 
the case. The advantage is that the chop-then-associate scheme simplifies the data structures that would be required to maintain the full 
segmentation graph. 
• When the A* segmentation search was first implemented in about 1989, Tesseract’s accuracy on broken characters was well ahead of the commercial 
engines of the day. Fig. 5 is a typical example. An essential part of that success was the character classifier that could easily recognize broken 
characters.
