Tesseract OCR Engine

Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements

A Brief History of OCR
• What is Optical Character Recognition?
• A Brief History of OCR
• OCR predates electronic computers!

A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation

Tesseract Background
• Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS in the 1995 UNLV test (see http://www.isri.unlv.edu/downloads/AT-1995.pdf).
• Never used in an HP product.
• Open sourced in 2005. Now at http://code.google.com/p/tesseract-ocr
• Highly portable.

Tesseract OCR Architecture
• Baselines are rarely perfectly straight
• Text line finding is skew independent; published at ICDAR’95, Montreal (http://scholar.google.com/scholar?q=skew+detection+smith).
• Baselines are approximated by quadratic splines to account for skew and curl.
• Meanline, ascender and descender lines are a constant displacement from the baseline.
• The critical value is the x-height.
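
The baseline model above can be sketched in a few lines. This is a minimal illustration, not Tesseract's code: it fits a single quadratic segment by least squares (a real spline would join several such segments), and the names fit_quadratic and text_lines_at, the image coordinate convention (y grows downward) and the offset parameters are all assumptions made for the example.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c to (x, y) baseline samples.

    Needs at least three distinct x values; otherwise the system is singular.
    """
    s = [sum(x ** k for x, _ in points) for k in range(5)]      # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in points) for k in range(3)]  # t[k] = sum of y*x^k
    # Normal equations for the unknowns (a, b, c), as an augmented 3x4 matrix.
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        acc = m[i][n] - sum(m[i][c] * coeffs[c] for c in range(i + 1, n))
        coeffs[i] = acc / m[i][i]
    return tuple(coeffs)  # (a, b, c)


def text_lines_at(x, coeffs, x_height, ascender_rise, descender_drop):
    """Return the four reference lines at position x.

    Every line is the fitted baseline shifted by a constant offset.
    Image coordinates are assumed, so "above the baseline" means smaller y.
    """
    a, b, c = coeffs
    baseline = a * x * x + b * x + c
    return {
        "baseline": baseline,
        "meanline": baseline - x_height,
        "ascender": baseline - ascender_rise,
        "descender": baseline + descender_drop,
    }
```

Because the other three lines are constant offsets from the baseline, one curve fit per text line is enough to place all four, which is why the x-height is the value that matters most.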

Spaces between words are tricky too
• Italics, digits and punctuation all create special-case, font-dependent spacing.
• Fully justified text in narrow columns can have vastly varying spacing on different lines.
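
To make the justified-text problem concrete, here is a deliberately naive sketch that picks a separate word-break threshold for every line instead of one global value. It is not Tesseract's method; the function names and the "largest jump between sorted gaps" rule are assumptions chosen only for illustration.

```python
def word_break_threshold(gaps):
    """Pick a threshold in the middle of the largest jump between sorted gaps.

    Returns None when all gaps are equal (or there are fewer than two gaps),
    in which case the line is treated as a single word.
    """
    ordered = sorted(gaps)
    best_jump, threshold = 0, None
    for small, large in zip(ordered, ordered[1:]):
        if large - small > best_jump:
            best_jump = large - small
            threshold = (small + large) / 2
    return threshold


def split_words(boxes):
    """Group character boxes (left, right) from ONE text line into words."""
    if not boxes:
        return []
    boxes = sorted(boxes)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(boxes, boxes[1:])]
    threshold = word_break_threshold(gaps)
    words = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if threshold is not None and gap > threshold:
            words.append([box])   # wide gap: start a new word
        else:
            words[-1].append(box)  # narrow gap: same word
    return words
```

The per-line threshold copes with a tightly set line and a stretched line in the same column, but it still fails in the cases the slide lists: a line holding a single word with uneven kerning would be split wrongly, which is the sense in which spacing is tricky.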

Tesseract: Recognize Word

Outline Approximation
• Polygonal approximation is a double-edged sword.
• Noise and some pertinent information are both lost.
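
The trade-off can be seen with any polygonal simplifier. The slides do not say which algorithm Tesseract uses, so the sketch below substitutes the well-known Ramer-Douglas-Peucker method purely as a stand-in; the function names and the tolerance values are assumptions.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    distances = [
        point_segment_distance(p, points[0], points[-1]) for p in points[1:-1]
    ]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        # Every interior point is close enough: keep only the end points.
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in `points`
    left = simplify(points[: split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


outline = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (3, 3)]
cleaned = simplify(outline, 0.5)   # noise removed, corner kept
too_coarse = simplify(outline, 5)  # corner removed as well
```

With a small tolerance the jitter along the bottom edge disappears and the corner at (3, 0) survives; with a large tolerance the corner is discarded too, which is the "pertinent information" side of the sword.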

Tesseract: Features and Matching
• The static classifier uses outline fragments as features. Broken characters are easily recognized by a small-to-large matching process in the classifier. (This is slow.)
• The adaptive classifier uses the same technique, apart from the normalization method.
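
One way to read "small-to-large" is that many short fragments from the unknown character are each matched against a few longer prototype segments. The sketch below follows that reading under stated assumptions: the feature layout (x, y, direction), the cost function, the weights and the threshold are all hypothetical and are not taken from Tesseract.

```python
import math


def _point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def match_cost(feature, prototype, angle_weight=2.0):
    """Cost of explaining one small feature by one large prototype segment."""
    x, y, theta = feature
    a, b = prototype
    proto_theta = math.atan2(b[1] - a[1], b[0] - a[0])
    # Directions are treated as undirected, so compare angles modulo pi.
    diff = abs(theta - proto_theta) % math.pi
    diff = min(diff, math.pi - diff)
    return _point_segment_distance((x, y), a, b) + angle_weight * diff


def match_score(features, prototypes, max_cost=1.0):
    """Fraction of small features that some prototype explains well."""
    if not features:
        return 0.0
    hits = 0
    for feature in features:
        best = min(match_cost(feature, proto) for proto in prototypes)
        if best <= max_cost:
            hits += 1
    return hits / len(features)


# Hypothetical prototype for an "L": one vertical and one horizontal stroke.
L_PROTOTYPE = [((0, 0), (0, 10)), ((0, 10), (6, 10))]
whole = [(0, y, math.pi / 2) for y in (1, 3, 5, 7, 9)] + [(x, 10, 0.0) for x in (1, 3, 5)]
broken = [f for f in whole if f not in ((0, 5, math.pi / 2), (3, 10, 0.0))]
```

In this toy setup a break in the outline only removes a few small features, so the remaining ones still match the same prototype segments. The nested loop over every feature and every prototype also hints at why such matching is slow.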

Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and a documented training process at http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
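
As a rough usage sketch, the trained languages can be selected from a script by calling the tesseract command-line tool with its -l option. The helper names below are made up for the example, and treating the lowercased codes from the slide as the accepted -l values is an assumption, as is the UTF-8 text file written next to the output base name.

```python
import shutil
import subprocess

# Codes taken from the slide, lowercased; assumed to be valid -l values.
TRAINED_LANGUAGES = ("eng", "fra", "ita", "deu", "spa", "nld")


def build_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    if lang not in TRAINED_LANGUAGES:
        raise ValueError(f"no trained data listed for language {lang!r}")
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run the tesseract executable and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    # The tool is assumed to write its result to <output_base>.txt as UTF-8.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, run_ocr("page.tif", "page_out", lang="deu") would recognize a German page, provided the executable and its language data are installed.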

Commercial OCR vs. Tesseract: Future Enhancements
• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.
