Tesseract OCR Engine
Contents 
• Introduction & history of OCR 
• Tesseract architecture & methods 
• Announcing Tesseract 2.00 
• Training Tesseract 
• Future enhancements
A Brief History of OCR 
• What is Optical Character Recognition? 
• A Brief History of OCR 
• OCR predates electronic computers!
A Brief History of OCR 
• 1929 – Digit recognition machine 
• 1953 – Alphanumeric recognition machine 
• 1965 – US Mail sorting 
• 1965 – British banking system 
• 1976 – Kurzweil reading machine 
• 1985 – Hardware-assisted PC software 
• 1988 – Software-only PC software 
• 1994-2000 – Industry consolidation
Tesseract Background 
• Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/AT-1995.pdf)
• Never used in an HP product.
• Open sourced in 2005. Now at http://code.google.com/p/tesseract-ocr
• Highly portable.
Tesseract OCR Architecture 
• Baselines are rarely perfectly straight.
• Text line finding is skew independent; published at ICDAR '95, Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines to account for skew and curl.
• Meanline, ascender and descender lines are a constant displacement from the baseline.
• The critical value is the x-height.
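
The baseline idea above can be illustrated with a small sketch. It is not Tesseract's code: it fits a single least-squares quadratic to the bottom points of the characters on one line rather than a multi-segment quadratic spline, and it assumes y increases upward. The function names, the parameter names (x_height, ascender_height, descender_depth) and the sample values are all hypothetical.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c to (x, y) points.

    Needs at least three points with distinct x values.
    """
    # Power sums that make up the normal equations for the basis [x^2, x, 1].
    s = [sum(x ** k for x, _ in points) for k in range(5)]      # sum of x^0 .. x^4
    t = [sum(y * x ** k for x, y in points) for k in range(3)]  # sum of y*x^0 .. y*x^2
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [rv - factor * cv for rv, cv in zip(m[row], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def line_guides(points, x_height, ascender_height, descender_depth):
    """Return the baseline curve plus guides at constant offsets from it."""
    a, b, c = fit_quadratic(points)

    def baseline(x):
        return a * x * x + b * x + c

    return {
        "baseline": baseline,
        "meanline": lambda x: baseline(x) + x_height,
        "ascender": lambda x: baseline(x) + ascender_height,
        "descender": lambda x: baseline(x) - descender_depth,
    }


# Bottom-of-character points along a slightly skewed, curled line (made-up data).
sample = [(x, 0.01 * x * x + 0.5 * x + 20) for x in range(0, 101, 10)]
guides = line_guides(sample, x_height=12, ascender_height=18, descender_depth=5)
```

Because the other three guides are just offsets of the fitted curve, they follow the same skew and curl as the baseline, which is why the x-height is the one value that has to be estimated well.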
Spaces between words are tricky too 
• Italics, digits and punctuation all create special-case, font-dependent spacing.
• Fully justified text in narrow columns can have vastly varying spacing on different lines.
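
To see why justified text defeats a single global threshold, here is a minimal sketch that picks a separate threshold for each line by splitting that line's gap widths into two clusters. This is an illustration only, not the method Tesseract uses; the function name and the gap values are made up.

```python
def word_gap_flags(gaps):
    """Flag each gap on one text line as a word space (True) or not (False).

    Uses a 1-D two-cluster split of the gap widths, so the threshold
    adapts to how tightly or loosely the line is set.
    """
    lo, hi = min(gaps), max(gaps)
    if lo == hi:
        return [False] * len(gaps)  # all gaps equal: no evidence of word spaces
    for _ in range(20):
        threshold = (lo + hi) / 2
        small = [g for g in gaps if g <= threshold]
        large = [g for g in gaps if g > threshold]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    threshold = (lo + hi) / 2
    return [g > threshold for g in gaps]


tight_line = [2, 3, 2, 9, 2, 3, 10, 2]    # word spaces of 9 and 10 pixels
loose_line = [2, 3, 2, 20, 3, 2, 24, 2]   # same text, stretched by justification
tight_flags = word_gap_flags(tight_line)
loose_flags = word_gap_flags(loose_line)
```

A fixed threshold of, say, 12 pixels would find both word spaces on the loose line and none on the tight one. The per-line split finds them on both, but it still mis-splits a line that contains only one word, and it knows nothing about the italic, digit and punctuation cases listed above.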
Tesseract: Recognize Word
• Outline Approximation 
• Polygonal approximation is a double-edged sword. 
• Noise and some pertinent information are both lost.
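
The trade-off in polygonal approximation can be shown with the Ramer-Douglas-Peucker algorithm, used here as a stand-in because the slides do not say which approximation method Tesseract uses. The outline points and tolerance values are made up for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord between the end points.
    index, farthest = max(
        ((i, point_segment_distance(points[i], points[0], points[-1]))
         for i in range(1, len(points) - 1)),
        key=lambda item: item[1],
    )
    if farthest <= tolerance:
        return [points[0], points[-1]]
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


# A noisy L-shaped stroke: small wobble along both arms, one real corner.
outline = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (3.1, 1), (2.9, 2), (3, 3)]
cleaned = simplify(outline, tolerance=0.5)   # wobble removed, corner kept
too_coarse = simplify(outline, tolerance=3)  # corner removed as well
```

With a tolerance of 0.5 the wobble disappears and the corner survives; with a tolerance of 3 the corner is treated as noise too, which is the loss of pertinent information the slide warns about.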
Tesseract: Features and Matching 
• The static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in the classifier. (This is slow.)
• The adaptive classifier uses the same technique (apart from the normalization method)!
Announcing tesseract-2.00 
• Fully Unicode (UTF-8) capable 
• Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at http://code.google.com/p/tesseract-ocr
• UNLV regression test framework 
• Other minor fixes
Commercial OCR v Tesseract 
• Page layout analysis. 
• More languages. 
• Improve accuracy. 
• Add a UI.
