BUILDING A NAÏVE OCR SYSTEM
Kaur Alasoo and Jaana Metsamaa
Data Mining (2009 fall)




28 January 2010                           http://kaur.pri.ee/projects/wordocr/
VISION

• Take a snapshot of a page.

• Extract one word from the picture.

• Look up the word on Google, in a dictionary, etc.
REALITY
• To increase the likelihood of success, we limited ourselves to:

 • One font from one book.

 • Only lowercase characters.

 • Pictures taken with one phone.
GETTING THE DATA

44 pictures, 242 words
WORKFLOW

0. Take a picture
1. Convert to grayscale
2. Apply thresholding
3. Cut horizontally
4. Cut vertically
5. Resize to 8x16 px
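
Steps 1 to 4 of the workflow can be sketched in plain Python as follows. This is only an illustration, not the authors' code: the slides do not say which library, grayscale formula, or threshold was used, so the luma weights, the cutoff of 128, and all function names here are assumptions. The cuts are made wherever a row (or column) contains no ink at all.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values."""
    # ITU-R BT.601 luma weights: an assumption, the slides name no formula.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def threshold(gray, cutoff=128):
    """Binarize: pixels darker than cutoff become ink (1), the rest 0."""
    # The cutoff value is hypothetical; the slides do not state one.
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


def runs_of_ink(profile):
    """Return (start, end) pairs for each run of non-zero profile entries."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i                      # a run of ink begins
        elif not count and start is not None:
            runs.append((start, i))        # the run ended at a blank entry
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def cut_horizontally(binary):
    """Split a page into text lines using the ink count of each row."""
    row_profile = [sum(row) for row in binary]
    return [binary[top:bottom] for top, bottom in runs_of_ink(row_profile)]


def cut_vertically(line):
    """Split one text line into glyphs using the ink count of each column."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[left:right] for row in line]
            for left, right in runs_of_ink(col_profile)]
```

Calling `cut_horizontally` on a binarized page and then `cut_vertically` on each resulting line yields one small grid per glyph, provided neighbouring glyphs are separated by at least one blank column.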
FROM PICTURE TO FEATURE VECTOR

255 255 255 0 0 155 255 255
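
A minimal sketch of how a glyph image might become such a vector: resize it to 8x16 pixels and read the pixels row by row, which gives 8 * 16 = 128 features. The nearest-neighbour resampling and the function names are assumptions; the slides do not state how the resize was done, and the intermediate value 155 on the slide suggests the real pipeline kept gray levels rather than strict black and white.

```python
def resize_nearest(glyph, width=8, height=16):
    """Resize a 2-D pixel grid to width x height by nearest-neighbour sampling."""
    src_h, src_w = len(glyph), len(glyph[0])
    # Map each target pixel back to the source pixel that covers it.
    return [[glyph[y * src_h // height][x * src_w // width]
             for x in range(width)]
            for y in range(height)]


def to_feature_vector(glyph):
    """Read pixels row by row, left to right, into one flat list."""
    return [pixel for row in glyph for pixel in row]


# A hypothetical 2x2 glyph, used only to show the shapes involved.
tiny_glyph = [[255, 0],
              [0, 155]]
vector = to_feature_vector(resize_nearest(tiny_glyph))
```

Every glyph ends up as a vector of the same length regardless of its original size, which is what a classifier expecting fixed-length input needs.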
TRAINING THE CLASSIFIER




We used 1343 characters to train an SVM with an RBF kernel.
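
For reference, the RBF kernel measures similarity between two feature vectors as K(x, y) = exp(-gamma * ||x - y||^2). The sketch below only computes that kernel; it is not an SVM trainer, and the slides do not name the SVM library or its parameters. The pixel scaling to [0, 1] and the value of `gamma` are assumptions chosen for illustration.

```python
import math


def scale_pixels(vector):
    """Scale 0-255 gray values to the range [0, 1] (an assumed preprocessing step)."""
    return [value / 255 for value in vector]


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(vectors, gamma):
    """Pairwise RBF similarities; the kind of matrix an SVM solver works with."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]


# Two hypothetical 8-pixel fragments, scaled before comparison.
a = scale_pixels([255, 255, 255, 0, 0, 155, 255, 255])
b = scale_pixels([255, 255, 0, 0, 0, 155, 255, 255])
similarity = rbf_kernel(a, b, gamma=1 / 8)  # gamma = 1 / n_features is an assumption
```

Identical vectors give a similarity of 1, and the value decays toward 0 as the images differ in more pixels; `gamma` controls how quickly.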
RESULTS

• 10-fold cross-validation accuracy with our method: 98.86%

• Character classification accuracy using Tesseract OCR: 91.47%
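
A 10-fold cross-validation accuracy of this kind can be computed roughly as sketched below: split the labelled characters into ten folds, hold each fold out in turn, train on the rest, and count correct predictions. The helper names are hypothetical, and the nearest-neighbour classifier is only a stand-in so the example runs without extra libraries; the authors used an SVM. Whether they pooled predictions (as here) or averaged per-fold accuracies is not stated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in classifier: label each test vector like its closest training vector."""
    def closest_label(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        return train_y[distances.index(min(distances))]
    return [closest_label(x) for x in test_x]


def cross_validate(samples, labels, train_and_predict, k=10, seed=0):
    """Return the fraction of samples predicted correctly while held out."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held_out_set]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(samples)
```

With 1343 characters, ten folds hold 134 or 135 characters each, so every character is used for testing exactly once and for training nine times.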
WHAT WE LEARNED?
SOME LETTERS ARE EXTREMELY RARE

[Bar chart: letter frequency in English vs. our data; vertical axis from 0 to 15; letters ordered e t a o i n s h r d l c u m w f g y p b v k j x q z]
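
A comparison like this one can be reproduced by counting how often each letter occurs among the labelled training characters. The sketch below is an assumption about how such counts might be made, not the authors' script; the function names and the 1% cutoff for "rare" are hypothetical.

```python
from collections import Counter
import string


def letter_percentages(text):
    """Return the share (in percent) of each lowercase ASCII letter in text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters")
    # Counter returns 0 for letters that never occur.
    return {letter: 100 * counts[letter] / total
            for letter in string.ascii_lowercase}


def rare_letters(percentages, min_percent=1.0):
    """List letters whose share falls below min_percent (cutoff is an assumption)."""
    return sorted(letter for letter, share in percentages.items()
                  if share < min_percent)
```

Letters that appear only a handful of times give the classifier very few training examples, which is the practical problem the slide points to.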
NOT ALL FONTS CAN BE EASILY
SEPARATED INTO CHARACTERS
WHAT WE LEARNED?

• Installing libraries is the most difficult thing.

• Do not overly restrict yourself.

• Do not build your own OCR system unless it's absolutely necessary.

• The letters are the easy part.


                                                http://kaur.pri.ee/projects/wordocr/
