IMPACT Research
Image Enhancement, Segmentation, Experimental OCR
Apostolos Antonacopoulos
PRImA Lab, The University of Salford, United Kingdom
www.primaresearch.org
Outline
  • Overview: digitisation workflow
  • Image enhancement
      • Border removal
      • Page curl removal
      • Correction of arbitrary warping
  • Segmentation
      • Recognition-based
      • Standalone
  • Typewritten document OCR
  • Wordspotting
Overview: Digitisation Workflow
Main steps:
  • Scanning
  • Image enhancement
  • Page splitting
  • Border removal
  • Page curl removal
  • Dewarping
  • Layout analysis
  • Segmentation of regions, lines, words and characters
  • Region classification
  • Logical layout analysis
  • OCR (incl. specialist or wordspotting)
  • Post-processing
Textline and Word Segmentation
  • Standalone methods that can be integrated into systems without the need to integrate the FR engine
  • Not based on recognition of characters/words: suitable for documents with non-dictionary words or documents that are not practical to OCR (word spotting)
  • Used in other IMPACT methods:
      • Typewritten OCR
      • Correction of arbitrary warping
      • Word spotting
Hybrid Text Line Segmenter
Hybrid approach based on connected component clustering and projection profiles
Inputs: bitonal image, text regions (PAGE XML), parameters
Output: regions with text lines (PAGE XML)
Steps:
  • Connected component extraction (incl. noise filtering)
  • Group components into line candidates using an efficient data structure
  • Find and split under-segmented lines using local projection profiles
  • Merge small peripheral lines into the appropriate neighbour (e.g. i-dots)
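The steps of the hybrid text line segmenter can be illustrated with a minimal Python sketch. It is not the IMPACT implementation: the function names, the `min_pixels` and `min_height_ratio` parameters, and the greedy vertical-overlap grouping are assumptions made for illustration. The slides do not say which data structure is used for grouping, so a plain linear scan stands in for it, and the splitting of under-segmented lines via local projection profiles is left out.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[int]]          # rows of 0/1 pixels, 1 = foreground (ink)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def connected_components(img: Image, min_pixels: int = 2) -> List[Box]:
    """Bounding boxes of 8-connected components; tiny ones are dropped as noise."""
    h = len(img)
    w = len(img[0]) if img else 0
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom, count = x0, y0, x0 + 1, y0 + 1, 0
            while queue:  # breadth-first flood fill
                x, y = queue.popleft()
                count += 1
                left, top = min(left, x), min(top, y)
                right, bottom = max(right, x + 1), max(bottom, y + 1)
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if count >= min_pixels:  # assumed noise filter: minimum pixel count
                boxes.append((left, top, right, bottom))
    return boxes


def line_span(line: List[Box]) -> Tuple[int, int]:
    """Vertical extent (top, bottom) covered by a line candidate."""
    return min(b[1] for b in line), max(b[3] for b in line)


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Greedy grouping: a box joins the first candidate it overlaps vertically."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):  # sweep left to right
        for line in lines:
            top, bottom = line_span(line)
            if box[1] < bottom and box[3] > top:
                line.append(box)
                break
        else:
            lines.append([box])
    return sorted(lines, key=lambda l: line_span(l)[0])


def merge_small_lines(lines: List[List[Box]],
                      min_height_ratio: float = 0.5) -> List[List[Box]]:
    """Merge candidates much shorter than the median into the nearest tall line."""
    if len(lines) < 2:
        return lines
    heights = sorted(bottom - top for top, bottom in map(line_span, lines))
    median = heights[len(heights) // 2]
    big: List[List[Box]] = []
    small: List[List[Box]] = []
    for line in lines:
        top, bottom = line_span(line)
        target = big if bottom - top >= min_height_ratio * median else small
        target.append(list(line))
    for line in small:
        top, bottom = line_span(line)

        def gap(other: List[Box]) -> int:
            o_top, o_bottom = line_span(other)
            return max(o_top - bottom, top - o_bottom, 0)

        min(big, key=gap).extend(line)
    return big


def segment_text_lines(img: Image) -> List[List[Box]]:
    """Run extraction, grouping and merging in sequence."""
    return merge_small_lines(group_into_lines(connected_components(img)))
```

Calling `segment_text_lines(img)` on a bitonal region returns one list of component boxes per text line. A production version would read regions from and write lines to PAGE XML, which this sketch does not attempt.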
Density Word Segmenter
Adaptive projection-profile based approach using foreground pixel density
Inputs: bitonal image, text regions and lines (PAGE XML), parameters
Output: regions, text lines and words (PAGE XML)
For each text line:
  • Generate the vertical projection profile
  • Find delimiting white spaces using an adaptive threshold based on the density of foreground pixels in the line
  • Group connected components into words
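The per-line procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the IMPACT code: the slides do not give the threshold formula, so the rule `gap_factor * height * (1 - density)` and the `gap_factor` parameter are hypothetical. For brevity, runs of inked columns stand in for connected components.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0/1 pixels, 1 = foreground (ink)


def vertical_profile(line: Image) -> List[int]:
    """Count foreground pixels in each column of a text line image."""
    return [sum(column) for column in zip(*line)]


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return [start, end) column ranges that contain any ink."""
    runs: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_words(line: Image, gap_factor: float = 0.5) -> List[Tuple[int, int]]:
    """Split a text line into words, returned as [start, end) column ranges."""
    profile = vertical_profile(line)
    runs = ink_runs(profile)
    if not runs:
        return []
    height = len(line)
    left, right = runs[0][0], runs[-1][1]
    # Foreground density over the inked extent of the line.
    density = sum(profile) / (height * (right - left))
    # Assumed adaptive rule: sparser lines require a wider gap between words.
    min_gap = max(1.0, gap_factor * height * (1.0 - density))
    words = [runs[0]]
    for start, end in runs[1:]:
        prev_start, prev_end = words[-1]
        if start - prev_end < min_gap:
            words[-1] = (prev_start, end)  # narrow gap: same word
        else:
            words.append((start, end))     # wide gap: word boundary
    return words
```

Because the threshold is derived from each line's own density, the same `gap_factor` can be applied to lines with different letter spacing, which is the motivation for an adaptive rather than a fixed threshold.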
Evaluation
  • Text line ground truth: 25 historical documents (more than 2700 text lines)
      • Results (using the USAL layout evaluation tool): [results figure]
  • Word ground truth: 15 historical documents (more than 14500 words)
      • Results (using the USAL layout evaluation tool): [results figure]
Further Information
  • PRImA: http://www.primaresearch.org
  • IMPACT: http://www.impact-project.eu

IMPACT Final Conference - USAL - Text line and word segmentation
