Text Extraction from Product Images
Rajesh Shreedhar Bhat
Data Scientist @Walmart Labs, Bengaluru
MS CS @ASU, Kaggle Competitions Expert
Agenda
▪ Intro to Text Extraction
▪ Text Detection (TD)
▪ TD Model Architecture
▪ Training data generation
▪ Text Recognition (TR) training data preparation
▪ CRNN-CTC model for TR
▪ Receptive Fields
▪ CTC decoder and loss
▪ TR Training phase
▪ Other Advanced Techniques
▪ Questions?
Introduction: Text Extraction
Text Detection
Image Segmentation - Input & Ground Truth
Ground Truth Label Generation
Text Detection – Model architecture
▪ VGG16-BN as the backbone
▪ The decoder has skip connections, similar to U-Net.
▪ Output:
  ▪ Region score
  ▪ Affinity score (used for grouping characters)
Ref: Baek, Youngmin, et al. "Character Region Awareness for Text Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
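To make the role of the two output maps concrete, here is a minimal sketch of how region and affinity scores could be combined into word boxes: a pixel counts as text if either map exceeds a threshold, and each connected component becomes one box. The function name `word_boxes`, the threshold values and the toy score maps are illustrative assumptions, and the boxes are axis-aligned for simplicity; this is not the talk's code.

```python
from collections import deque


def word_boxes(region, affinity, region_thr=0.4, affinity_thr=0.4):
    """Group character pixels into word boxes.

    region, affinity: 2-D lists of scores in [0, 1] with the same shape.
    Returns inclusive (top, left, bottom, right) boxes, one per word.
    The threshold values are placeholders, not taken from the talk.
    """
    height, width = len(region), len(region[0])
    # A pixel belongs to text if either score map fires there.
    is_text = [
        [region[y][x] > region_thr or affinity[y][x] > affinity_thr
         for x in range(width)]
        for y in range(height)
    ]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not is_text[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one 4-connected component.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and is_text[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 1 x 9 score maps: two characters joined by affinity, then a separate one.
region = [[0.9, 0.9, 0.1, 0.9, 0.9, 0.0, 0.0, 0.8, 0.8]]
affinity = [[0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
print(word_boxes(region, affinity))  # [(0, 0, 0, 4), (0, 7, 0, 8)]
```

Without the affinity map, the gap at column 2 would split the first word into two boxes, which is why the affinity score is described as grouping characters.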
Sample Output
[Figure panels: Region Score, Affinity Score]
Sample Output (continued)
Text Recognition
Text Recognition – Training Data Preparation
▪ SynthText: an image generation engine for building a large annotated dataset.
▪ 15 million images generated with different font styles, sizes, colors and backgrounds, using text from product descriptions and open-source datasets.
▪ Vocabulary: 92 characters, including uppercase and lowercase letters, digits and special symbols.
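As a rough illustration of how a character vocabulary turns text labels into integer sequences for training, consider the sketch below. The character set here is an assumption built from Python's `string` constants; it has 94 characters, not the 92 used in the talk, whose exact symbol list is not given. Reserving one extra index for the CTC blank (placed last) is one common convention, also assumed here.

```python
import string

# Hypothetical vocabulary: letters, digits and ASCII punctuation (94 chars).
VOCAB = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}
BLANK_INDEX = len(VOCAB)  # extra class reserved for the CTC blank


def encode(text):
    """Map a text label to a list of integer class indices."""
    return [CHAR_TO_INDEX[ch] for ch in text]


def decode(indices):
    """Map class indices back to text, skipping the blank index."""
    return "".join(VOCAB[i] for i in indices if i != BLANK_INDEX)


print(encode("Hello"))          # [33, 4, 11, 11, 14]
print(decode(encode("Hello")))  # Hello
```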
Text Recognition – CRNN-CTC model
CNN – Receptive Fields
▪ The receptive field is the region of the input image/space that a particular CNN feature is looking at.
▪ Example: SSD (Single Shot Detector) models use intermediate-layer features in object detection tasks.
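The receptive field of a stack of convolution and pooling layers can be computed with a short recurrence: each layer widens the field by (kernel - 1) times the current spacing between neighbouring features, and its stride multiplies that spacing. The sketch below assumes square kernels and works along one spatial axis; the layer lists are illustrative (a VGG-style prefix), not the talk's exact network.

```python
def receptive_field(layers):
    """Receptive field size along one axis.

    layers: list of (kernel_size, stride) pairs, input side first.
    """
    field, jump = 1, 1  # jump = distance in input pixels between features
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Two 3x3 convs followed by a 2x2 stride-2 pool, applied once and twice.
block = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(block))      # 6
print(receptive_field(block * 2))  # 16
```

Deeper features see a larger part of the input, which is why detectors that use intermediate layers can handle objects of different sizes.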
CNN features to LSTM
▪ Input image: (None, 128 * 32 * 1)
▪ Feature map: (None, 1, 31, 512)
▪ SoftMax probabilities over the vocabulary for every time step (31 time steps)
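The shapes above can be traced with plain arithmetic. The layer stack below is an assumption (a typical CRNN-style layout, not taken from the talk), chosen so that a 128-wide, 32-high grayscale image ends as a 1 x 31 x 512 feature map. All convolutions are assumed to use stride 1, and the names `LAYERS` and `trace_shape` are hypothetical.

```python
# Hypothetical CRNN feature extractor; the talk only states the input and
# output shapes, so every layer below is an assumption.
LAYERS = [
    ("conv", (3, 3), "same", 64),
    ("pool", (2, 2)),
    ("conv", (3, 3), "same", 128),
    ("pool", (2, 2)),
    ("conv", (3, 3), "same", 256),
    ("conv", (3, 3), "same", 256),
    ("pool", (2, 1)),
    ("conv", (3, 3), "same", 512),
    ("conv", (3, 3), "same", 512),
    ("pool", (2, 1)),
    ("conv", (2, 2), "valid", 512),
]


def trace_shape(height, width, channels, layers=LAYERS):
    """Return (height, width, channels) after stride-1 convs and pooling."""
    for layer in layers:
        if layer[0] == "conv":
            _, (kh, kw), padding, filters = layer
            if padding == "valid":
                height, width = height - kh + 1, width - kw + 1
            channels = filters
        else:  # non-overlapping max pooling
            _, (ph, pw) = layer
            height, width = height // ph, width // pw
    return height, width, channels


feature_map = trace_shape(height=32, width=128, channels=1)
print(feature_map)  # (1, 31, 512)
# Dropping the height axis leaves 31 time steps of 512 features for the LSTM.
time_steps, features = feature_map[1], feature_map[2]
```

Pooling only along the height in the later layers keeps the width (and therefore the number of time steps) large enough to cover every character in the image.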
Ground Truth and Output of TR task
[Figure: input image] Ground truth: Hello
Output from the LSTM model for 31 time steps:
Time step   t1  t2  t3  t4  t5  ...  t27  t28  t29  t30  t31
Prediction  H   H   H   e   e   ...  l    o    o    o    o
The ground truth has length 5, which is not equal to the prediction length of 31.
How to calculate the loss?
NER model (one label per time step) – loss: categorical cross entropy
• Do we have labels for every time step of the LSTM model in the CRNN setting?
• Can we use cross-entropy loss?
Answer: No.
Mapping of Input to Output
[Figure: two inputs, each with corresponding text "Hello"]
Can we manually align each character to its location in the audio/image?
Yes, but creating such training data takes a lot of manual effort.
CTC to the rescue
Given only the mapping from image to text, and without aligning each character to its location in the input image, one should be able to train the network.
CTC decode:
▪ Merge repeats
▪ Drop blank character
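The two decode steps can be sketched in a few lines. This assumes the most likely symbol has already been picked at each time step (greedy decoding) and that `-` stands for the blank character; the function name and the sample sequences are illustrative.

```python
from itertools import groupby


def ctc_greedy_decode(symbols, blank="-"):
    """Collapse a per-time-step symbol sequence into the output text."""
    merged = [symbol for symbol, _ in groupby(symbols)]  # merge repeats
    return "".join(s for s in merged if s != blank)      # drop blanks


print(ctc_greedy_decode("HHHee-ll-lloooo"))  # Hello
print(ctc_greedy_decode("HHHeellllloooo"))   # Helo
```

The second call shows why the blank is needed: without a blank between the two runs of `l`, the repeats merge into a single character.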
Connectionist Temporal Classification (CTC) Loss
▪ Ground truth for an image: AB
▪ Vocabulary: { A, B, - }, where - is the blank character
▪ Suppose the LSTM network gives predictions for 3 time steps (SoftMax probabilities over the vocabulary at t1, t2, t3).
▪ Using the CTC decode operation discussed earlier, for which outputs can we say the model is correct?
CTC loss continued
Ground truth: AB
Paths that decode to AB (merge repeats, then drop blank character):
t1  t2  t3
A   B   B
A   A   B
-   A   B
A   -   B
A   B   -
SoftMax probabilities:
     t1   t2   t3
A    0.8  0.7  0.1
B    0.1  0.1  0.8
-    0.1  0.2  0.1
Score for one path: P(AAB) = 0.8 * 0.7 * 0.8, and similarly for the other paths.
Probability of getting the ground truth AB = P(ABB) + P(AAB) + P(-AB) + P(A-B) + P(AB-)
Loss = -log(probability of getting the ground truth)
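The calculation on this slide can be checked by brute force: enumerate every 3-step path over the vocabulary, keep the ones that decode to the ground truth, and sum their scores. This is a sketch for small examples only, since the number of paths grows exponentially with the number of time steps; the function names are illustrative and the natural logarithm is assumed.

```python
import itertools
import math

BLANK = "-"


def collapse(path, blank=BLANK):
    """CTC decode: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return "".join(s for s in merged if s != blank)


def ctc_loss_brute_force(probs, target):
    """probs maps each symbol to its SoftMax probability at t1, t2, ..."""
    symbols = list(probs)
    steps = len(probs[symbols[0]])
    valid_paths, total = [], 0.0
    for path in itertools.product(symbols, repeat=steps):
        if collapse(path) == target:
            valid_paths.append("".join(path))
            total += math.prod(probs[s][t] for t, s in enumerate(path))
    loss = math.inf if total == 0.0 else -math.log(total)
    return valid_paths, total, loss


probs = {
    "A": [0.8, 0.7, 0.1],
    "B": [0.1, 0.1, 0.8],
    "-": [0.1, 0.2, 0.1],
}
paths, probability, loss = ctc_loss_brute_force(probs, "AB")
print(sorted(paths))          # ['-AB', 'A-B', 'AAB', 'AB-', 'ABB']
print(round(probability, 3))  # 0.704
print(round(loss, 3))         # 0.351
```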
CTC loss: perfect match
Ground truth: A
Paths that decode to A (merge repeats, then drop blank character):
t1  t2  t3
A   -   -
-   A   -
-   -   A
-   A   A
A   A   -
A   A   A
(The path A - A is not in this list: the blank separates the two As, so it decodes to AA.)
SoftMax probabilities:
     t1  t2  t3
A    1   0   0
B    0   0   0
-    0   1   1
Score for one path: P(A--) = 1 * 1 * 1 = 1; every other path has score 0.
Probability of getting the ground truth A = P(A--) + P(-A-) + P(--A) + P(-AA) + P(AA-) + P(AAA) = 1
Loss = -log(probability of getting the ground truth) = -log(1) = 0
CTC loss: perfect mismatch
Ground truth: A
Paths that decode to A (merge repeats, then drop blank character):
t1  t2  t3
A   -   -
-   A   -
-   -   A
-   A   A
A   A   -
A   A   A
SoftMax probabilities:
     t1  t2  t3
A    0   0   0
B    1   1   1
-    0   0   0
Score for one path: P(A--) = 0 * 0 * 0 = 0, and similarly for the other paths.
Probability of getting the ground truth A = P(A--) + P(-A-) + P(--A) + P(-AA) + P(AA-) + P(AAA) = 0
Loss = -log(probability of getting the ground truth), which tends to infinity.
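Enumerating paths by hand works for 3 time steps but not for 31. The standard way to get the same probability efficiently is the CTC forward (dynamic programming) recursion over the ground truth with blanks inserted around every character. The sketch below is a plain-Python version of that recursion, not the talk's TensorFlow code; the names are illustrative, and it reproduces the three cases worked through in these slides.

```python
import math

BLANK = "-"


def ctc_probability(probs, target, blank=BLANK):
    """Sum of scores of all paths decoding to target (forward algorithm).

    probs maps each symbol to its SoftMax probability at t1, t2, ...
    """
    # Extended target with blanks around each character: "AB" -> "-A-B-".
    extended = [blank]
    for ch in target:
        extended += [ch, blank]
    size = len(extended)
    steps = len(probs[blank])

    # alpha[s]: probability of all partial paths ending at extended[s].
    alpha = [0.0] * size
    alpha[0] = probs[blank][0]
    if size > 1:
        alpha[1] = probs[extended[1]][0]

    for t in range(1, steps):
        new_alpha = [0.0] * size
        for s, symbol in enumerate(extended):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and symbol != blank and symbol != extended[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new_alpha[s] = total * probs[symbol][t]
        alpha = new_alpha

    # Valid paths end on the last character or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(probs, target):
    p = ctc_probability(probs, target)
    return math.inf if p == 0.0 else -math.log(p)


example = {"A": [0.8, 0.7, 0.1], "B": [0.1, 0.1, 0.8], "-": [0.1, 0.2, 0.1]}
match = {"A": [1, 0, 0], "B": [0, 0, 0], "-": [0, 1, 1]}
mismatch = {"A": [0, 0, 0], "B": [1, 1, 1], "-": [0, 0, 0]}

print(round(ctc_probability(example, "AB"), 3))  # 0.704
print(ctc_loss(match, "A") == 0)                 # True
print(ctc_loss(mismatch, "A"))                   # inf
```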
Model Architecture & CTC loss in TF
Training Phase
▪ 15 million images take ~690 GB when loaded into memory, assuming an average image shape of (128 * 32 * 3) and dtype float32.
▪ Python generators are used to load only a single batch into memory at a time.
▪ Training time is reduced by using workers, max_queue_size and multiprocessing in .fit_generator in Keras.
▪ Training takes ~2 hours per epoch on a single P100 GPU machine; prediction takes ~1 sec for a batch of 2048 images.
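The memory figure and the generator idea can be sketched as follows. The arithmetic uses the shape and dtype stated above and gives about 687 GiB, consistent with the ~690 GB quoted. The generator is a minimal illustration: `batch_generator`, the sample list and the `load_image` callable are hypothetical stand-ins, and a real loader would read and preprocess image files.

```python
import itertools


def dataset_bytes(num_images, shape, bytes_per_value=4):
    """Bytes needed to hold every image in memory (float32 = 4 bytes)."""
    values_per_image = 1
    for dim in shape:
        values_per_image *= dim
    return num_images * values_per_image * bytes_per_value


total = dataset_bytes(15_000_000, (128, 32, 3))
print(f"{total / 2**30:.1f} GiB")  # 686.6 GiB


def batch_generator(samples, batch_size, load_image):
    """Yield (images, labels) batches forever, loading images lazily.

    samples: list of (image_path, label) pairs.
    load_image: hypothetical callable that reads and preprocesses one image.
    """
    while True:
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            images = [load_image(path) for path, _ in batch]
            labels = [label for _, label in batch]
            yield images, labels


# Stand-in loader so the sketch runs without any image files.
samples = [(f"img_{i}.png", f"text{i}") for i in range(5)]
gen = batch_generator(samples, batch_size=2, load_image=lambda path: path.upper())
first_three = list(itertools.islice(gen, 3))
print(first_three[0])  # (['IMG_0.PNG', 'IMG_1.PNG'], ['text0', 'text1'])
```

Only one batch of images exists in memory at a time; the generator loops forever so that the training loop decides how many batches make up an epoch.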
Other Advanced Techniques
▪ Attention-OCR
▪ Spatial Transformer Network, applied before text recognition
Ref: Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
Code + PPT
https://github.com/rajesh-bhat/spark-ai-summit-2020-text-extraction
Questions?
rsbhat@asu.edu
https://www.linkedin.com/in/rajeshshreedhar

Text Extraction from Product Images Using State-of-the-Art Deep Learning Techniques
