Scalable OCR With NiFi & Tesseract
Casey Stella & Michael Miklavcic
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Introduction: About the Speakers
- Casey Stella
  - Currently a data scientist on Apache Metron
  - Previously Architect in Hortonworks Professional Services
- Michael Miklavcic
  - Currently an engineer on Apache Metron
  - Previously Architect in Hortonworks Professional Services

OCR At Scale: The Challenge
- Unstructured data is growing aggressively
- Much of this data is in the form of PDF images of text
  - This appears to be the case inside of organizations much more than on the internet
- There is much we can do to extract meaning from this
  - NLP is one of our most mature and rich branches of machine learning
  - Simple textual analysis would be sufficient to have rich insights
- OCR enables us to extract textual information from images in an intelligent way

OCR At Scale: Use-cases in Medicine
- The Problem
  - Radiologists make notes about patients
  - Doctors interpret these notes and make diagnoses based on the radiologists' findings
  - Sometimes, the radiologists find things that are serendipitous or are not definitive
- The Value Proposition
  - Building a data pipeline at scale to analyze radiologist reports and look for indications of missed diagnoses
  - This is the correct place for advanced analytics: in the loop with humans

OCR At Scale: Use-cases in Journalism
- The Problem
  - Journalists are now asked to analyze large volumes of data
  - The Panama Papers alone were 2.6TB of data, much of it in scanned images of pages
  - FOIA requests can quickly outstrip the reading capability of a single person or team
- The Value Proposition
  - Building a scalable data pipeline to extract the text from the data journalists are asked to mine enables more advanced analytics and better reporting
  - This is a tool to enable better journalism

Methodology: OCR
- Conversion
  - Take PDFs and turn them into TIFF files, page-wise
  - Ghostscript via Ghost4J
- Preprocessing
  - Prepare images by enhancing text and cleaning up artifacts
  - Enable cleaner text extraction
  - A preprocessing pipeline using ImageMagick under the hood
- Extraction
  - OCR phase using Tesseract

Image Preprocessing
- ImageMagick is a standard open source library and tool to do rich and robust image processing
- ImageMagick is great :)
  - There is a large and mature community of users
  - It has been around for years and has all the primitives that you could ask for
- ImageMagick is confusing :|
  - Image preprocessing can be a daunting task for the user
  - ImageMagick can be arcane at times

Image Preprocessing
- Community + ImageMagick = Magical
  - People have started making layers on top of ImageMagick to do common tasks aimed at a certain domain
  - Fred Weinhaus did this for text cleaning!
- What we did is port this interface over to Java and expose it as a library
- It currently supports
  - Unrotation (i.e. straightening images)
  - Greyscale
  - Enhance brightness
  - Text Smoothing
  - More!

Preprocessing - Before and After
-g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20

Methodology: Scale
- Apache NiFi is an easy-to-use, highly customizable data processing system firmly integrated with the Hadoop Ecosystem
  - Configurable prioritization, throughput/latency tradeoffs
  - Full data provenance across the pipeline
  - Easy-to-use interface for customizing the pipeline
- Each of the phases in the pipeline becomes a NiFi Processor
  - This allows for a highly customizable tool

NiFi + Hadoop

Pipeline Architecture

Demo

OCR is necessary, but not sufficient
- Providing this kind of utility is a necessary step, but there are missing pieces
- Does not handle human handwriting as of yet
  - Deep learning advances are closing the gap on this
- Even with very good image preprocessing, errors can creep into documents
  - Kerning errors: rn -> m
  - Unresolvable blemishes leading to random noise
- Good error correction can require advanced NLP and can be domain specific
  - See patent #20160019430: “Targeted optical character recognition for medical terminology”

Questions?
All of the software shown in this presentation is open source and located at
https://github.com/mmiklavc/scalable-ocr

Find us on Twitter
  @casey_stella
  @MikeMiklavcic

