Scalable OCR with NiFi and Tesseract

Casey Stella & Michael Miklavcic

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Introduction: About the Speakers

• Casey Stella
  – Currently a data scientist on Apache Metron
  – Previously Architect in Hortonworks Professional Services

• Michael Miklavcic
  – Currently an engineer on Apache Metron
  – Previously Architect in Hortonworks Professional Services

OCR At Scale: The Challenge

• Unstructured data is growing aggressively
• Much of this data is in the form of PDF images of text
  – This appears to be the case inside of organizations much more than on the internet
• There is much we can do to extract meaning from this
  – NLP is one of our most mature and rich branches of machine learning
  – Simple textual analysis would be sufficient to have rich insights
• OCR enables us to extract textual information from images in an intelligent way

OCR At Scale: Use-cases in Medicine

• The Problem
  – Radiologists make notes about patients
  – Doctors interpret these notes and make diagnoses based on the radiologists' findings
  – Sometimes, the radiologists find things that are serendipitous or are not definitive
• The Value Proposition
  – Building a data pipeline at scale to analyze radiologist reports and look for indications of missed diagnoses
  – This is the correct place for advanced analytics: in the loop with humans


OCR At Scale: Use-cases in Journalism

• The Problem
  – Journalists are now asked to analyze large volumes of data
  – The Panama Papers alone were 2.6TB of data, much of it in scanned images of pages
  – FOIA requests can quickly outstrip the reading capability of a single person or team
• The Value Proposition
  – Building a scalable data pipeline to extract the text from the data journalists are asked to mine enables more advanced analytics and better reporting
  – This is a tool to enable better journalism


Methodology: OCR

• Conversion
  – Take PDFs and turn them into TIFF files, page-wise
  – Ghostscript via Ghost4J
• Preprocessing
  – Prepare images by enhancing text and cleaning up artifacts
  – Enable cleaner text extraction
  – A preprocessing pipeline using ImageMagick under the hood
• Extraction
  – OCR phase using Tesseract

Image Preprocessing

• ImageMagick is a standard open source library and tool to do rich and robust image processing.
• ImageMagick is great :)
  – There is a large and mature community of users
  – It has been around for years and has all the primitives that you could ask for
• ImageMagick is confusing :|
  – Image preprocessing can be a daunting task for the user
  – ImageMagick can be arcane at times

Image Preprocessing

• Community + ImageMagick = Magical
  – People have started making layers on top of ImageMagick to do common tasks aimed at a certain domain
  – Fred Weinhaus did this for text cleaning!
• We ported this interface to Java and exposed it as a library
• It currently supports
  – Unrotation (i.e. straightening images)
  – Greyscale
  – Enhance brightness
  – Text Smoothing
  – More!

Preprocessing - Before and After

-g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20

Methodology: Scale

• Apache NiFi is an easy-to-use, highly customizable data processing system firmly integrated with the Hadoop Ecosystem
  – Configurable prioritization, throughput/latency tradeoffs
  – Full data provenance across the pipeline
  – Easy to use interface for customizing the pipeline
• Each of the phases in the pipeline becomes a NiFi Processor
  – This allows for a highly customizable tool

OCR is necessary, but not sufficient

• Providing this kind of utility is a necessary step, but there are missing pieces
• Does not handle human handwriting as of yet
  – Deep learning advances are closing the gap on this
• Even with very good image preprocessing, errors can creep into documents
  – Kerning errors: rn -> m
  – Unresolvable blemishes leading to random noise
• Good error correction can require advanced NLP and can be domain specific
  – See patent #20160019430: “Targeted optical character recognition for medical terminology”

Questions?

All of the software shown in this presentation is open source and located at
https://github.com/mmiklavc/scalable-ocr

Find us on Twitter
@casey_stella
@MikeMiklavcic
