Near-Real-time Processing over HBase
Ryan Brush
@ryanbrush
Topics
- The story so far
- Complementing MapReduce with stream-based processing
- Techniques and lessons
- Query and search
- The future
The story so far...
Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
- Processing latency: tens of minutes
Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
- Processing latency: single-digit minutes
Exploring live data
- Novel ways of exploring records
- Pre-computed models matching users' access patterns
- Very fast load times
- Processing latency: seconds or faster
And many others
- Population analytics
- Care coordination
- Personalized health plans

- Data sets growing at hundreds of GBs per day
- Approaching 1 petabyte of total data
- Rate is increasing; expecting multi-petabyte data sets
A trend towards competing needs
- Analyze all data holistically
- Quickly apply incremental updates
MapReduce
- (re-)Process all data
- Move computation to data
- Output is a pure function of the input
- Assumes a static set of input

Stream
- Incremental updates
- Move data to computation
- Needs to clean up outdated state
- Input may be incomplete or out of order

Both processing models are necessary, and the underlying logic must be the same.
A trend towards competing needs

Batch Layer: high latency (minutes or hours to process); move computation to data; years of data; bulk loads
Speed Layer: low latency (seconds to process); move data to computation; hours of data; incremental updates

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
A trend towards competing needs

Batch Layer: Hadoop (MapReduce)
Realtime Layer: Storm (stream-based)
Into the rabbit hole
- A ride through the system
- Techniques and lessons learned along the way
Data ingestion
- Stream data into an HTTPS service
- Content stored as Protocol Buffers
- Mirror the raw data as simply as possible

Example row keys:
/source:1/document:123
/source:2/allergy:345
/source:2/document:456
/source:2/order:234
…
/source:n/prescription:789

(Diagram: Source System 1 … Source System N send data over HTTPS to a Collector Service, which writes to HBase.)
Scan for updates
Process incoming data
- Initially modeled after Google Percolator
- "Notification" records indicate changes
- Scan for notifications

Data Table                 Notification Table
source:1/document:123      source:1/document:123
source:2/allergy:345       source:150/order:71
source:2/document:456
. . .
source:150/order:71
But there's a catch…
- Percolator-style notification records require external coordination
- More infrastructure to build and maintain
- …so let's use HBase's primitives
Scan for updates
Process incoming data
- Consumers scan for items to process
- Atomically claim lease records (CheckAndPut)
- Clear the record and notifications when done
- ~3000 notifications per second per node

Row Key   Qualifiers (lease record and keys of updated items)
split:0   0000_LEASE, source:2/allergy:345, source:150/order:71, …
split:1   0000_LEASE, source:4/problem:78, source:205/document:52, …
. . .
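The scan-and-claim step above can be sketched in a few lines. This is a minimal simulation, not HBase client code: the in-memory `NotificationTable` and its `check_and_put` stand in for an HBase table and its CheckAndPut primitive, and all names are hypothetical.

```python
# Sketch of the notification-scan / lease-claim pattern (hypothetical names).
# HBase's CheckAndPut is modeled as a compare-and-set on an in-memory table:
# a consumer claims a split only if its lease cell is still absent.

class NotificationTable:
    def __init__(self):
        self.rows = {}  # row key -> {qualifier: value}

    def check_and_put(self, row, qualifier, expected, value):
        """Set `qualifier` to `value` iff it currently equals `expected`
        (None means absent) -- the semantics of HBase's CheckAndPut."""
        cells = self.rows.setdefault(row, {})
        if cells.get(qualifier) != expected:
            return False
        cells[qualifier] = value
        return True

    def delete(self, row):
        self.rows.pop(row, None)

def claim_and_process(table, row, consumer_id, process):
    """One consumer's pass over a split: claim the lease, process every
    notification key in the row, then clear the row."""
    if not table.check_and_put(row, "0000_LEASE", None, consumer_id):
        return False  # another consumer holds the lease
    for qualifier in list(table.rows[row]):
        if qualifier != "0000_LEASE":
            process(qualifier)
    table.delete(row)
    return True
```

Because losers of the compare-and-set simply move on to the next split, no external coordinator is needed, which is the point of the slide.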
Advantages
- No additional infrastructure
- Leverages HBase guarantees
- No lost data
- No stranded data due to machine failure
- Robust to volume spikes of tens of millions of records
Downsides
- Weak ordering guarantees
- Processing must be idempotent
- Lots of garbage from deleted cells
- Schedule major compactions!
- Must split to avoid hot regions
- Potentially better options emerging
- Apache Kafka with replication
Measure Everything
- Instrumented the HBase client to see effective performance
- We use Coda Hale's Metrics API and its Graphite reporter
- Revealed the impact of hot HBase regions on clients
The story so far

(Diagram: Source Systems 1 … N send data over HTTPS to the Collector Service, which loads data and notifications into HBase; Incremental Processors scan for updates.)
Into the Storm
- Storm: scalable processing of data in motion
- Complements HBase and Hadoop
- Guaranteed message processing in a distributed environment
- Notifications scanned by a Storm Spout
Processing with Storm

(Diagram: Source Systems 1 … N → HTTPS → Collector Service → Raw Data in HBase → Spout → Bolts → Processed Data, consumed by Apps and Services.)
Challenges of incremental updates
- Incomplete data
- Outdated state
- Difficult to reason about changing state and timing conditions
Handling Incomplete Data

Row Key      Summary Family      Staging Family
document:1   document_summary    page:1  page:2  page:3

- Process (map) components into a staging family as they arrive
  (page:1, then page:3, then page:2 land in the Staging Family out of order)
- Merge (reduce) components into the Summary Family when everything is available
- Many cases need no merge phase; consuming apps simply read all of the components
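The staging-family pattern above can be sketched as two small functions. This is a sketch with hypothetical names and an in-memory row in place of an HBase row; only the map/merge shape mirrors the slides.

```python
# Sketch of the staging-family pattern (hypothetical names and layout).
# Components land in a "staging" map as they arrive, possibly out of order;
# the merge step builds the summary only once every expected part is present.

def stage(row, qualifier, value):
    """Map phase: store one component in the row's staging family."""
    row.setdefault("staging", {})[qualifier] = value

def try_merge(row, expected_parts):
    """Reduce phase: if all expected components have arrived, write the
    merged summary; otherwise leave the row untouched and report False."""
    staging = row.get("staging", {})
    if not expected_parts.issubset(staging):
        return False
    row["summary"] = " ".join(staging[p] for p in sorted(expected_parts))
    return True
```

Retrying `try_merge` on every arrival is safe: it is idempotent, which matters given the weak ordering guarantees described earlier.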
Outdated State

Incoming Data:
Time 0: Alice lives in Chicago
Time 1: Alice lives in New York

Processed Data:
Chicago resident index
New York resident index

- Big Data
  - MapReduce: rebuild processed data
  - Outdated state is simply ignored
- Fast Updates
  - ACID database: simply update Alice's location
- Big and Fast: it gets complicated
Outdated State: Reconcile on Read

(Diagram: a Merge Application reads from both Historical Data (MapReduce output) and Incremental Updates.)

- Akin to Marz's Lambda Architecture
- Data stores optimized for specific workloads
- Keeps processing models independent
- Adds complexity at read time, but simpler overall
- Not available in commodity app stacks
- Probably the best approach when and if higher-level abstractions emerge
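The merge application's read path can be sketched as one function. This is a minimal sketch, assuming both stores hold records carrying an event timestamp; the store shape and field names are hypothetical.

```python
# Sketch of reconcile-on-read (hypothetical stores and record shape).
# Batch (MapReduce) output and incremental updates live in separate stores;
# a read merges the two views, letting the newer event time win.

def read_merged(key, batch_store, incremental_store):
    """Return the reconciled view of `key`: the incremental record if it is
    newer than the batch record, otherwise the batch record."""
    batch = batch_store.get(key)        # e.g. {"ts": 0, "city": "Chicago"}
    delta = incremental_store.get(key)  # e.g. {"ts": 1, "city": "New York"}
    if batch is None:
        return delta
    if delta is None:
        return batch
    return delta if delta["ts"] >= batch["ts"] else batch
```

Each nightly MapReduce run can then discard the incremental store's contents it has absorbed, keeping the two processing models fully independent.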
Outdated State: Reconcile on Write

Incoming Data:
Time 0: Alice lives in Chicago
Time 1: Alice lives in New York

Processed Data:
Chicago resident index
New York resident index

- Keep a history of your incoming data
- When the event at Time 1 occurs, read that history and update both indexes
- Works with many existing data stores
- Adds complexity to processing logic
- Data store must handle MapReduce and realtime loads -- may not be optimal
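The reconcile-on-write path can be sketched as follows. The history map and index layout here are hypothetical stand-ins for the slide's resident indexes; the point is that each write consults history so indexes touched by the old value get corrected too.

```python
# Sketch of reconcile-on-write (hypothetical history and index layout).
# Each update reads the history of prior events so derived indexes touched
# by the *old* value are retracted, not just the new one applied.

def apply_event(person, city, history, indexes):
    """Record the event and update every affected resident index."""
    previous = history.get(person)
    if previous is not None:
        indexes.setdefault(previous, set()).discard(person)  # retract old state
    indexes.setdefault(city, set()).add(person)              # apply new state
    history[person] = city
```

Compared to reconcile-on-read, the complexity moves into the write path, and the same store must now absorb both MapReduce and realtime load.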
Different	
  models,	
  same	
  logic
-Incremental	
  updates	
  like	
  a	
  rolling	
  
MapReduce	
  
-Func(ons	
  are	
  the	
  center	
  of	
  the	
  
universe	
  (not	
  InputFormats	
  or	
  Messages)	
  
-Write	
  logic	
  as	
  pure	
  func8ons,	
  
coordinate	
  with	
  higher	
  libraries	
  
- Storm	
  
- Apache	
  Crunch
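The "same logic, two coordinators" idea above can be sketched like this. The pipeline and record fields are hypothetical; the point is only that the domain logic is a pure function shared unchanged between a batch pass and a streamed update.

```python
# Sketch of sharing pure logic between batch and stream (hypothetical names).
# Only the thin coordination layer differs between a batch run over all
# records and an incremental update applied to one record at a time.

def summarize(record):
    """Pure domain logic, shared by both processing models."""
    return {"id": record["id"], "risk": "high" if record["score"] > 7 else "low"}

def run_batch(records):
    # MapReduce-style: a pure function of the full, static input.
    return [summarize(r) for r in records]

def run_stream(record, state):
    # Stream-style: apply the same function to each incremental update.
    state[record["id"]] = summarize(record)
```

In practice the coordinators would be Crunch and Storm rather than two Python loops, but the invariant is the same: both paths must produce identical results for identical input.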
Getting complicated?
- Incremental logic is complex and error prone
- Use MapReduce as a failsafe

(Diagram: the Storm pipeline as before, with a MapReduce path over the raw data in HBase also producing the Processed Data consumed by Apps and Services.)
Reprocess during uptime
- Deploy new incremental processing logic
- "Older" timestamps produced by MapReduce
- The most recently written cell in HBase need not be the logical newest

Row Key      Document Family
document:1   {doc, ts=50}, {doc, ts=300} (realtime incremental update), {doc, ts=200} (MapReduce output)
document:2   {doc, ts=100}, {doc, ts=200} (MapReduce output)
Completing the Picture

(Diagram: Source Systems 1 … N → HTTPS → Collector Service → Raw Data in HBase → Spout and Bolts → Processed Data, with MapReduce as a failsafe path; Apps, Services, and Search Indexes consume the processed data.)
Building indexes with MapReduce
- A shard per task
- Build the index in Hadoop
- Copy it to the index hosts

(Diagram: each Map Task runs Embedded Solr and produces an Index Shard.)
Pushing incremental updates
- POST new records
- Bursts can overwhelm target hosts
- Consumers must deal with transient failures

(Diagram: a Processor pushes the data stream to Solr Shards, each with a Replica.)
Pulling indexes from HBase
- Custom Solr plugin scans a range of HBase rows
- Time-based scan to get only updates
- Pulls items to index from HBase
- Cleanly recovers from volume spikes and transient failures

(Diagram: each Solr Shard scans its own HBase row range -- person:1 … person:n, person:n + 1 … person:m.)
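The pull model above can be sketched with a watermark-based range scan. This is a sketch, not the Solr plugin itself: the dict-backed store and function names are hypothetical, and the `since_ts` filter stands in for an HBase time-range scan.

```python
# Sketch of pull-based indexing (hypothetical store layout).
# Each indexer periodically scans its own key range for rows written since
# its last pull, analogous to an HBase time-range scan; after a failure or a
# volume spike it simply resumes from the last timestamp it indexed.

def pull_updates(store, key_range, since_ts):
    """Return (rows to index, new high-water mark) for one shard's key range.
    `store` maps row key -> (value, write timestamp)."""
    lo, hi = key_range
    rows = [(key, value, ts) for key, (value, ts) in sorted(store.items())
            if lo <= key < hi and ts > since_ts]
    new_mark = max((ts for _, _, ts in rows), default=since_ts)
    return [(key, value) for key, value, _ in rows], new_mark
```

Because the indexer drives its own pace from the watermark, bursts never overwhelm it -- the recovery property the slide claims over push-based updates.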
A note on schema: simplify it!
- Heterogeneous row keys are efficient but hard to reason about
- Must inspect the row key to know what it is
- Mismatches tools like Pig or Hive

Row Key             Qualifiers
person:1/name       <content>
person:1/address    <content>
person:1/friend:1   <content>
person:1/friend:2   <content>
person:2/name       <content>
…
person:n/name       <content>
person:n/friend:m   <content>
Logical parent per row
- The row is the unit of locality
- Tabular layout is easy to understand
- No lost efficiency for most cases
- "HBase Schema Design" -- Ian Varley at HBaseCon 2012

Row Key    Qualifiers
person:1   name<…>  address:<…>  friend:1:<…>  friend:2:<…>
person:2   name<…>  address:<…>  friend:1:<…>
. . .
person:n   name<…>  address:<…>  friend:1:<…>
The path forward

This pattern has been successful
…but complexity is our biggest enemy

We may be in the assembly language era of big data

Higher-level abstractions for these patterns will emerge

It's going to be fun
Questions?
@ryanbrush
https://engineering.cerner.com
