Near-Real-time Processing over HBase
Ryan Brush
@ryanbrush
Topics
- The story so far
- Complementing MapReduce with stream-based processing
- Techniques and lessons
- Query and search
- The future
The story so far...
Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
- Processing latency: tens of minutes
Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
- Processing latency: single-digit minutes
Exploring live data
- Novel ways of exploring records
- Pre-computed models matching users' access patterns
- Very fast load times
- Processing latency: seconds or faster
And many others
- Population analytics
- Care coordination
- Personalized health plans

- Data sets growing at hundreds of GBs per day
- Approaching 1 petabyte of total data
- Rate is increasing; expecting multi-petabyte data sets
A trend towards competing needs
- Analyze all data holistically
- Quickly apply incremental updates
MapReduce
- (re-)Process all data
- Move computation to data
- Output is a pure function of the input
- Assumes a static set of input

Stream
- Incremental updates
- Move data to computation
- Needs to clean up outdated state
- Input may be incomplete or out of order

Both processing models are necessary, and the underlying logic must be the same.
A trend towards competing needs

Batch Layer: high latency (minutes or hours to process); move computation to data; years of data; bulk loads
Speed Layer: low latency (seconds to process); move data to computation; hours of data; incremental updates

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
A trend towards competing needs

Batch Layer: Hadoop (MapReduce)
Realtime Layer: Storm (stream-based)
Into the rabbit hole
- A ride through the system
- Techniques and lessons learned along the way
Data ingestion
- Stream data into an HTTPS service
- Content stored as Protocol Buffers
- Mirror the raw data as simply as possible

Example row keys:
/source:1/document:123
/source:2/allergy:345
/source:2/document:456
/source:2/order:234
…
/source:n/prescription:789

(Diagram: Source System 1 … Source System N send data over HTTPS to a Collector Service, which writes to HBase.)
Scan for updates
Process incoming data
- Initially modeled after Google Percolator
- "Notification" records indicate changes
- Scan for notifications

Data Table                 Notification Table
source:1/document:123      source:1/document:123
source:2/allergy:345       source:150/order:71
source:2/document:456
. . .
source:150/order:71
But there's a catch…
- Percolator-style notification records require external coordination
- More infrastructure to build and maintain
- …so let's use HBase's primitives
Scan for updates
Process incoming data
- Consumers scan for items to process
- Atomically claim lease records (CheckAndPut)
- Clear the record and notifications when done
- ~3000 notifications per second per node

Row Key   Qualifiers (lease record and keys of updated items)
split:0   0000_LEASE, source:2/allergy:345, source:150/order:71, …
split:1   0000_LEASE, source:4/problem:78, source:205/document:52, …
. . .
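The scan-and-claim step above can be sketched in a few lines. This is a minimal simulation, not HBase client code: the in-memory `NotificationTable` and its `check_and_put` stand in for an HBase table and its CheckAndPut primitive, and all names are hypothetical.

```python
# Sketch of the notification-scan / lease-claim pattern (hypothetical names).
# HBase's CheckAndPut is modeled as a compare-and-set on an in-memory table:
# a consumer claims a split only if its lease cell is still absent.

class NotificationTable:
    def __init__(self):
        self.rows = {}  # row key -> {qualifier: value}

    def check_and_put(self, row, qualifier, expected, value):
        """Set `qualifier` to `value` iff it currently equals `expected`
        (None means absent) -- the semantics of HBase's CheckAndPut."""
        cells = self.rows.setdefault(row, {})
        if cells.get(qualifier) != expected:
            return False
        cells[qualifier] = value
        return True

    def delete(self, row):
        self.rows.pop(row, None)

def claim_and_process(table, row, consumer_id, process):
    """One consumer's pass over a split: claim the lease, process every
    notification key in the row, then clear the row."""
    if not table.check_and_put(row, "0000_LEASE", None, consumer_id):
        return False  # another consumer holds the lease
    for qualifier in list(table.rows[row]):
        if qualifier != "0000_LEASE":
            process(qualifier)
    table.delete(row)
    return True
```

Because losers of the compare-and-set simply move on to the next split, no external coordinator is needed, which is the point of the slide.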
Advantages
- No additional infrastructure
- Leverages HBase guarantees
- No lost data
- No stranded data due to machine failure
- Robust to volume spikes of tens of millions of records
Downsides
- Weak ordering guarantees
- Processing must be idempotent
- Lots of garbage from deleted cells
- Schedule major compactions!
- Must split to avoid hot regions
- Potentially better options emerging
- Apache Kafka with replication
Measure Everything
- Instrumented the HBase client to see effective performance
- We use Coda Hale's Metrics API and its Graphite reporter
- Revealed the impact of hot HBase regions on clients
The story so far

(Diagram: Source Systems 1 … N send data over HTTPS to the Collector Service, which loads data and notifications into HBase; Incremental Processors scan for updates.)
Into the Storm
- Storm: scalable processing of data in motion
- Complements HBase and Hadoop
- Guaranteed message processing in a distributed environment
- Notifications scanned by a Storm Spout
Processing with Storm

(Diagram: Source Systems 1 … N → HTTPS → Collector Service → Raw Data in HBase → Spout → Bolts → Processed Data, consumed by Apps and Services.)
Challenges of incremental updates
- Incomplete data
- Outdated state
- Difficult to reason about changing state and timing conditions
Handling Incomplete Data

Row Key      Summary Family      Staging Family
document:1   document_summary    page:1  page:2  page:3

- Process (map) components into a staging family as they arrive
  (page:1, then page:3, then page:2 land in the Staging Family out of order)
- Merge (reduce) components into the Summary Family when everything is available
- Many cases need no merge phase; consuming apps simply read all of the components
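The staging-family pattern above can be sketched as two small functions. This is a sketch with hypothetical names and an in-memory row in place of an HBase row; only the map/merge shape mirrors the slides.

```python
# Sketch of the staging-family pattern (hypothetical names and layout).
# Components land in a "staging" map as they arrive, possibly out of order;
# the merge step builds the summary only once every expected part is present.

def stage(row, qualifier, value):
    """Map phase: store one component in the row's staging family."""
    row.setdefault("staging", {})[qualifier] = value

def try_merge(row, expected_parts):
    """Reduce phase: if all expected components have arrived, write the
    merged summary; otherwise leave the row untouched and report False."""
    staging = row.get("staging", {})
    if not expected_parts.issubset(staging):
        return False
    row["summary"] = " ".join(staging[p] for p in sorted(expected_parts))
    return True
```

Retrying `try_merge` on every arrival is safe: it is idempotent, which matters given the weak ordering guarantees described earlier.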
Outdated State

Incoming Data:
Time 0: Alice lives in Chicago
Time 1: Alice lives in New York

Processed Data:
Chicago resident index
New York resident index

- Big Data
  - MapReduce: rebuild processed data
  - Outdated state is simply ignored
- Fast Updates
  - ACID database: simply update Alice's location
- Big and Fast: it gets complicated
Outdated State: Reconcile on Read

(Diagram: a Merge Application reads from both Historical Data (MapReduce output) and Incremental Updates.)

- Akin to Marz's Lambda Architecture
- Data stores optimized for specific workloads
- Keeps processing models independent
- Adds complexity at read time, but simpler overall
- Not available in commodity app stacks
- Probably the best approach when and if higher-level abstractions emerge
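The merge application's read path can be sketched as one function. This is a minimal sketch, assuming both stores hold records carrying an event timestamp; the store shape and field names are hypothetical.

```python
# Sketch of reconcile-on-read (hypothetical stores and record shape).
# Batch (MapReduce) output and incremental updates live in separate stores;
# a read merges the two views, letting the newer event time win.

def read_merged(key, batch_store, incremental_store):
    """Return the reconciled view of `key`: the incremental record if it is
    newer than the batch record, otherwise the batch record."""
    batch = batch_store.get(key)        # e.g. {"ts": 0, "city": "Chicago"}
    delta = incremental_store.get(key)  # e.g. {"ts": 1, "city": "New York"}
    if batch is None:
        return delta
    if delta is None:
        return batch
    return delta if delta["ts"] >= batch["ts"] else batch
```

Each nightly MapReduce run can then discard the incremental store's contents it has absorbed, keeping the two processing models fully independent.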
Outdated State: Reconcile on Write

Incoming Data:
Time 0: Alice lives in Chicago
Time 1: Alice lives in New York

Processed Data:
Chicago resident index
New York resident index

- Keep a history of your incoming data
- When the event at Time 1 occurs, read that history and update both indexes
- Works with many existing data stores
- Adds complexity to processing logic
- Data store must handle MapReduce and realtime loads -- may not be optimal
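The reconcile-on-write path can be sketched as follows. The history map and index layout here are hypothetical stand-ins for the slide's resident indexes; the point is that each write consults history so indexes touched by the old value get corrected too.

```python
# Sketch of reconcile-on-write (hypothetical history and index layout).
# Each update reads the history of prior events so derived indexes touched
# by the *old* value are retracted, not just the new one applied.

def apply_event(person, city, history, indexes):
    """Record the event and update every affected resident index."""
    previous = history.get(person)
    if previous is not None:
        indexes.setdefault(previous, set()).discard(person)  # retract old state
    indexes.setdefault(city, set()).add(person)              # apply new state
    history[person] = city
```

Compared to reconcile-on-read, the complexity moves into the write path, and the same store must now absorb both MapReduce and realtime load.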
Different	
  models,	
  same	
  logic
-Incremental	
  updates	
  like	
  a	
  rolling	
  
MapReduce	
  
-Func(ons	
  are	
  the	
  center	
  of	
  the	
  
universe	
  (not	
  InputFormats	
  or	
  Messages)	
  
-Write	
  logic	
  as	
  pure	
  func8ons,	
  
coordinate	
  with	
  higher	
  libraries	
  
- Storm	
  
- Apache	
  Crunch
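The "same logic, two coordinators" idea above can be sketched like this. The pipeline and record fields are hypothetical; the point is only that the domain logic is a pure function shared unchanged between a batch pass and a streamed update.

```python
# Sketch of sharing pure logic between batch and stream (hypothetical names).
# Only the thin coordination layer differs between a batch run over all
# records and an incremental update applied to one record at a time.

def summarize(record):
    """Pure domain logic, shared by both processing models."""
    return {"id": record["id"], "risk": "high" if record["score"] > 7 else "low"}

def run_batch(records):
    # MapReduce-style: a pure function of the full, static input.
    return [summarize(r) for r in records]

def run_stream(record, state):
    # Stream-style: apply the same function to each incremental update.
    state[record["id"]] = summarize(record)
```

In practice the coordinators would be Crunch and Storm rather than two Python loops, but the invariant is the same: both paths must produce identical results for identical input.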
Getting complicated?
- Incremental logic is complex and error prone
- Use MapReduce as a failsafe

(Diagram: the Storm pipeline as before, with a MapReduce path over the raw data in HBase also producing the Processed Data consumed by Apps and Services.)
Reprocess during uptime
- Deploy new incremental processing logic
- "Older" timestamps produced by MapReduce
- The most recently written cell in HBase need not be the logical newest

Row Key      Document Family
document:1   {doc, ts=50}, {doc, ts=300} (realtime incremental update), {doc, ts=200} (MapReduce output)
document:2   {doc, ts=100}, {doc, ts=200} (MapReduce output)
Completing the Picture

(Diagram: Source Systems 1 … N → HTTPS → Collector Service → Raw Data in HBase → Spout and Bolts → Processed Data, with MapReduce as a failsafe path; Apps, Services, and Search Indexes consume the processed data.)
Building indexes with MapReduce
- A shard per task
- Build the index in Hadoop
- Copy it to the index hosts

(Diagram: each Map Task runs Embedded Solr and produces an Index Shard.)
Pushing incremental updates
- POST new records
- Bursts can overwhelm target hosts
- Consumers must deal with transient failures

(Diagram: a Processor pushes the data stream to Solr Shards, each with a Replica.)
Pulling indexes from HBase
- Custom Solr plugin scans a range of HBase rows
- Time-based scan to get only updates
- Pulls items to index from HBase
- Cleanly recovers from volume spikes and transient failures

(Diagram: each Solr Shard scans its own HBase row range -- person:1 … person:n, person:n + 1 … person:m.)
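The pull model above can be sketched with a watermark-based range scan. This is a sketch, not the Solr plugin itself: the dict-backed store and function names are hypothetical, and the `since_ts` filter stands in for an HBase time-range scan.

```python
# Sketch of pull-based indexing (hypothetical store layout).
# Each indexer periodically scans its own key range for rows written since
# its last pull, analogous to an HBase time-range scan; after a failure or a
# volume spike it simply resumes from the last timestamp it indexed.

def pull_updates(store, key_range, since_ts):
    """Return (rows to index, new high-water mark) for one shard's key range.
    `store` maps row key -> (value, write timestamp)."""
    lo, hi = key_range
    rows = [(key, value, ts) for key, (value, ts) in sorted(store.items())
            if lo <= key < hi and ts > since_ts]
    new_mark = max((ts for _, _, ts in rows), default=since_ts)
    return [(key, value) for key, value, _ in rows], new_mark
```

Because the indexer drives its own pace from the watermark, bursts never overwhelm it -- the recovery property the slide claims over push-based updates.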
A note on schema: simplify it!
- Heterogeneous row keys are efficient but hard to reason about
- Must inspect the row key to know what it is
- Mismatches tools like Pig or Hive

Row Key             Qualifiers
person:1/name       <content>
person:1/address    <content>
person:1/friend:1   <content>
person:1/friend:2   <content>
person:2/name       <content>
…
person:n/name       <content>
person:n/friend:m   <content>
Logical parent per row
- The row is the unit of locality
- Tabular layout is easy to understand
- No lost efficiency for most cases
- "HBase Schema Design" -- Ian Varley at HBaseCon 2012

Row Key    Qualifiers
person:1   name<…>  address:<…>  friend:1:<…>  friend:2:<…>
person:2   name<…>  address:<…>  friend:1:<…>
. . .
person:n   name<…>  address:<…>  friend:1:<…>
The path forward

This pattern has been successful
…but complexity is our biggest enemy

We may be in the assembly language era of big data

Higher-level abstractions for these patterns will emerge

It's going to be fun
Questions?
@ryanbrush
https://engineering.cerner.com
