Relevance - Deal Personalization and Real Time Big Data Analytics
Prassnitha Sampath
psampath@groupon.com
About Me
•  Lead Engineer working on Real Time Data Infrastructure @ Groupon
•  Graduate of Portland State and Madras University
What are Groupon Deals?
Our Relevance Scenario
[Figure label: Users]
Scaling: Keeping Up With a Changing Business
[Figure: growing number of deals and growing users; year labels 2011, 2012, 2014]
•  100 Million+ subscribers
•  We need to store data such as user click history, email records, service logs, etc. This amounts to billions of data points and terabytes of data
Changing Business: Shift from Email to Mobile
•  Growth in mobile business
•  Reducing dependence on email marketing
•  100 Million+ app downloads
Deal Personalization Infrastructure Use Cases
•  Offline System: Deliver Personalized Emails. Personalize billions of emails for hundreds of millions of users
•  Online System: Deliver Personalized Website & Mobile Experience. Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views
Deal Personalization Infrastructure Use Cases
•  Deliver Personalized Website, Mobile and Email Experience
•  Deal Performance
•  Understand User Behavior
•  Deliver Relevant Experience with High Quality Deals
Earlier System
[Diagram components]
•  Data Pipeline (user logs, email records, user history, etc.)
•  MySQL Store
•  Offline Personalization (Map/Reduce)
•  Online Deal Personalization API
•  Email
Earlier System
[Diagram: Data Pipeline, MySQL Store, Offline Personalization (Map/Reduce), Online Deal Personalization API, Email]
•  Scaling MySQL for data such as user click history and email records was painful unless we sharded the data
•  The data pipeline was not "real time"
Ideal System
[Diagram: Real Time Data Pipeline, Ideal Data Store, Offline Personalization (Map/Reduce), Online Deal Personalization API, Email]
•  Common data store that serves data to both online and offline systems
•  Data store that scales to hundreds of millions of records
•  Data store that works well with our existing Hadoop-based systems
•  Real-time pipeline that scales and can process about 100,000 messages/second
Final Design
[Diagram components]
•  Web Site Logs and Mobile Logs
•  Kafka Message Broker
•  Storm
•  HBase
•  Offline Personalization (Map/Reduce)
•  Online Deal Personalization API
•  Email
Two Challenges With HBase
HBase:
•  How to scale to 100,000 writes/second?
HBase:
•  How to run Map Reduce programs over HBase without affecting read latency?
•  How to batch load data into HBase without affecting read latency?
Final HBase Design
[Diagram labels]
•  Real Time HBase
•  Batch HBase
•  Replication
•  Bulk load data via HFiles
•  Map Reduce over HBase
Leveraging System for Real Time Analytics
Various requirements from relevance algorithms to pre-compute real time analytics for better targeting
Category Level Multidimensional Performance Metrics:
•  How do women in Dublin convert for pizza deals?
Deal Level Performance Metrics:
•  How do women in Dublin convert for a particular pizza deal?
Leveraging System for Real Time Analytics
More Complex Examples
Category Level Multidimensional Performance Metrics:
•  How do women in Dublin from the Dundrum area aged 30-35 convert for New York Style Pizza, when the deal is located within 2 miles and priced between €10-€20?
Deal Level Performance Metrics:
•  How do women in Dublin from the Dundrum area aged 30-35 convert for a particular deal?
Leveraging System for Real Time Analytics
Even More Complex Examples
•  How do women in Dublin from the Dundrum area aged 30-35, who also like activities such as biking and are active customers on our mobile platform, convert when the deal is located within 2 miles and priced between €10-€20?
•  How do women in Dublin from the Dundrum area aged 30-35, who also like activities such as biking and are active customers of Groupon deals on the mobile platform, convert for this particular deal?
Power of Simple Counting
It turns out that all the earlier questions can be answered if we can count the appropriate events in the appropriate buckets.
\[
\text{Conversion rate for pizza deals for women in Dublin} = \frac{\text{No. of purchases by women in Dublin for pizza deals}}{\text{No. of deal impressions by women in Dublin for pizza deals}}
\]
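The counting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `Counter` stands in for the real counter store, and the `record` and `conversion_rate` helpers and the bucket fields (gender, city, category) are hypothetical names, not taken from the talk.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the counter store.
counts = Counter()

def record(event_type, gender, city, category):
    """Increment the counter for one (event type, bucket) pair."""
    counts[(event_type, gender, city, category)] += 1

def conversion_rate(gender, city, category):
    """Purchases divided by impressions for one bucket; None if there are no impressions."""
    impressions = counts[("impression", gender, city, category)]
    purchases = counts[("purchase", gender, city, category)]
    return purchases / impressions if impressions else None

# Example data: 200 impressions and 5 purchases in the same bucket.
for _ in range(200):
    record("impression", "female", "Dublin", "pizza")
for _ in range(5):
    record("purchase", "female", "Dublin", "pizza")

rate = conversion_rate("female", "Dublin", "pizza")
print(rate)  # 0.025
```

Because each question reduces to two counter lookups and a division, the expensive part is keeping the counters up to date, not answering the query.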
  
Real Time Analytics Infrastructure
•  Kafka topic with real-time user events
•  Storm running the analytics topology
•  Real-time infrastructure processing 100,000 requests/second
•  The Storm topology calculates the various dimensions/buckets and updates the appropriate Redis bucket. Redis is sharded on the client side
•  The Redis cluster (Redis 1, Redis 2, ..., Redis N) handles over 3 million events per second and stores over 14 billion unique keys
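Client-side sharding means the writer, not Redis, decides which instance holds a key. The slides do not say which hashing scheme was used, so the following is a minimal sketch that assumes a simple CRC32-modulo scheme; `shard_for` and the key format are hypothetical.

```python
import zlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (CRC32 modulo shard count)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

# Every writer computes the same shard for the same key,
# so no coordination between clients is needed.
shards = [shard_for(f"bucket:{i}", 4) for i in range(1000)]
same = shard_for("bucket:42", 4) == shard_for("bucket:42", 4)
```

A stable hash such as CRC32 is used here because Python's built-in `hash` for strings varies between processes. One known trade-off of plain modulo sharding is that changing the shard count remaps most keys; consistent hashing is a common alternative when shards are added often.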
  
Real Time Analytics Infrastructure - Explained
Kafka topic with real-time user events -> Storm -> Redis shards
Steps performed in Storm, in order:
•  Read user event data from Kafka
•  Find out which buckets the event falls into
•  Increment the event counter for the appropriate bucket in Redis
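The "find the buckets, then increment" steps can be sketched as follows. The talk does not say which dimension combinations were counted, so this sketch assumes every non-empty combination of a small, fixed dimension list; `buckets_for`, `process`, the key format, and the in-memory `Counter` (standing in for Redis) are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def buckets_for(event, dimensions):
    """Return one bucket key for every non-empty combination of the event's dimensions."""
    present = [d for d in dimensions if d in event]
    keys = []
    for size in range(1, len(present) + 1):
        for combo in combinations(present, size):
            keys.append("|".join(f"{d}={event[d]}" for d in combo))
    return keys

counters = Counter()  # in-memory stand-in for the Redis counters

def process(event, dimensions):
    """Find the event's buckets, then bump one counter per bucket."""
    for bucket in buckets_for(event, dimensions):
        counters[(event["type"], bucket)] += 1

DIMENSIONS = ["gender", "city", "category"]
process({"type": "impression", "gender": "female", "city": "Dublin", "category": "pizza"}, DIMENSIONS)
process({"type": "purchase", "gender": "female", "city": "Dublin", "category": "pizza"}, DIMENSIONS)
print(counters[("impression", "gender=female|city=Dublin")])  # 1
```

With n dimensions this produces 2^n - 1 keys per event, which is one reason a counting system like this ends up with a very large number of keys; a real topology would likely restrict the combinations to those the relevance algorithms actually query.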
  
Scaling Challenges - Kafka - Storm
•  Storm was hard to scale. We had to try many combinations to settle on how many bolts of each type are required for steady-state operation and how many workers are needed overall.
•  Use the "topology.max.spout.pending" setting in Storm topologies. We found it very useful for shielding topologies from sudden surges in traffic.
•  Build your entire infrastructure so that duplicate data is tolerated.
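In Storm, "topology.max.spout.pending" caps how many emitted tuples a spout task may have un-acked at once, which is what protects the bolts during a traffic surge. The toy class below only illustrates that throttling behaviour in plain Python; it is not Storm's API, and `BoundedSpout` and its methods are hypothetical names.

```python
class BoundedSpout:
    """Toy spout: stops emitting while too many tuples are un-acked."""

    def __init__(self, source, max_pending):
        self._source = iter(source)
        self.max_pending = max_pending
        self.pending = set()
        self._next_id = 0

    def next_tuple(self):
        """Return (id, message), or None when throttled or the source is exhausted."""
        if len(self.pending) >= self.max_pending:
            return None
        try:
            message = next(self._source)
        except StopIteration:
            return None
        tuple_id = self._next_id
        self._next_id += 1
        self.pending.add(tuple_id)
        return tuple_id, message

    def ack(self, tuple_id):
        """Mark a tuple as fully processed, freeing one pending slot."""
        self.pending.discard(tuple_id)

spout = BoundedSpout(range(10), max_pending=2)
first = spout.next_tuple()
second = spout.next_tuple()
throttled = spout.next_tuple()  # None: two tuples are still pending
spout.ack(first[0])
resumed = spout.next_tuple()    # emits again after the ack
```

The cap turns a burst at the source into a backlog in Kafka rather than an overload inside the topology.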
  
Scaling Challenges - Redis
•  Reduce memory footprint: use hashes. They are very memory efficient compared to normal Redis keys.
•  To support a high rate of write operations, we turned off AOF and turned on RDB backups.
•  Redis was the easiest to scale of all the infrastructure pieces (Kafka, Storm, HBase).
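One common way to apply the "use hashes" advice is to group many counters into a fixed number of Redis hashes instead of giving each counter its own top-level key, since Redis stores small hashes in a compact encoding. The slides do not describe the exact layout, so this is a sketch under assumptions: `NUM_HASHES`, the `counts:` prefix, and the helpers are hypothetical, and a nested dictionary stands in for Redis so the example runs without a server.

```python
import zlib
from collections import defaultdict

NUM_HASHES = 1024  # assumed value; tune so each hash stays small

def hash_slot(counter_key: str) -> tuple[str, str]:
    """Return (hash name, field) so that many counters share one Redis hash."""
    slot = zlib.crc32(counter_key.encode("utf-8")) % NUM_HASHES
    return f"counts:{slot}", counter_key

# In-memory stand-in for Redis hashes.
store = defaultdict(lambda: defaultdict(int))

def increment(counter_key: str, amount: int = 1) -> int:
    """Increment one counter stored as a field inside its hash."""
    name, field = hash_slot(counter_key)
    store[name][field] += amount  # with redis-py this would be client.hincrby(name, field, amount)
    return store[name][field]

increment("impression|gender=female|city=Dublin")
total = increment("impression|gender=female|city=Dublin")
```

The memory saving depends on each hash staying under Redis's configured size thresholds for the compact encoding, so the number of hashes has to grow with the number of counters.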
  
When Small is Big – Bloom Filters
•  Since both Kafka and Storm can send the same data twice, especially at scale, it was important to build downstream infrastructure that can handle duplicate data.
•  However, by its very nature the Analytics Topology (Counting Topology) cannot handle duplicates.
•  Storing individual messages for billions of messages is far too expensive and would take a lot more memory.
•  So we used Bloom filters. At a very small error rate, we could effectively de-dupe data with a very small memory footprint.
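A Bloom filter answers "have I seen this message ID before?" using a fixed-size bit array instead of storing every ID. The sketch below is a minimal, self-contained version for illustration; the sizes, the SHA-256 based double hashing, and the `count_once` helper are assumptions, not details from the talk.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def count_once(seen: BloomFilter, counts: dict, message_id: str, bucket: str) -> bool:
    """Increment the bucket only if the message ID has (probably) not been seen."""
    if message_id in seen:
        return False
    seen.add(message_id)
    counts[bucket] = counts.get(bucket, 0) + 1
    return True

seen = BloomFilter(num_bits=8192, num_hashes=4)
counts = {}
first_time = count_once(seen, counts, "msg-1", "pizza")
replayed = count_once(seen, counts, "msg-1", "pizza")
```

A Bloom filter has no false negatives, so a replayed message is never counted twice. The small error rate shows up only as false positives: occasionally a genuinely new message is mistaken for a duplicate and skipped, which slightly undercounts.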
  
Avoiding Errors - Backups/Recovery Strategy
For a high-volume system that also drives so much revenue for the company, a good backup/recovery strategy is necessary.
•  Redis: RDB backups every few hours. RDB backups are stored in HDFS for later use.
•  HBase: HBase Snapshot functionality is used. Snapshots are taken every few hours.
•  Kafka/Storm: All input into the Kafka topic is stored in HDFS for 30 days, so any hour or day can be replayed from HDFS if necessary.
Monitoring
Overall end-to-end monitoring to test the complete flow of data
Kafka -> Storm -> HBase Pipeline
A crawler crawls the page, and monitoring looks for the corresponding data in HBase
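The end-to-end check described here amounts to generating a known event and then polling the store until it appears or a deadline passes. The following is a rough sketch of that polling step only; `wait_for_record` is a hypothetical helper, and `lookup` stands in for whatever HBase read the real monitor performs.

```python
import time

def wait_for_record(lookup, key, timeout_s=60.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll lookup(key) until it returns a value or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if lookup(key) is not None:
            return True   # the probe event made it through the pipeline
        if clock() >= deadline:
            return False  # raise an alert: data did not arrive in time
        sleep(interval_s)

# Example with a dictionary standing in for the data store.
fake_store = {"probe-1": "row"}
found = wait_for_record(fake_store.get, "probe-1", timeout_s=0)
missing = wait_for_record(fake_store.get, "probe-2", timeout_s=0)
```

Checking only the final store keeps the monitor independent of the individual components: a failure anywhere between the crawler's page view and HBase shows up as a missing record.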
  
psampath@groupon.com
www.groupon.com/techjobs
Questions?
Thank you!
Slides prepared in collaboration with Ameya Kanitkar

Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase - NoSQL matters Dublin 2015
