USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar
Ameya Kanitkar – That's me!
• Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA (working on Deal Relevance & Personalization Systems)
ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits
Agenda
• Basics of Hadoop & HBase
• How you can use Hadoop & HBase for big data applications
• Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase
Big Data Application Examples
• Recommendation Systems
• Ad Targeting
• Personalization Systems
• BI / DW
• Log Analysis
• Natural Language Processing
So what is Hadoop?
• General-purpose framework for processing huge amounts of data
• Open source
• Batch / offline oriented
Hadoop – HDFS
• Open source distributed file system
• Stores large files, which can easily be accessed by applications built on top of HDFS
• Data is distributed and replicated over multiple machines
• Linux-style commands, e.g. ls, cp, mv, touchz, etc.
Hadoop – HDFS
• Example:

hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB

(One of the folders from one of our Hadoop clusters)
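As a quick check on the figure above, the conversion can be done by hand. The following is a small Python sketch (not part of the original slides); it assumes that "TB" on the slide means binary terabytes (2^40 bytes), which is the reading that reproduces the quoted 168.

```python
# Convert the byte count reported by `hadoop fs -dus` into terabytes.
# Assumption: the slide's "168 TB" uses binary units (1 TiB = 2**40 bytes).
size_bytes = 185453399927478

size_tib = size_bytes / 2**40   # binary terabytes
size_tb = size_bytes / 10**12   # decimal terabytes, for comparison

print(f"{size_tib:.1f} TiB, {size_tb:.1f} TB")
```

In decimal units the same folder would be reported as roughly 185 TB, so the unit convention matters when comparing against disk vendor capacities.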
Hadoop – Map Reduce
• Application framework built on top of HDFS to process your big data
• Operates on key-value pairs
• Mappers filter and transform input data
• Reducers aggregate mapper output
Example
• Given web logs, calculate the landing page conversion rate for each product
• So basically we need to count how many impressions (and resulting conversions) each product received, and then calculate the conversion rate for each product
Map Reduce Example
Map Phase
• Map 1: Process log file. Output: Key (Product ID), Value (Impression Count)
• Map 2: Process log file. Output: Key (Product ID), Value (Impression Count)
• Map N: Process log file. Output: Key (Product ID), Value (Impression Count)
Reduce Phase
• Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate the conversion rate. Output: Key (Product ID), Value (Conversion Rate)
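The map and reduce phases above can be sketched in a few lines of plain Python. This is only an in-process simulation, not Hadoop code: the log format, the `LOG_SPLITS` sample data, and the function names are all hypothetical. The slide mentions only an impression count as the mapper value; the sketch assumes the mapper also emits purchases, since a conversion rate needs both.

```python
from collections import defaultdict

# Hypothetical log format: (product_id, event), where event is
# "impression" or "purchase". Real web logs would need parsing first.
LOG_SPLITS = [
    [("p1", "impression"), ("p1", "purchase"), ("p2", "impression")],
    [("p1", "impression"), ("p2", "impression"), ("p2", "impression")],
    [("p1", "impression"), ("p1", "impression"), ("p2", "purchase")],
]

def mapper(log_lines):
    """Emit (product_id, (impressions, purchases)) for each log line."""
    for product_id, event in log_lines:
        if event == "impression":
            yield product_id, (1, 0)
        elif event == "purchase":
            yield product_id, (0, 1)

def shuffle(mapped_pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reducer(product_id, values):
    """Sum the counts for one product and compute its conversion rate."""
    impressions = sum(v[0] for v in values)
    purchases = sum(v[1] for v in values)
    rate = purchases / impressions if impressions else 0.0
    return product_id, rate

# Each split is mapped independently (in Hadoop, on different machines).
mapped = [pair for split in LOG_SPLITS for pair in mapper(split)]
conversion_rates = dict(reducer(k, vs) for k, vs in shuffle(mapped).items())
print(conversion_rates)
```

Because each mapper only sees its own split and each reducer only sees one key, both phases can run in parallel across many machines, which is what lets the same logic scale to terabytes of logs.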
Recap
• We just processed terabytes of data and calculated the conversion rate across millions of products.
• Note: This is a batch process only. It takes time, so you cannot start it after someone visits your website.

How about we generate recommendations in a batch process and serve them in real time?
HBase
• Provides real-time random read/write access over HDFS
• Built on Google's Bigtable design
• Open source

This is not an RDBMS, so there are no joins. Access patterns are generally simple, like get(key), put(key, value), etc.
Row       Cf:<qual>    Cf:<qual>    Cf:<qual>    ….
Row 1     Cf1:qual1    Cf1:qual2
Row 11    Cf1:qual2    Cf1:qual22
Row 2     ….           Cf2:qual1    Cf1:qual3
Row N

• Dynamic column names. No need to define columns upfront.
• Both rows and columns are sorted lexicographically, which is why Row 11 comes before Row 2.

Row       Cf:<qual>                                             Cf:<qual>
user1     Cf1:click_history:{actual_clicks_data}                Cf1:purchases:{actual_purchases}
user11    Cf1:purchases:{actual_purchases}
user20    Cf1:mobile_impressions:{actual mobile impressions}    Cf1:purchases:{actual_purchases}

Note: Each row has different columns, so think of this as a hash map rather than as a table with rows and columns.
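The "hash map" view of the data model can be made concrete with a toy in-memory table. This is a plain-Python illustration, not the HBase client API: the class `ToyTable` and its methods are hypothetical names chosen to mirror the get/put access pattern, and sorting Python strings only matches HBase's byte ordering for ASCII keys like the ones used here.

```python
# Toy in-memory model of an HBase-style table: row key -> {column -> value}.
# This is NOT the HBase client API, just a plain-Python illustration.
class ToyTable:
    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write one cell; the column need not have been declared beforehand."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return all cells of a row with columns in sorted order."""
        return dict(sorted(self._rows.get(row_key, {}).items()))

    def scan(self):
        """Yield row keys in lexicographic order, mimicking sorted storage."""
        yield from sorted(self._rows)

table = ToyTable()
table.put("user1", "cf1:click_history", "{actual_clicks_data}")
table.put("user1", "cf1:purchases", "{actual_purchases}")
table.put("user20", "cf1:mobile_impressions", "{actual mobile impressions}")
table.put("user11", "cf1:purchases", "{actual_purchases}")

print(list(table.scan()))        # row keys in sorted order
print(list(table.get("user1")))  # column names present in one row
```

Note that `user11` sorts before `user20` regardless of insertion order, and that each row carries only the columns that were actually written to it.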
Putting it all together
• Web and Mobile traffic
• Store data in HDFS
• Analyze Data (Map Reduce)
• Generate Recommendations (Map Reduce)
• Serve Real Time Requests (HBase)

Do offline analysis in Hadoop, and serve real-time requests with HBase.
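The split between offline analysis and real-time serving can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `PURCHASES` data, the "most frequently bought category" heuristic, and the function names are all hypothetical, and a plain dict stands in for HBase.

```python
# Hypothetical data: purchase history that an offline job would read from HDFS.
PURCHASES = {
    "user1": ["spa", "spa", "pizza"],
    "user2": ["pizza"],
}

def batch_generate_recommendations(purchases):
    """Offline step (Map Reduce in the slides): precompute one result per user."""
    recommendations = {}
    for user_id, categories in purchases.items():
        # Toy heuristic: recommend the user's most frequently bought category.
        recommendations[user_id] = max(set(categories), key=categories.count)
    return recommendations

# The dict stands in for HBase: written by the batch job, read at request time.
serving_store = batch_generate_recommendations(PURCHASES)

def handle_request(user_id, default="popular deals"):
    """Online step: a single key lookup, with no heavy computation per request."""
    return serving_store.get(user_id, default)

print(handle_request("user1"))
print(handle_request("new_user"))
```

The design choice is that all expensive work happens before the request arrives, so request latency depends only on one key lookup.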
Use Case: Deal Relevance & Personalization @ Groupon
What are Groupon Deals?
Our Relevance Scenario
How do we surface relevant deals?
• Deals are perishable (deals expire or sell out)
• No direct user intent (as there is in traditional search advertising)
• Relatively limited user information
• Deals are highly local
Two Sides to the Relevance Problem
• Algorithmic Issues: How to find relevant deals for individual users given a set of optimization criteria
• Scaling Issues: How to handle relevance for all users across multiple delivery platforms
Developing Deal Ranking Algorithms
• Exploring Data
  • Understanding signals, finding patterns
• Building Models/Heuristics
  • Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
• Conduct Experiments
  • Try out ideas on real users and evaluate their effect
Data Infrastructure
Growing Deals
• 2011: 20+
• 2012: 400+
• 2013: 2000+
Growing Users
• 100 Million+ subscribers
• We need to store data such as user click history, email records, service logs, etc. This amounts to billions of data points and terabytes of data.
Deal Personalization Infrastructure Use Cases
• Deliver Personalized Emails (Offline System): Personalize billions of emails for hundreds of millions of users
• Deliver Personalized Website & Mobile Experience (Online System): Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views
Architecture
Offline: Data Pipeline, HBase Offline System, Relevance Map/Reduce, Email
Replication (between the offline and online HBase clusters)
Online: HBase for Online System, Real Time Relevance

• We can now maintain different SLAs for the online and offline systems
• We can tune the HBase clusters differently for the online and offline systems
HBase Schema Design
• Row key: User ID (unique identifier for users)
• Column Family 1: User history and profile information. Overwrite user history and profile info.
• Column Family 2: Email history for users. Append email history for each day as a separate column. (On average each row has over 200 columns.)

• Most of our data access patterns are via "User Key"
• This makes it easy to design the HBase schema
• The actual data is kept in JSON
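The schema above can be sketched with a toy row store. This is an illustration in plain Python rather than HBase client code: the family names `profile` and `email`, the qualifier `data`, the date-based qualifiers, and the sample values are all hypothetical, since the slides do not give the actual column names.

```python
import json

# Toy stand-in for one HBase table: row key (user ID) -> {"family:qualifier": JSON string}.
# Family and qualifier names below ("profile", "email", dates) are hypothetical.
table = {}

def put_profile(user_id, profile):
    """Column family 1: a single cell that is overwritten on every update."""
    table.setdefault(user_id, {})["profile:data"] = json.dumps(profile)

def append_email_history(user_id, day, record):
    """Column family 2: one new column per day, so history accumulates."""
    table.setdefault(user_id, {})[f"email:{day}"] = json.dumps(record)

def get_user(user_id):
    """All access goes through the user key: one lookup returns the whole row."""
    return {col: json.loads(val) for col, val in table.get(user_id, {}).items()}

put_profile("user1", {"city": "Palo Alto"})
put_profile("user1", {"city": "Chicago"})  # overwrites the earlier profile
append_email_history("user1", "2013-09-01", {"deals_sent": ["d1", "d2"]})
append_email_history("user1", "2013-09-02", {"deals_sent": ["d3"]})

row = get_user("user1")
print(sorted(row))
```

Keeping each day's email history in its own column is what makes rows grow wide over time, while the profile cell stays a single, current value.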
Cluster Sizing
Hadoop + HBase Cluster
• 100+ machine Hadoop cluster that runs heavy Map Reduce jobs
• The same cluster also hosts a 15-node HBase cluster
HBase Replication (between the two clusters)
Online HBase Cluster
• 10-machine dedicated HBase cluster to serve the real-time SLA
Machine Profile
• 96 GB RAM (HBase 25 GB)
• 24 virtual CPU cores
• 8 × 2 TB disks
Data Profile
• 100 Million+ records
• 2 TB+ data
• Over 4.2 billion data points
Questions?

Thank You!
(We are hiring!)
www.groupon.com/techjobs

Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email Experience for Millions of Users


Editor's Notes

  • #21: The relevance problem can coarsely be divided into two conceptual parts: algorithmic aspects and scale-related issues. We’ll start on the algorithmic side of things.