Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email Experience for Millions of Users

USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar

Ameya Kanitkar - That's me!
• Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA (Working on Deal Relevance & Personalization Systems)
ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits

Agenda
• Basics of Hadoop & HBase
• How you can use Hadoop & HBase for big data applications
• Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase

Big Data Application Examples
• Recommendation Systems
• Ad Targeting
• Personalization Systems
• BI/DW
• Log Analysis
• Natural Language Processing

So what is Hadoop?
• General purpose framework for processing huge amounts of data
• Open Source
• Batch / Offline Oriented

Hadoop - HDFS
• Open Source Distributed File System
• Stores large files, which can easily be accessed by applications built on top of HDFS
• Data is distributed and replicated over multiple machines
• Linux-style commands, e.g. ls, cp, mv, touchz

Hadoop - HDFS
• Example:

hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB

(One of the folders from one of our Hadoop clusters)
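The "=~ 168 TB" figure can be checked with a little arithmetic. The sketch below assumes the slide counts a terabyte as 2^40 bytes (a TiB), which is what reproduces the number shown. The variable names are illustrative only.

```python
# Convert the byte count reported by `hadoop fs -dus` into larger units.
size_bytes = 185453399927478  # figure taken from the slide

tib = size_bytes / 2**40   # binary terabytes (TiB): matches the slide's ~168
tb = size_bytes / 10**12   # decimal terabytes (TB): a somewhat larger number

print(f"{tib:.2f} TiB, {tb:.2f} TB")
```

The two results differ by roughly 10%, which is worth keeping in mind when comparing disk vendor capacities (decimal) with tool output (often binary).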

Hadoop - Map Reduce
• Application framework built on top of HDFS to process your big data
• Operates on key-value pairs
• Mappers filter and transform input data
• Reducers aggregate mapper output

Example
• Given web logs, calculate the landing page conversion rate for each product

• So basically we need to see how many impressions each product received (and how many of them converted), and then calculate the conversion rate for each product

Map Reduce Example
Map Phase
• Map 1: Process Log File. Output: Key (Product ID), Value (Impression Count)
• Map 2: Process Log File. Output: Key (Product ID), Value (Impression Count)
• Map N: Process Log File. Output: Key (Product ID), Value (Impression Count)

Reduce Phase
• Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate the conversion rate. (Output: Product ID, Conversion Rate)
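The map and reduce phases above can be sketched in plain Python. This is a toy simulation, not Hadoop code: the log format (product ID, a tab, then "impression" or "purchase"), the function names, and `run_job` (which stands in for the framework's shuffle step) are all assumptions made for illustration.

```python
from collections import defaultdict

def mapper(log_line):
    """Emit (product_id, (impressions, purchases)) for one log line.

    Assumed (hypothetical) line format: product ID, a tab, then an event
    name that is either "impression" or "purchase".
    """
    product_id, event = log_line.strip().split("\t")
    if event == "impression":
        yield product_id, (1, 0)
    elif event == "purchase":
        yield product_id, (0, 1)

def reducer(product_id, values):
    """Receive every value for one product and compute its conversion rate."""
    impressions = purchases = 0
    for imp, pur in values:  # the "simple for loop" from the slide
        impressions += imp
        purchases += pur
    rate = purchases / impressions if impressions else 0.0
    return product_id, rate

def run_job(log_lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in log_lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(grouped.items()))

logs = [
    "p1\timpression", "p1\timpression", "p1\tpurchase",
    "p2\timpression", "p1\timpression", "p1\timpression",
]
rates = run_job(logs)
print(rates)
```

In a real job the mappers run in parallel over different log files, and the framework guarantees that all values for one product ID reach the same reducer.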

Recap
• We just processed terabytes of data and calculated the conversion rate across millions of products.
• Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website.

How about we generate recommendations in a batch process and serve them in real time?

HBase
• Provides real-time random read/write access over HDFS
• Built on Google's "Bigtable" design
• Open Source

This is not an RDBMS, so there are no joins. Access patterns are generally simple, like get(key), put(key, value), etc.

Row    | Cf:<qual> | Cf:<qual>  | Cf:<qual> | ….
Row 1  | Cf1:qual1 | Cf1:qual2  |           |
Row 11 | Cf1:qual2 | Cf1:qual22 | Cf1:qual3 |
Row 2  | Cf2:qual1 |            |           |
….
Row N

• Dynamic column names. No need to define columns upfront.
• Both rows and columns are sorted lexicographically

Row    | Cf:<qual>                                          | Cf:<qual>
user1  | Cf1:click_history:{actual_clicks_data}             | Cf1:purchases:{actual_purchases}
user11 | Cf1:purchases:{actual_purchases}                   |
user20 | Cf1:mobile_impressions:{actual mobile impressions} | Cf1:purchases:{actual_purchases}

Note: Each row has different columns, so think about this as a hash map rather than a table with rows and columns.
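One way to picture the "hash map" model is a map from row key to a second map of column name to value. The sketch below is a toy in-memory model, not the HBase client API. The class name `SparseTable`, its methods, and the placeholder values are all hypothetical.

```python
class SparseTable:
    """Toy in-memory model: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value):
        # Columns are created on first write; nothing is declared upfront.
        self._rows.setdefault(row, {})[column] = value

    def get(self, row, column=None):
        cells = self._rows.get(row, {})
        if column is None:
            return dict(sorted(cells.items()))  # columns in sorted order
        return cells.get(column)

    def scan(self):
        """Yield rows in lexicographic key order."""
        for row in sorted(self._rows):
            yield row, self.get(row)

table = SparseTable()
table.put("user20", "cf1:purchases", "{...}")          # placeholder values
table.put("user1", "cf1:purchases", "{...}")
table.put("user1", "cf1:click_history", "{...}")
table.put("user11", "cf1:purchases", "{...}")

row_order = [row for row, _ in table.scan()]
print(row_order)  # ['user1', 'user11', 'user20']
```

Note that "user11" sorts before "user20" because keys are compared character by character, not numerically. Row key design usually has to account for this.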

Putting it all together
[Diagram components: Web and Mobile clients; Store data in HDFS; Analyze Data (Map Reduce); Generate Recommendations (Map Reduce); Serve Real Time Requests (HBase)]

Do offline analysis in Hadoop, and serve real-time requests with HBase
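The split between offline computation and online serving might look like the sketch below. It is a simplified illustration under stated assumptions: a plain dict stands in for the HBase table, `batch_generate` stands in for a MapReduce job, and the scores and deal names are invented.

```python
# Offline: a batch job ranks deals for every user and writes the top results
# to a key-value store. Online: a request handler does one key lookup.

serving_store = {}  # stand-in for an HBase table keyed by user ID

def batch_generate(user_scores, top_n=2):
    """Batch step (a MapReduce job in practice): keep each user's top-N deals."""
    for user_id, scores in user_scores.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        serving_store[user_id] = ranked[:top_n]

def serve(user_id, fallback=()):
    """Online step: one get(key), with no heavy computation at request time."""
    return serving_store.get(user_id, list(fallback))

batch_generate({
    "user1": {"deal_a": 0.2, "deal_b": 0.9, "deal_c": 0.5},
    "user2": {"deal_c": 0.7},
})
print(serve("user1"))
print(serve("unknown", fallback=["deal_popular"]))
```

The request path never touches the raw logs, so its latency depends only on the key lookup, while the batch job can take as long as it needs.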

Use Case: Deal Relevance &
Personalization @ Groupon

Our Relevance Scenario
How do we surface relevant deals for users?
• Deals are perishable (deals expire or are sold out)
• No direct user intent (as in traditional search advertising)
• Relatively limited user information
• Deals are highly local

Two Sides to the Relevance Problem
• Algorithmic Issues: How to find relevant deals for individual users given a set of optimization criteria
• Scaling Issues: How to handle relevance for all users across multiple delivery platforms

Developing Deal Ranking Algorithms
• Exploring Data
  • Understanding signals, finding patterns
• Building Models/Heuristics
  • Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
• Conduct Experiments
  • Try out ideas on real users and evaluate their effect

Data Infrastructure
Growing Deals
• 2011: 20+
• 2012: 400+
• 2013: 2000+

Growing Users
• 100 Million+ subscribers
• We need to store data such as user click history, email records, service logs, etc. This amounts to billions of data points and terabytes of data

Deal Personalization Infrastructure Use Cases
• Deliver Personalized Emails (Offline System)
  • Personalize billions of emails for hundreds of millions of users
• Deliver Personalized Website & Mobile Experience (Online System)
  • Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views

Architecture
[Diagram components: Data Pipeline; HBase Offline System; Relevance Map/Reduce; Email; Replication; HBase for Online System; Real Time Relevance]

• We can now maintain different SLAs on online and offline systems
• We can tune the HBase cluster differently for online and offline systems

HBase Schema Design
• Row key (User ID): Unique identifier for users
• Column Family 1 (User History and Profile Information): Overwrite user history and profile info
• Column Family 2 (Email History For Users): Append email history for each day as a separate column. (On average, each row has over 200 columns)

• Most of our data access patterns are via "User Key"
• This makes it easy to design the HBase schema
• The actual data is kept in JSON
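The overwrite-versus-append distinction in this schema can be illustrated with a small in-memory sketch. It is not HBase client code. The column names (`cf1:profile`, `cf2:<date>`), the JSON fields, and the helper functions are all hypothetical, chosen only to show how a fixed qualifier overwrites while a date qualifier adds one column per day.

```python
import json
from datetime import date

table = {}  # row key (user ID) -> {"family:qualifier": JSON string}

def put(user_id, column, payload):
    table.setdefault(user_id, {})[column] = json.dumps(payload)

def update_profile(user_id, profile):
    # Column family 1: a fixed qualifier, so each write overwrites the last.
    put(user_id, "cf1:profile", profile)

def record_email(user_id, day, deals_sent):
    # Column family 2: the date is the qualifier, so each day adds a column.
    put(user_id, f"cf2:{day.isoformat()}", {"deals": deals_sent})

update_profile("user1", {"city": "Palo Alto"})
update_profile("user1", {"city": "Chicago"})  # replaces the earlier value
record_email("user1", date(2013, 9, 1), ["deal_a"])
record_email("user1", date(2013, 9, 2), ["deal_b", "deal_c"])

columns = sorted(table["user1"])
print(columns)
print(json.loads(table["user1"]["cf1:profile"]))
```

Because ISO dates sort lexicographically in chronological order, a user's email history comes back oldest to newest when the columns are read in sorted order.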

Cluster Sizing
• Hadoop + HBase Cluster: 100+ machine Hadoop cluster that runs heavy Map Reduce jobs. The same cluster also hosts a 15-node HBase cluster
• HBase Replication (from the Hadoop + HBase cluster to the online HBase cluster)
• Online HBase Cluster: 10-machine dedicated HBase cluster to serve real-time SLA

• Machine Profile
  • 96 GB RAM (HBase 25 GB)
  • 24 Virtual Cores CPU
  • 8 × 2 TB Disks
• Data Profile
  • 100 Million+ Records
  • 2 TB+ Data
  • Over 4.2 Billion Data Points

Questions?

Thank You!
(We are hiring!)
www.groupon.com/techjobs
