Building Recommendation Platforms with Hadoop

REMINDER
Check in on the COLLABORATE
mobile app
Prepared by:
Jayant Shekhar
Sr. Solutions Architect
Cloudera

Agenda
■ Why Big Data Recommendation Platform?
■ Common Recommendation Patterns & Algorithms
■ Lambda Architecture
■ Architecture & Design of Computation & Serving Layers
■ Social Recommendations with Giraph
■ Recommendations with Solr
■ Recommendation with Storm/HBase

Recommendations are one of the most common use cases for Hadoop.

Recommendations: Broader Use Cases
• Product Recommendation
• People/Social Recommendation
• Merchant Recommendation
• Content Recommendation
• Query Recommendation
• Sponsored Search Advertising
• News Recommendations
• Merchant/Offer Recommendations on mobile
• Similar Profiles/Resumes

Recommendations can be
• Realtime
• Offline
• In Between

Recommendations are delivered through
• Web
• Mobile
• Email
• Postal mail
• Newspaper/Magazine ads

Data Sets Involved
• Items/Products/Content
• Transaction Data
• User Data
• Logs & User Activity
• Additional 3rd Party Data
• Geo
• Social
• Reviews
• …
Different Time to Action Targeted
• User would view the content now / buy the product now
• User would buy the product in a week
• The next time he/she goes grocery shopping
• User would buy the product in the next 3 months
• TV/Dishwasher etc.
• Vacation
Also need to be able to determine/differentiate between the users in a household

Common Recommendation Patterns & Associated Algorithms

Some ML Algorithms used for Recommendations
• Collaborative Filtering
• Clustering
• Classification & Regression
• Pattern Mining

Collaborative Filtering
• ALS
• SVD
• Slope One Recommender

Clustering
• K-means
• Canopy
• Fuzzy K-Means

Classification & Regression
• Logistic Regression
• Naïve Bayes
• Random Forest

Pattern Mining
• Parallel FP-Growth

Product Recommendation
Use Cases
• Recommend Product
• Recommend Movies/Videos
Algorithms
• Collaborative Filtering
• Logistic Regression
Frequently bought/viewed together
Use Cases
• Find items that are frequently bought together
• Related Searches/Query Suggestion
• View Item Page
Algorithms
• Parallel FP-Growth: http://infolab.stanford.edu/~echang/recsys08-69.pdf
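
As a minimal sketch of the "frequently bought together" idea, the Python below counts how often two items appear in the same basket and keeps the pairs that reach a minimum support. It is plain single-machine pair counting, not Parallel FP-Growth; the basket data and the `min_support` parameter are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=2):
    """Return item pairs that co-occur in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so (a, b) and (b, a) count as the same pair.
        items = sorted(set(basket))
        pair_counts.update(combinations(items, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical transaction data: one list of item ids per purchase.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

FP-Growth generalizes this from pairs to itemsets of any size without enumerating every combination, which is what makes it practical on large transaction logs.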

Related Searches or Query Recommendations
Design
• Use Query Log Data
• Cluster similar queries
• Use Parallel FP-Growth to find the related searches
Query Distance
• Based on keywords or phrases
• Based on searches in the same session
• Based on common clicked URLs
• Based on the distance of the clicked documents
Related Articles/News
• Batch clustering with K-Means
• NRT clustering using the centroids
• Perform canopy clustering on leftover articles
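
One way to realize the "common clicked URLs" query distance is a Jaccard distance over the sets of URLs clicked for each query. The sketch below assumes that choice (the slides do not name a specific measure), and the query log contents are made up for illustration.

```python
def jaccard_distance(urls_a, urls_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of clicked URLs."""
    a, b = set(urls_a), set(urls_b)
    if not a and not b:
        return 1.0  # no click evidence at all: treat the queries as unrelated
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical query log: query -> URLs clicked after issuing it.
clicks = {
    "hadoop tutorial": {"u1", "u2", "u3"},
    "learn hadoop": {"u2", "u3", "u4"},
    "cheap flights": {"u9"},
}
d_close = jaccard_distance(clicks["hadoop tutorial"], clicks["learn hadoop"])
d_far = jaccard_distance(clicks["hadoop tutorial"], clicks["cheap flights"])
```

Queries whose distance falls under a chosen threshold can then be clustered together as candidates for related searches.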

Social/People Recommendations
Use Case
• Recommend Missing Links in a Social Network
• Bipartite Matching – Recommend Men/Women
Design
• Take existing edges and friends of friends
• Build Regression Models based on latest activity
• Scale easily offline with Hadoop, as the number of friends of friends and activities could be very high
• Giraph

Lambda Architecture
[Diagram components: New Data Stream, Stream Processing, Realtime View, All Data, Pre-compute Views, Batch Views, Query]
Lambda architecture proposed by Nathan Marz, creator of Storm

Lambda Architecture
Large Scale Offline Batch
+
Real-time Online Streaming
■ Batch Layer : offline, asynchronous
■ Serving Layer : real-time, incremental, approximate

Computation & Serving Layers for Recommendations

Closer View of Oryx Serving & Model Generation
[Diagram components: Computation Layer; HDFS holding Generation 0, Generation 1, Generation 2, Generation 3; several Serving Layer instances behind an API]
Generation directory contains:
• Input data
• Configuration
• Model

Feature Generation & Model Building
[Diagram components: data sets in HDFS, Feature Generation, Model Generation, multiple resulting models]
Hadoop makes it easy to iterate on model generation and to test the resulting models offline.

Requirements for ML on Hadoop
■ Model Building
▪ Large Scale Distributed
▪ Continuous
■ Model Serving
▪ Real-time query
▪ Real-time updates
■ Algorithms
▪ Parallelizable
▪ Updateable
■ Interoperable
▪ PMML model format
▪ Simple REST API
▪ Open Source

Computation Layer Vs Serving Layer
■ Computation Layer
▪ Periodically builds a new generation from recent data and the past model
▪ Babysits the MapReduce job
▪ Publishes the model
■ Serving Layer
▪ Consumes Model
▪ Serves queries from model in memory
▪ Updates the model from new input
▪ Also writes input to HDFS
▪ Replicas for scale

Collaborative Filtering : ALS
■ Alternating Least Squares
■ Matrix Factorization
■ Faster than SVD
■ Real-time update
■ Parallelizable

Clustering : K-means++
■ Well-known and understood
■ Parallelizable
■ Clusters Updateable
■ Chooses the initial set of centers so that the starting solution is already close to the optimum
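
The seeding step that distinguishes k-means++ from plain k-means can be sketched in a few lines of Python. This is a single-machine illustration rather than a parallel implementation; the sample points and the `seed` parameter are assumptions, and it expects at least `k` distinct points.

```python
import random

def kmeans_pp_seeds(points, k, seed=0):
    """Pick k initial centers using k-means++ (D^2-weighted) seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((p_i - c_i) ** 2 for p_i, c_i in zip(p, c))
                  for c in centers)
              for p in points]
        # Sample the next center with probability proportional to D^2,
        # so far-away points are favored and chosen points have weight 0.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
seeds = kmeans_pp_seeds(points, k=3)
```

The seeds would then be handed to the usual k-means assignment and update loop.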

Classification/Regression : RDF
■ Random Decision Forests
■ Ensemble Method
■ Numerical, Categorical features and target
■ Very Parallel
■ Nodes Updateable

Social Recommendation with Giraph

Graph Use Cases
• Social Recommendations
• Recommend missing links in a social network
• Twitter Graph
• Who to follow
• Similar To
• Bipartite Matching
• Matching job/employees, men/women
• How are users connected
• Clustering – find related people in groups

Giraph
■ Each vertex has an id, a value, a list of its adjacent neighbour ids and the corresponding edge values
■ Edges are always directed
▪ Out-edges attached to a node
▪ Nodes can’t see inbound edges
■ Nodes communicate via messages
■ No remote reads

Giraph BSP
■ Input is a directed graph
■ Each vertex is invoked in each superstep; it can recompute its value and send messages to other vertices, which are delivered over superstep barriers
■ This continues until every vertex votes to halt
■ Output is a directed graph
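
To make the superstep model concrete, here is a single-process Python simulation of BSP that propagates the maximum vertex value through a directed graph. It is only a sketch of the programming model and does not use the Giraph API; the function name, graph, and values are illustrative assumptions.

```python
def run_max_value(out_edges, initial_values):
    """Simulate vertex-centric BSP: every vertex ends with the largest
    value that can reach it along directed edges."""
    values = dict(initial_values)
    inbox = {v: [] for v in values}
    active = set(values)  # all vertices are active in superstep 0
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            # Send along out-edges only; a vertex cannot see inbound edges.
            if superstep == 0 or changed:
                for neighbour in out_edges.get(v, []):
                    outbox[neighbour].append(new_value)
            # The vertex then votes to halt (implicit here).
        # Superstep barrier: messages become visible in the next superstep,
        # and a halted vertex is woken up only if it received a message.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values

out_edges = {1: [2], 2: [3], 3: [1]}  # a directed cycle
result = run_max_value(out_edges, {1: 3, 2: 6, 3: 2})
```

The loop ends when no messages are in flight and every vertex has voted to halt, which mirrors the termination condition described above.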

ML Algorithms with Graph Processing
■ Collaborative Filtering
■ Clustering
■ Gradient Descent : Linear Regression, Logistic Regression

Matrix factorization
M = U × V
ALS: fix one side and solve for the other
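
The "fix one side and solve for the other" step can be sketched in pure Python for the simplest case, a rank-1 factorization, where each least-squares solve reduces to a single division. The rank-1 restriction, the `reg` parameter, and the sample ratings are assumptions made to keep the example short; with `reg=0.0`, every row and column needs at least one observed entry.

```python
def als_rank1(ratings, n_rows, n_cols, iterations=50, reg=0.0):
    """Fit M[i][j] ~ u[i] * v[j] on the observed entries only.

    ratings maps (row, col) -> value; absent pairs are unobserved.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v and solve the least-squares problem for each u[i].
        for i in range(n_rows):
            num = sum(m * v[j] for (r, j), m in ratings.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in ratings if r == i) + reg
            u[i] = num / den
        # Fix u and solve for each v[j].
        for j in range(n_cols):
            num = sum(m * u[i] for (i, c), m in ratings.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in ratings if c == j) + reg
            v[j] = num / den
    return u, v

# Observed entries of the rank-1 matrix [1, 2, 3]^T x [2, 1, 4];
# cells (0, 1) and (2, 0) are held out as "missing".
ratings = {(0, 0): 2, (0, 2): 4, (1, 0): 4, (1, 1): 2, (1, 2): 8,
           (2, 1): 3, (2, 2): 12}
u, v = als_rank1(ratings, n_rows=3, n_cols=3)
predicted_missing = u[0] * v[1]  # the true value of this cell is 1
```

For higher ranks each update becomes a small regularized linear solve per row or column, and because those solves are independent of each other the algorithm parallelizes well.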

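Representing Matrix by Graphs
Matrix ("-" marks a missing entry):
3 - 8
- 9 5
5 - -
Graph: one vertex per row (1, 2, 3) and one vertex per column (1, 2, 3); each known entry becomes an edge weighted with its value:
• row 1, column 1: 3
• row 1, column 3: 8
• row 2, column 2: 9
• row 2, column 3: 5
• row 3, column 1: 5
• every vertex holds a row vector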
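
The conversion from a sparse matrix to a bipartite graph can be sketched as follows, using the 3 x 3 matrix from this slide. The vertex naming scheme (`row1`, `col1`, ...) and the use of `None` for missing entries are assumptions of the sketch.

```python
def matrix_to_bipartite_edges(matrix):
    """Turn a sparse matrix (None = missing) into weighted row-to-column edges."""
    edges = []
    for r, row in enumerate(matrix, start=1):
        for c, value in enumerate(row, start=1):
            if value is not None:
                edges.append((f"row{r}", f"col{c}", value))
    return edges

matrix = [
    [3, None, 8],
    [None, 9, 5],
    [5, None, None],
]
edges = matrix_to_bipartite_edges(matrix)
```

In a Giraph-style system, which only stores out-edges, each edge would typically be added in both directions so that row and column vertices can message each other.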

Lucene Inverted Index

Input Documents
Document   Content Field
1          framework for job scheduling
2          data warehouse infrastructure and
3          fast and general compute engine
4          data serialization system and
5          wide variety of companies
…          …

Index
Term        Documents
framework   1[1x]
for         1[1x], 5[1x]
job         1[1x]
data        2[1x], 4[1x]
…           …
and         3[1x], 4[1x]
wide        5[1x]
variety     5[2x]
…           …
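
A toy version of the inverted index above can be built in a few lines of Python. This sketch assumes lowercase whitespace tokenization, which is far simpler than a real Lucene analyzer, and it uses the truncated content fields exactly as shown on the slide, so its counts will not match the slide's index table in every row.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

docs = {
    1: "framework for job scheduling",
    2: "data warehouse infrastructure and",
    3: "fast and general compute engine",
    4: "data serialization system and",
    5: "wide variety of companies",
}
index = build_inverted_index(docs)
```

Looking up a term is then a single dictionary access, for example `index["data"]`, which is what makes term-based retrieval fast.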

Recommendation Approaches in Solr
■ Attribute-based
■ Textual Similarity-based
■ More-like-this
■ Collaborative Filtering

Attribute-based Recommendations
■ Example: Match User Attributes to Item Attribute Fields
/solr/select/?q=(grouptitle:"big data"^25 OR grouptitle:(java)^10) AND ((city:"Las Vegas" AND state:"NV")^15 OR state:"NV")
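
A query like the one above could be assembled from a user profile roughly as follows. The field names come from the example; the helper names, the fixed location boost of 15, and the profile values are assumptions, and the sketch does not escape special characters in user-supplied values.

```python
from urllib.parse import urlencode

def boosted_clause(field, value, boost):
    """Format one field:"value"^boost clause for the q parameter."""
    return f'{field}:"{value}"^{boost}'

def attribute_query(interests, city, state):
    """Build a Solr q string and select URL from user attributes.

    interests is a list of (phrase, boost) pairs.
    """
    interest_part = " OR ".join(
        boosted_clause("grouptitle", phrase, boost)
        for phrase, boost in interests
    )
    # An exact city match is boosted; a state-only match still qualifies.
    location_part = f'(city:"{city}" AND state:"{state}")^15 OR state:"{state}"'
    q = f"({interest_part}) AND ({location_part})"
    return q, "/solr/select/?" + urlencode({"q": q})

q, url = attribute_query([("big data", 25), ("java", 10)], "Las Vegas", "NV")
```

`urlencode` percent-encodes the quotes, spaces, and carets so the query survives transport as a URL parameter.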

Textual Similarity-based Recommendations
■ Solr’s MoreLikeThis Search Component
■ Extracts important keywords from one or more documents and uses them in a search
■ This yields secondary search results that are textually similar to the original document
■ http://wiki.apache.org/solr/MoreLikeThis

Content Recommendation
■ Even a single keyword can be enough to begin making meaningful recommendations.
■ Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases:
▪ Jobs/Resumes, Events, Restaurants
■ /solr/select/?q=(Standard Recommendation Query) AND _val_:"(recip(geodist(location, 40.7142, 74.0064),1,1,0))"

Behavior Based Recommendation Approaches
Collaborative Filtering: “Users who liked these also liked…”
■ Step 1: Find similar users who like the same documents
q=documentid:("doc1" OR "doc4")
■ Step 2: Search for docs “liked” by those similar users
/solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)
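
The two steps can be sketched in Python with an in-memory dictionary standing in for the Step 1 search results. Boosting each similar user by the number of seed documents they share is an assumption about how the weights in the example were derived, and the `likes` data is made up.

```python
from collections import Counter

def similar_user_query(likes, seed_docs):
    """Build the Step 2 query from users who like any of seed_docs.

    likes maps user id -> set of liked document ids.
    """
    overlap = Counter()
    for user, docs in likes.items():
        shared = len(docs & seed_docs)
        if shared:
            overlap[user] = shared
    # Strongest overlap first; ties broken by user id for a stable result.
    ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))
    clauses = " OR ".join(f'"{user}"^{weight}' for user, weight in ranked)
    return f"userlikes:({clauses})"

likes = {
    "user1": {"doc1", "doc7"},
    "user4": {"doc1", "doc4", "doc9"},
    "user5": {"doc1", "doc4"},
    "user8": {"doc2"},
}
q = similar_user_query(likes, {"doc1", "doc4"})
```

A production version would also exclude the user the recommendations are for, and filter out documents that user has already liked.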

Cloudera Search Architecture
[Diagram, components and flows:
• Online streaming data → Flume
• Flume → HDFS: raw, filtered, or annotated data
• Flume → SolrCloud cluster(s): NRT data indexed w/ Morphlines
• HDFS → MapReduce batch indexing w/ Morphlines → SolrCloud cluster(s): indexed data, GoLive updates
• OLTP data → HBase cluster → SolrCloud cluster(s): NRT replication events indexed w/ Morphlines
• End user client app (e.g. Hue) → SolrCloud cluster(s): search queries
• Cloudera Manager]

Real-time Architecture using Storm & Hadoop
[Diagram components: Incoming Data, Storm, Hadoop, Key/Value Store, Query]

Real-time and Storm
■ The query layer queries both the real-time and batch views and merges the results
■ Some algorithms are hard to implement in real time; in those cases the results can be estimated
■ The model is generated offline on Hadoop and deployed into Storm
■ Online learning algorithms can be used in Storm; they learn continuously from streaming training data
■ Storm can also be used for scoring
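
The merge performed by the query layer can be sketched for the simple case of counts. The sketch assumes the real-time view holds only events that arrived after the last batch run, so the two views never double count; the view contents and function names are illustrative.

```python
def merged_count(key, batch_view, realtime_view):
    """Combine the precomputed batch count with the recent real-time delta."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

def top_n(batch_view, realtime_view, n):
    """Rank keys by their merged count, highest first."""
    keys = set(batch_view) | set(realtime_view)
    totals = {k: merged_count(k, batch_view, realtime_view) for k in keys}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

batch_view = {"item_a": 120, "item_b": 95}   # computed offline on Hadoop
realtime_view = {"item_b": 30, "item_c": 4}  # maintained by the stream layer
top = top_n(batch_view, realtime_view, 2)
```

Counts merge cleanly because addition is associative; views such as distinct counts or medians need mergeable approximations instead, which is one reason some results are estimated.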

Storm/Track Realtime Events
■ Real-time streaming analytics/stats on consumer viewing behavior and digital content trends
■ Track impressions, clicks, conversions, bid requests etc. in real time. Push per-minute aggregations to HBase.
■ Most Popular Searches/Downloads/News Articles/Movies/Products
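
The per-minute aggregation step can be sketched in plain Python, independent of Storm or HBase. The event tuple layout (epoch seconds, event type) and the sample events are assumptions; in a real topology the resulting counts would be flushed to the store at the end of each minute.

```python
from collections import Counter

def per_minute_counts(events):
    """Aggregate (epoch_seconds, event_type) pairs into per-minute counts."""
    counts = Counter()
    for ts, event_type in events:
        minute_start = ts - (ts % 60)  # floor to the start of the minute
        counts[(minute_start, event_type)] += 1
    return counts

events = [
    (1000, "click"),
    (1010, "click"),
    (1015, "impression"),
    (1085, "click"),
]
counts = per_minute_counts(events)
```

Keying each count by (minute, event type) maps naturally onto a row key plus column layout in a key/value store.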

Training of Models
A/B Testing

Offline Training & Testing of Models
Use Cases
• Recommend Missing Links in a Social Network
• How are users connected
• Clustering – find related people in groups
• Iterative Graph Ranking
Hadoop provides an excellent platform to train and test the models and various algorithms.
[Diagram: the training set is used to train the model; the model is tested on the test set to produce a score]
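
The train/test/score loop can be sketched on a single machine as below. The hold-out fraction, the hit-rate metric, and the popularity baseline used as the "model" are all assumptions chosen to keep the example small; they are not taken from the slides.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def hit_rate(recommend, test_set, k=2):
    """Fraction of held-out (user, item) pairs found in the user's top-k."""
    if not test_set:
        return 0.0
    hits = sum(1 for user, item in test_set if item in recommend(user)[:k])
    return hits / len(test_set)

# Hypothetical (user, item) interactions.
records = [(f"u{i}", f"item{i % 3}") for i in range(10)]
train, test = train_test_split(records)

# Baseline "model": recommend the most popular training items to everyone.
popular = [item for item, _ in Counter(i for _, i in train).most_common()]
score = hit_rate(lambda user: popular, test, k=2)
```

Swapping the baseline for a real model while keeping the split and metric fixed is what makes offline comparisons between algorithms meaningful.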

A/B Testing
[Diagram: traffic is split between the new model (X%) and the old model ((100-X)%)]
A/B testing is used to test the performance of the models online.
A/B testing involves:
• Partitioning real traffic between two models and then measuring each model's performance against the desired result (maximize CTR, revenue, page views etc.).
• The partitioning logic can get complicated. In such cases the partitions can be pre-computed on Hadoop offline and pushed to an online store.
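
A simple form of the X% / (100-X)% split is deterministic hash bucketing, sketched below. Hashing the user id with MD5 into 100 buckets is an assumed scheme, not one named in the slides; its useful property is that a given user always lands on the same model.

```python
import hashlib

def assign_model(user_id, new_model_percent):
    """Deterministically route a user to the 'new' or 'old' model."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new" if bucket < new_model_percent else "old"

# Route 10% of hypothetical users to the new model.
assignments = {f"user{i}": assign_model(f"user{i}", 10) for i in range(1000)}
new_share = sum(1 for m in assignments.values() if m == "new") / len(assignments)
```

When the rule is more involved than a hash, the same user-to-model mapping can be computed offline and looked up at request time.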

Trends, Aggregations & Counters
• Most Popular Searches/Downloads/News Articles/Movies/Products
• Load results into HBase
• Use HBase where we need NRT count of things (categories/products etc.)
• Impala is very useful here for faster SLAs
HBase Counters
• Supports atomically incrementing column values
• Avoids the lock row / read value / increment / write back / unlock row cycle
• Great for counting specific metrics over time
• Example - count per URL/Product
• Can disable write to WAL on puts
