Parallel and Iterative Processing for Machine Learning Recommendations with Spark

© 2014 MapR Technologies

Agenda
• Collaborative Filtering with Spark
• Model training
• Alternating Least Squares
• The code

Collaborative Filtering with Spark
• Recommend Items
– (filtering)
• Based on User preferences data
– (collaborative)

Train a Model to Make Predictions
[Diagram: Training Data → Algorithm → Model; New Data → Model → Predictions]
• Training data: Ted and Carol like Movie B and C
• New data: Bob likes Movie B. What might he like?
• Prediction: Bob likes Movie B, so predict Movie C

Alternating Least Squares
• Approximates the sparse user-item rating matrix
– as the product of two dense matrices, the User and Item factor matrices
– tries to learn the hidden features of each user and item
– the algorithm alternately fixes one factor matrix and solves for the other
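
The alternating step described above can be sketched in plain Scala, without Spark. This is only a toy illustration under stated assumptions: rank 1 (a single hidden feature per user and item), a made-up regularization weight `lambda`, and hypothetical ratings for Ted, Carol and Bob. It is not how MLlib implements ALS; the object and method names are invented for this sketch.

```scala
object Rank1AlsSketch {
  // Observed ratings keyed by (user, item); unrated pairs are simply absent.
  // Users: 0 = Ted, 1 = Carol, 2 = Bob. Items: 0 = Movie B, 1 = Movie C.
  val ratings: Map[(Int, Int), Double] = Map(
    (0, 0) -> 5.0, (0, 1) -> 5.0,
    (1, 0) -> 5.0, (1, 1) -> 5.0,
    (2, 0) -> 5.0
  )
  val lambda = 0.01 // regularization weight (arbitrary choice for the sketch)

  // Regularized squared error over the observed ratings only.
  def loss(users: Map[Int, Double], items: Map[Int, Double]): Double = {
    val fit = ratings.map { case ((u, i), r) => math.pow(r - users(u) * items(i), 2) }.sum
    fit + lambda * (users.values.map(x => x * x).sum + items.values.map(x => x * x).sum)
  }

  // Item factors fixed: each user's factor has a closed-form least-squares solution.
  def solveUsers(items: Map[Int, Double]): Map[Int, Double] =
    ratings.toSeq.groupBy { case ((u, _), _) => u }.map { case (u, obs) =>
      val num = obs.map { case ((_, i), r) => r * items(i) }.sum
      val den = obs.map { case ((_, i), _) => items(i) * items(i) }.sum
      u -> num / (den + lambda)
    }

  // User factors fixed: the symmetric step solves for each item's factor.
  def solveItems(users: Map[Int, Double]): Map[Int, Double] =
    ratings.toSeq.groupBy { case ((_, i), _) => i }.map { case (i, obs) =>
      val num = obs.map { case ((u, _), r) => r * users(u) }.sum
      val den = obs.map { case ((u, _), _) => users(u) * users(u) }.sum
      i -> num / (den + lambda)
    }

  // Alternate between the two solves, starting from all-ones factors.
  def train(iterations: Int): (Map[Int, Double], Map[Int, Double]) = {
    var users = Map(0 -> 1.0, 1 -> 1.0, 2 -> 1.0)
    var items = Map(0 -> 1.0, 1 -> 1.0)
    for (_ <- 1 to iterations) {
      users = solveUsers(items)
      items = solveItems(users)
    }
    (users, items)
  }
}
```

After training, `users(2) * items(1)` is the predicted rating for the pair that was never observed (Bob, Movie C), which is the kind of prediction the factor matrices make possible.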

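ML Cross Validation Process
[Diagram: Data is split into a Training Set and a Test Set. The Training Set feeds Model Training/Building; the Test Set is used to Test Model and produce Predictions. Training and testing repeat in a Train Test loop.]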

Typical MapReduce Workflows
[Diagram: Input to Job 1 → Job 1 (Maps, Reduces) → Output from Job 1 → Job 2 (Maps, Reduces) → Output from Job 2 → Input to last job → Last Job (Maps, Reduces) → Output from last job. The data passed between jobs is stored as SequenceFiles in HDFS.]
Iteration is slow because each job writes its output to disk and the next job reads it back

Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
• Operated on in parallel
• Partitions cached in memory
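
A minimal sketch of why caching matters for iteration, assuming Spark is on the classpath and a local master is acceptable. The object name, app name and the toy data are made up for illustration; the RDD is computed once, cached, and then reused by several passes without going back to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheSketch {
  def run(): Seq[Double] = {
    // Local master with 2 threads; a real job would take the master from spark-submit.
    val conf = new SparkConf().setAppName("rdd-cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // A read-only collection split into 4 partitions and kept in memory.
      val numbers = sc.parallelize(1 to 1000, 4).map(_.toDouble).cache()
      // Each pass reuses the cached partitions instead of recomputing them.
      (1 to 3).map { k => numbers.map(x => x * k).sum() }
    } finally {
      sc.stop()
    }
  }
}
```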

Ratings Data

Parse Input
import org.apache.spark.mllib.recommendation.Rating

// parse one input line of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// create an RDD of Rating objects (ratingText is an RDD[String] of input lines)
val ratingsRDD = ratingText.map(parseRating).cache()
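
The parser above throws on a malformed line. One possible defensive variant is sketched below in plain Scala. The `Rating` case class here is a local stand-in with the same three fields as MLlib's `Rating`, and `parseRatingOpt` is a hypothetical helper name, not part of the original code. It assumes that any fields after the third can be ignored.

```scala
import scala.util.Try

// Local stand-in for org.apache.spark.mllib.recommendation.Rating (assumption for this sketch).
final case class Rating(user: Int, product: Int, rating: Double)

object SafeParse {
  // Returns None for lines with too few fields or non-numeric values.
  def parseRatingOpt(line: String): Option[Rating] =
    line.split("::") match {
      case Array(user, movie, rating, _*) =>
        Try(Rating(user.trim.toInt, movie.trim.toInt, rating.trim.toDouble)).toOption
      case _ => None
    }
}
```

With an RDD of lines, mapping each line through such a helper and keeping only the defined results would skip bad records instead of failing the job.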

Build Model
[Diagram: Data is split into a Training Set and a Test Set; the Training Set is used to Build the Model]
• Split the ratings RDD into a training data RDD (80%) and a test data RDD (20%)
• Build a user product matrix model

Create Model
import org.apache.spark.mllib.recommendation.ALS

// Randomly split ratings RDD into training data RDD (80%)
// and test data RDD (20%); 0L is the random seed
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// build an ALS user product matrix model with rank=20, iterations=10
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
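
To see the model-building step end to end on something small, here is a sketch that trains MLlib ALS on the Ted, Carol and Bob example and asks for Bob's predicted rating of Movie C. The IDs, ratings, rank, iteration count and lambda are all made-up values for illustration, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object TinyAlsSketch {
  def predictBobOnC(): Double = {
    val conf = new SparkConf().setAppName("tiny-als-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Users: 1 = Ted, 2 = Carol, 3 = Bob. Products: 1 = Movie B, 2 = Movie C.
      val ratings = sc.parallelize(Seq(
        Rating(1, 1, 5.0), Rating(1, 2, 5.0),
        Rating(2, 1, 5.0), Rating(2, 2, 5.0),
        Rating(3, 1, 5.0)))
      // rank = 2, iterations = 10, lambda = 0.01 (arbitrary choices for a toy data set)
      val model = ALS.train(ratings, 2, 10, 0.01)
      // Predicted rating for a (user, product) pair that was never observed.
      model.predict(3, 2)
    } finally {
      sc.stop()
    }
  }
}
```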

Get predictions
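// get predicted ratings to compare to test ratings
// model.predict takes (UserId, MovieId) pairs as input, so drop the rating from the test data
// (testUserProductRDD is not defined on the slides; this definition is an assumption)
val testUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => (user, product)
}
val predictionsForTestRDD = model.predict(testUserProductRDD)
[Diagram: Test Data (User, Movie) → Model → Predicted Ratings]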

Compare predictions to Tests
Join predicted ratings to test ratings in order to compare
• Test ratings as (Key, Value): ((user, product), test rating)
• Predicted ratings as (Key, Value): ((user, product), predicted rating)
• Joined result as (Key, Value): ((user, product), (test rating, predicted rating))

Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// join the test with predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
  .join(predictionsKeyedByUserProductRDD)
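
The shape of the joined result can be checked on local collections. The sketch below mimics an inner join by key with plain Scala Maps; the keys and ratings are made-up values and `joinByKey` is a hypothetical helper, not a Spark API.

```scala
object JoinSketch {
  type Key = (Int, Int) // (user, product)

  // Inner join: keep only keys present on both sides, pairing up the two values.
  def joinByKey(test: Map[Key, Double], predicted: Map[Key, Double]): Map[Key, (Double, Double)] =
    test.collect {
      case (key, testRating) if predicted.contains(key) =>
        key -> (testRating, predicted(key))
    }

  val test: Map[Key, Double] = Map((1, 10) -> 4.0, (1, 20) -> 1.0, (2, 10) -> 5.0)
  val predicted: Map[Key, Double] = Map((1, 10) -> 3.5, (1, 20) -> 4.2)
  val joined: Map[Key, (Double, Double)] = joinByKey(test, predicted)
}
```

The key (2, 10) has no prediction, so it does not appear in `joined`, which matches inner-join behavior.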

Compare predictions to Tests
Find false positives, where test rating <= 1 and predicted rating >= 4
• Input (Key, Value): ((user, product), (test rating, predicted rating))

Test Model
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
// output:
// Array[((Int, Int), (Double, Double))] =
//   ((3842,2858),(1.0,4.106488210964762)),
//   ((6031,3194),(1.0,4.790778049100913))
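
The same filter can be tried on a local collection. In the sketch below the first two rows are the sample output shown above, and the third row is a made-up pair added only so that the filter has something to reject.

```scala
object FalsePositiveSketch {
  val joined: Seq[((Int, Int), (Double, Double))] = Seq(
    ((3842, 2858), (1.0, 4.106488210964762)),
    ((6031, 3194), (1.0, 4.790778049100913)),
    ((1, 1), (5.0, 4.5)) // hypothetical row: high test rating, so not a false positive
  )

  // Same predicate as the RDD filter: low actual rating, high predicted rating.
  val falsePositives = joined.filter {
    case (_, (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4
  }

  // Share of joined pairs that are false positives.
  val falsePositiveRate: Double = falsePositives.size.toDouble / joined.size
}
```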

Test Model Mean Absolute Error
// Evaluate the model using Mean Absolute Error (MAE) between
// test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = testRating - predRating
    Math.abs(err)
}.mean()
// output:
// meanAbsoluteError: Double = 0.7244940545944053
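
For reference, MAE is the average of |test rating - predicted rating| over all joined pairs. The sketch below computes it on a few made-up pairs in plain Scala and, as an addition not on the slides, also computes root mean squared error (RMSE), which penalizes large misses more heavily.

```scala
object ErrorMetricsSketch {
  // (test rating, predicted rating) pairs; values are invented for illustration.
  val pairs: Seq[(Double, Double)] = Seq((1.0, 4.0), (5.0, 4.5), (3.0, 3.0))

  // Mean of the absolute differences.
  val mae: Double =
    pairs.map { case (t, p) => math.abs(t - p) }.sum / pairs.size

  // Square root of the mean of the squared differences.
  val rmse: Double =
    math.sqrt(pairs.map { case (t, p) => (t - p) * (t - p) }.sum / pairs.size)
}
```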

Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming

Machine Learning Blog
• https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark

Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with
Databricks
– mapr-spark package with Spark, Shark, Spark Streaming today
– Spark-python, GraphX and MLlib soon
• YARN integration
– Spark can then allocate resources from the cluster when needed
