© 2014 MapR Technologies
Parallel and Iterative Processing for Machine Learning Recommendations with Spark
Agenda
• Collaborative Filtering with Spark
• Model training
• Alternating Least Squares
• The code
Collaborative Filtering with Spark
• Recommend Items
– (filtering)
• Based on User preferences data
– (collaborative)
Train a Model to Make Predictions
[Diagram: training data -> algorithm -> model; new data -> model -> predictions]
Ted and Carol like Movie B and C
Bob likes Movie B. What might he like?
Bob likes Movie B: predict C
Alternating Least Squares
• approximates the sparse user-item rating matrix
– as the product of two dense matrices, the User and Item factor matrices
– tries to learn the hidden features of each user and item
– alternately fixes one factor matrix and solves for the other
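The alternating update can be illustrated with a minimal rank-1 sketch in plain Scala (no Spark). This is a sketch under stated assumptions, not MLlib's implementation: each user and each item gets a single latent factor, the regularization weight `lambda` is an assumption that does not appear in the slides, and the names `Obs`, `alsRank1`, and `squaredError` are hypothetical.

```scala
// Toy ALS with rank 1: one latent factor per user and per item.
// Obs, alsRank1 and squaredError are illustrative names, not Spark MLlib API.
case class Obs(user: Int, item: Int, rating: Double)

def alsRank1(obs: Seq[Obs], numUsers: Int, numItems: Int,
             iterations: Int, lambda: Double): (Array[Double], Array[Double]) = {
  val userF = Array.fill(numUsers)(1.0)
  val itemF = Array.fill(numItems)(1.0)
  val byUser = obs.groupBy(_.user)
  val byItem = obs.groupBy(_.item)
  for (_ <- 1 to iterations) {
    // Fix the item factors and solve a 1-D regularized least squares per user.
    for ((u, rs) <- byUser) {
      val num = rs.map(o => o.rating * itemF(o.item)).sum
      val den = rs.map(o => itemF(o.item) * itemF(o.item)).sum + lambda
      userF(u) = num / den
    }
    // Fix the user factors and solve the same problem per item.
    for ((i, rs) <- byItem) {
      val num = rs.map(o => o.rating * userF(o.user)).sum
      val den = rs.map(o => userF(o.user) * userF(o.user)).sum + lambda
      itemF(i) = num / den
    }
  }
  (userF, itemF)
}

// Sum of squared differences between observed and reconstructed ratings.
def squaredError(obs: Seq[Obs], userF: Array[Double], itemF: Array[Double]): Double =
  obs.map(o => math.pow(o.rating - userF(o.user) * itemF(o.item), 2)).sum
```

With a higher rank, each per-user or per-item step becomes a small K-dimensional least squares solve instead of a single division, but the alternation between the two factor matrices stays the same.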
ML Cross Validation Process
[Diagram: data is split into a training set and a test set; the training set feeds model training/building, the model is tested on the test set to produce predictions, and the train/test loop repeats]
Typical MapReduce Workflows
[Diagram: a chain of MapReduce jobs (Job 1, Job 2, ..., last job); each job's maps and reduces read their input from HDFS and write their output back to HDFS as a SequenceFile, which becomes the input to the next job]
Iteration is slow because it writes/reads data to disk
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• read-only collection of elements
• operated on in parallel
• partitions cached in memory
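As a local analogy only, the idea of partitions that are operated on in parallel can be sketched with plain Scala collections and Futures. This is not the Spark API; the function name `parallelSumOfSquares` and the partitioning scheme are assumptions made for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Local analogy, not Spark: split an immutable collection into partitions,
// sum the squares of each partition in parallel, then combine the results.
def parallelSumOfSquares(data: Vector[Int], numPartitions: Int): Long = {
  val partitionSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
  val partitions = data.grouped(partitionSize).toVector
  // One Future per partition; the input collection is never modified.
  val partials = partitions.map(p => Future(p.map(x => x.toLong * x).sum))
  Await.result(Future.sequence(partials), 10.seconds).sum
}
```

In Spark the partitions live on different nodes and the scheduling is handled by the framework, so user code only describes the per-element operation.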
Ratings Data
Parse Input
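import org.apache.spark.mllib.recommendation.Rating
// parse an input line of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// ratingText is assumed to be an RDD[String] holding the lines of ratings.dat
// create an RDD of Rating objects and cache it for reuse
val ratingsRDD = ratingText.map(parseRating).cache()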
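The parser above throws on a malformed line. A more defensive variant might look like the following plain-Scala sketch; the local `Rating` case class is a stand-in for the MLlib class so that the snippet runs without Spark, and `parseRatingSafe` is a hypothetical name.

```scala
import scala.util.Try

// Local stand-in for org.apache.spark.mllib.recommendation.Rating, so the
// sketch runs without Spark on the classpath.
case class Rating(user: Int, product: Int, rating: Double)

// Returns None instead of throwing when a line is malformed.
// Any fields after the third one are ignored.
def parseRatingSafe(line: String): Option[Rating] =
  line.split("::") match {
    case Array(user, movie, rating, _*) =>
      Try(Rating(user.toInt, movie.toInt, rating.toDouble)).toOption
    case _ => None
  }
```

Returning an `Option` lets the caller decide whether to drop bad lines or count them, rather than failing the whole job on the first bad record.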
Build Model
[Diagram: data is split into a training set and a test set; the training set is used to build the model]
split ratings RDD into training data RDD (80%) and test data RDD (20%)
build a user product matrix model
Create Model
import org.apache.spark.mllib.recommendation.ALS
// Randomly split ratings RDD into training data RDD (80%)
// and test data RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// build an ALS user product matrix model with rank=20, iterations=10
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
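To see what a weighted random split with a fixed seed does, here is a local sketch on an ordinary Scala sequence. It is an illustration of the idea, not Spark's implementation (Spark's sampling differs in its details), and `randomSplitLocal` is a hypothetical name.

```scala
import scala.util.Random

// Local illustration of a weighted random split; not Spark's implementation.
// Each element draws a uniform number in [0, 1) and goes to the first bucket
// whose cumulative normalized weight exceeds that draw.
def randomSplitLocal[T](data: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
  val total = weights.sum
  val cumulative = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
  val rng = new Random(seed)
  val assigned = data.map { x =>
    val draw = rng.nextDouble()
    val idx = cumulative.indexWhere(draw < _)
    // Guard against rounding leaving the last boundary slightly below 1.0.
    val bucket = if (idx >= 0) idx else weights.length - 1
    (bucket, x)
  }
  weights.indices.map(i => assigned.filter(_._1 == i).map(_._2))
}
```

Because the split is random, the 80%/20% proportions are only approximate; fixing the seed makes the split reproducible between runs.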
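Get predictions
// (user, movie) pairs from the test ratings, used as prediction input
val testUserProductRDD = testRatingsRDD.map(r => (r.user, r.product))
// get predicted ratings to compare to test ratings
// call model.predict with the test (user ID, movie ID) pairs
val predictionsForTestRDD = model.predict(testUserProductRDD)
[Diagram: test (user, movie) pairs -> model -> predicted ratings]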
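Conceptually, a matrix factorization model predicts a rating as the dot product of the user's factor vector and the product's factor vector. The plain-Scala sketch below shows that computation; the factor values are made up, and `userFactors`, `productFactors`, and `predictRating` are hypothetical names rather than the MLlib API.

```scala
// Illustrative only: the factor values below are made up, and these maps stand
// in for the user and product factor matrices learned by ALS.
val userFactors: Map[Int, Array[Double]] =
  Map(1 -> Array(0.5, 1.0), 2 -> Array(1.5, 0.2))
val productFactors: Map[Int, Array[Double]] =
  Map(10 -> Array(2.0, 1.0), 20 -> Array(0.4, 3.0))

// Predicted rating = dot product of the user's and the product's factor vectors.
// Returns None when the user or the product has no learned factors.
def predictRating(user: Int, product: Int): Option[Double] =
  for {
    u <- userFactors.get(user)
    p <- productFactors.get(product)
  } yield u.zip(p).map { case (a, b) => a * b }.sum
```

For user 1 and product 10 the prediction is 0.5 * 2.0 + 1.0 * 1.0 = 2.0. A user or product that was absent from the training data has no factor vector, so no prediction can be made for it.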
Compare predictions to Tests
Join predicted ratings to test ratings in order to compare
[Diagram: key-value pairs ((user, product), test rating) joined with ((user, product), predicted rating) give ((user, product), (test rating, predicted rating))]
Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// join the test with predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
  .join(predictionsKeyedByUserProductRDD)
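The join keeps only keys that appear on both sides and pairs up their values. A local sketch of the same inner-join semantics on ordinary Scala Maps follows; `innerJoin` is a hypothetical helper and the sample data is made up.

```scala
// Local illustration of an inner join on (user, product) keys using Maps.
// Keys present on only one side are dropped from the result.
def innerJoin(test: Map[(Int, Int), Double],
              predicted: Map[(Int, Int), Double]): Map[(Int, Int), (Double, Double)] =
  test.collect {
    case (key, testRating) if predicted.contains(key) =>
      (key, (testRating, predicted(key)))
  }
```

For example, joining a test map holding keys (1, 10) and (2, 20) with a prediction map holding keys (1, 10) and (3, 30) yields a single entry for (1, 10), carrying the test rating and the predicted rating as a pair.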
Compare predictions to Tests
Find false positives, where test rating <= 1 and predicted rating >= 4
[Diagram: key-value pairs ((user, product), (test rating, predicted rating))]
Test Model
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
Array[((Int, Int), (Double, Double))] =
((3842,2858),(1.0,4.106488210964762)),
((6031,3194),(1.0,4.790778049100913))
Test Model Mean Absolute Error
// Evaluate the model using Mean Absolute Error (MAE) between
// test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = testRating - predRating
    Math.abs(err)
}.mean()
meanAbsoluteError: Double = 0.7244940545944053
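Both evaluation steps, the false-positive filter and the mean absolute error, can be checked by hand on a small local collection. The sketch below uses made-up ratings and plain Scala Maps rather than RDDs; the names `joined`, `falsePositiveKeys`, and `mae` are hypothetical.

```scala
// Made-up joined data: (user, product) -> (test rating, predicted rating).
val joined: Map[(Int, Int), (Double, Double)] = Map(
  ((1, 10), (1.0, 4.5)), // rated low, predicted high: a false positive
  ((1, 20), (5.0, 4.0)),
  ((2, 10), (3.0, 2.5))
)

// False positives: test rating <= 1 but predicted rating >= 4.
val falsePositiveKeys: Set[(Int, Int)] =
  joined.collect { case (key, (t, p)) if t <= 1 && p >= 4 => key }.toSet

// Mean absolute error: average of |test - predicted| over all joined pairs.
val mae: Double =
  joined.values.map { case (t, p) => math.abs(t - p) }.sum / joined.size
```

Here the absolute errors are 3.5, 1.0, and 0.5, so the MAE is 5.0 / 3, about 1.67, and the only false positive is the pair (1, 10).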
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
Machine Learning Blog
• https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with Databricks
– mapr-spark package with Spark, Shark, Spark Streaming today
– Spark-python, GraphX and MLlib soon
• YARN integration
– Spark can then allocate resources from cluster when needed
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
Q&A
Engage with us!
@mapr, maprtech, mapr-technologies



Editor's Notes

  • #4: Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). The collaborative filtering approach is based on similarity; the basic idea is people who liked similar items in the past will like similar items in the future. In the example shown, Ted likes movies A, B, and C. Carol likes movies B and C. Bob likes movie B. To recommend a movie to Bob, we calculate that users who liked B also liked C, so C is a possible recommendation for Bob. Of course, this is a tiny example. In real situations, we would have much more data to work with.
  • #5: The goal of a collaborative filtering algorithm is to take preferences data from users and to create a model which can be used for recommendations or predictions. Ted likes movies A, B, and C. Carol likes movies B and C. We take this data and run it through an algorithm to build a model. Then, when we have new data such as "Bob likes movie B", we use the model to predict that C is a possible recommendation for Bob.
  • #6: ALS approximates the sparse U×I user-item rating matrix as the product of two dense matrices, the User and Item factor matrices, of size U×K and I×K, where K is the number of latent features (see picture below). The factor matrices are also called latent feature models. The factor matrices represent hidden features which the algorithm tries to discover. One matrix tries to describe the latent or hidden features of each user, and one tries to describe latent properties of each movie. ALS is an iterative algorithm. In each iteration, the algorithm alternately fixes one factor matrix and solves for the other, and this process continues until it converges. This alternation between which matrix to optimize is where the "alternating" in the name comes from.
  • #7: A typical machine learning workflow is shown. We will perform the following steps: load the sample data; parse the data into the input format for the ALS algorithm; split the data into two parts, one for building the model and one for testing the model; run the ALS algorithm to build/train a user product matrix model; make predictions with the training data and observe the results; test the model with the test data.
  • #8: Spark is especially useful for parallel processing of distributed data with iterative algorithms. Spark tries to keep things in memory, whereas MapReduce involves more reading and writing from disk. As shown in the image below, for each MapReduce Job, data is read from an HDFS file for a mapper, written to and from a SequenceFile in between, and then written to an output file from a reducer. When a chain of multiple jobs is needed, Spark can execute much faster by keeping data in memory.
  • #9: Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
  • #10: An RDD is simply a distributed collection of elements. You can think of the distributed collection as kind of like an array or list in your single-machine program, except that it is spread out across multiple nodes in the cluster. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them. So, Spark gives you APIs and functions that let you do something on the whole collection in parallel using all the nodes.
  • #12: We use the org.apache.spark.mllib.recommendation.Rating class for parsing the ratings.dat file. Later we will use the Rating class as input for the ALS run method. Then we use the map transformation on ratingText, which will apply the parseRating function to each element in ratingText and return a new RDD of Rating objects. We cache the ratings data, since we will use this data to build the matrix model.
  • #13: Next we split the data into two parts, one for building the model and one for testing the model. Then we run the ALS algorithm to build/train a user product matrix model.
  • #14: Next we split the data into two parts, one for building the model and one for testing the model. Then we run the ALS algorithm to build/train a user product matrix model.
  • #15: Next we get predicted movie ratings for the test data by calling model.predict with the test user ID, movie ID input data.
  • #16: Next we will compare the test (user ID, movie ID) ratings to the predicted ratings for the same (user ID, movie ID) pairs.
  • #17: Here we create ((user ID, movie ID), rating) key-value pairs for joining, in order to compare the test ratings to the predicted ratings.
  • #18: Next we will compare the test (user ID, movie ID) ratings to the predicted ratings for the same (user ID, movie ID) pairs.
  • #19: Here we compare test ratings and predicted ratings by filtering on ratings where the test rating is <= 1 and the predicted rating is >= 4.
  • #20: We register the DataFrame as a table. Registering it as a table allows us to use it in subsequent SQL statements. Now we can inspect the data.
  • #22: https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark