Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Apache Mahout?

• A scalable machine learning infrastructure


• Built on top of Hadoop MapReduce


• Currently supports:


   • Clustering, classification, collaborative filtering, and more




A Little History

• Founded by folks active in the Lucene community


• Inspired by work at Stanford: “Map-Reduce for Machine Learning on
  Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf




Project Goal

• Create a community-driven, scalable, and robust machine learning
  infrastructure


• Leverage Hadoop for parallel processing and scalability


• Provide an abstraction on top of Hadoop so that machine-learning users do
  not have to deal with the map and reduce primitives when they build their
  solutions.




Supported Algorithms

 • Collaborative Filtering


 • User- and item-based recommenders


 • K-Means, Fuzzy K-Means clustering


 • Mean Shift clustering


 • Dirichlet process clustering




More Supported Algorithms

 • Latent Dirichlet Allocation


 • Singular value decomposition


 • Parallel Frequent Pattern mining


 • Complementary Naive Bayes classifier


 • Random forest (decision-tree-based) classifier


 • ...and growing




Focus Areas

 • Collaborative Filtering


 • Clustering


 • Classification




Build and Install

• Required Software:


   • Java 1.6.x


   • Maven 2.0.11+


• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout


• Compile & install core & examples: mvn install


   • Alternatively, run the phases individually: mvn compile, mvn package, and mvn install




Recommendation Examples

 • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"


 • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples




Common Use Cases

 • Shopping: Amazon, Netflix


 • Who to follow/friend: Twitter/Facebook


 • Web resource classification, spam filtering, financial-market pattern
   recognition




Collaborative Filtering Basis

  • User-based: recommend items by finding similar users. Because user
    preferences keep changing, this method poses challenges.


  • Item-based: calculate similarity between items and make
    recommendations from it. Items usually change little, so this method is
    often reliable.


  • Slope-one: fast and efficient item-based recommendation when user
    ratings are more than boolean (yes/no, like/dislike).


  • Model-based: provide recommendations by building a model of users and
    their ratings.
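
To make the slope-one bullet concrete, here is a minimal sketch of a weighted Slope One predictor in plain Java. It is an illustration under stated assumptions, not Mahout's implementation: the class name SlopeOneSketch, the string item IDs, and the nested-map layout (user -> item -> rating) are all hypothetical choices made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative weighted Slope One predictor; not Mahout's implementation. */
class SlopeOneSketch {
    // diffSum.get(i).get(j) = sum over users of (rating of i - rating of j).
    private final Map<String, Map<String, Double>> diffSum = new HashMap<>();
    // count.get(i).get(j) = number of users who rated both i and j.
    private final Map<String, Map<String, Integer>> count = new HashMap<>();

    /** Learns pairwise item rating differences from user -> (item -> rating) data. */
    SlopeOneSketch(Map<String, Map<String, Double>> ratings) {
        for (Map<String, Double> userRatings : ratings.values()) {
            for (Map.Entry<String, Double> a : userRatings.entrySet()) {
                for (Map.Entry<String, Double> b : userRatings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) continue;
                    diffSum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                           .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    count.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    /** Predicts the user's rating for target, or NaN if no co-rated item exists. */
    double predict(Map<String, Double> userRatings, String target) {
        Map<String, Double> sums = diffSum.get(target);
        if (sums == null) return Double.NaN;
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            String item = rated.getKey();
            if (item.equals(target) || !sums.containsKey(item)) continue;
            int n = count.get(target).get(item);
            double averageDiff = sums.get(item) / n;
            // Each co-rated item votes "my rating + average difference", weighted by support.
            numerator += (rated.getValue() + averageDiff) * n;
            denominator += n;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }
}
```

The predictor needs graded ratings because it works with rating differences; with purely boolean likes every difference would be zero.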




Clustering Basis

 • Clustering algorithms also use the notion of similarity to group similar
   items into a cluster.


 • Both collaborative filtering and clustering rely on a distance measure,
   which can be calculated using a number of different techniques.


    • Example: Euclidean distance
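
As a small illustration of the distance idea, the sketch below computes the Euclidean distance between two preference vectors in plain Java. The class and method names are hypothetical, and the mapping from distance to similarity, 1 / (1 + d), is one common convention assumed here rather than something stated on the slide.

```java
/** Illustrative Euclidean distance between two equal-length vectors. */
class EuclideanDistanceSketch {
    /** Returns sqrt(sum_i (a_i - b_i)^2). */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumOfSquares += d * d;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Assumed convention: maps a distance in [0, inf) to a similarity in (0, 1]. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }
}
```

Identical vectors have distance 0 and therefore similarity 1; the similarity shrinks toward 0 as the vectors move apart.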




Mahout Taste Framework

• Taste Collaborative Filtering:


   • Taste is an open source project for CF started by Sean Owen on
     SourceForge and donated to Mahout in 2008.


   • Has been applied to a number of different data sets successfully.


• Mahout supports building recommendation engines primarily on the basis of
  the Taste library.


   • The library supports both user-based and item-based recommendations.


• Can be used from Java or through RESTful web-service endpoints.




Taste Framework : Primary Classes

 • DataModel: Model for Users, Items, and Preferences


 • UserSimilarity: Interface defining the similarity between two users


 • ItemSimilarity: Interface defining the similarity between two items


 • Recommender: Interface for providing recommendations


 • UserNeighborhood: Interface for computing a neighborhood of similar
   users. These are used by the Recommenders.
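
To show how these roles relate, here is a toy user-based recommender in plain Java. It is only a sketch: the class ToyUserBasedRecommender and its method signatures are hypothetical and deliberately do not mirror the Taste interfaces, and the similarity measure (1 / (1 + Euclidean distance over co-rated items)) is an assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy user-based recommender; names are hypothetical and are not Mahout's API. */
class ToyUserBasedRecommender {
    // Plays the DataModel role: user -> (item -> preference).
    private final Map<String, Map<String, Double>> prefs;

    ToyUserBasedRecommender(Map<String, Map<String, Double>> prefs) {
        this.prefs = prefs;
    }

    /** Plays the UserSimilarity role: 1 / (1 + Euclidean distance over co-rated items). */
    double similarity(String u, String v) {
        double sum = 0.0;
        int shared = 0;
        for (Map.Entry<String, Double> e : prefs.get(u).entrySet()) {
            Double other = prefs.get(v).get(e.getKey());
            if (other == null) continue;
            double d = e.getValue() - other;
            sum += d * d;
            shared++;
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sum));
    }

    /** Plays the UserNeighborhood role: the n users most similar to the given user. */
    List<String> neighborhood(String user, int n) {
        List<String> others = new ArrayList<>(prefs.keySet());
        others.remove(user);
        others.sort((a, b) -> Double.compare(similarity(user, b), similarity(user, a)));
        return others.subList(0, Math.min(n, others.size()));
    }

    /** Plays the Recommender role: similarity-weighted average of the neighbors' preferences. */
    double estimate(String user, String item, int n) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (String neighbor : neighborhood(user, n)) {
            Double p = prefs.get(neighbor).get(item);
            if (p == null) continue;
            double w = similarity(user, neighbor);
            weighted += w * p;
            totalWeight += w;
        }
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }
}
```

The point of the split is that each role can be swapped independently: a different similarity measure or neighborhood rule changes the estimates without touching the data model.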




Taste Framework : Online vs Offline

 • Can do online recommendations for a few thousand data sets.


 • Leverages Hadoop for offline recommendation calculations on large data
   sets.




Understanding the Group Lens Implementation

• Provides insight into a sample Mahout Taste framework implementation.


• Uses the publicly available GroupLens data set


• Part of the distribution so you can analyze it, modify it, and use it as an
  inspiration for your own implementation


• Easy to follow example




Group Lens Implementation Source

• GroupLensDataModel.java


• GroupLensRecommender.java


• GroupLensRecommenderBuilder.java


• GroupLensRecommenderEvaluatorRunner.java




Group Lens Runner -- evaluator

• Instantiates an evaluator:


   • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();


   • a “mean absolute error” (MAE) metric: the average absolute difference between estimated and actual preferences


• Parses input parameters:


   • File ratingsFile = TasteOptionParser.getRatings(args);
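
The score such an evaluator reports can be sketched in a few lines of plain Java. This is an illustration of the average-absolute-difference idea only; the class and method names are hypothetical and the code is not taken from Mahout.

```java
/** Illustrative average absolute difference between actual and estimated preferences. */
class MeanAbsoluteErrorSketch {
    /** Returns (1/n) * sum_i |actual_i - estimated_i|; lower is better, 0 is a perfect match. */
    static double meanAbsoluteError(double[] actual, double[] estimated) {
        if (actual.length != estimated.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - estimated[i]);
        }
        return total / actual.length;
    }
}
```

For actual ratings 4, 3, 5 and estimates 3.5, 3, 4 the errors are 0.5, 0, and 1, so the score is 1.5 / 3 = 0.5.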




Group Lens Runner -- data model

 • Parses a colon-delimited ratings file:


    • DataModel model = ratingsFile == null ? new GroupLensDataModel() :
      new GroupLensDataModel(ratingsFile);
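
What parsing such a file might involve can be sketched in plain Java. The sketch assumes each line has the form user::item::rating with an optional trailing ::timestamp; that layout, the class name, and the sample values in the usage note are assumptions for illustration, and the code is not how GroupLensDataModel itself is written.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for lines shaped like user::item::rating[::timestamp]. */
class ColonDelimitedRatingsSketch {
    /** One parsed preference. */
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses a single line; any fields after the rating (such as a timestamp) are ignored. */
    static Rating parseLine(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected user::item::rating, got: " + line);
        }
        return new Rating(Long.parseLong(fields[0]), Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    /** Parses every non-blank line in order. */
    static List<Rating> parseAll(Iterable<String> lines) {
        List<Rating> ratings = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                ratings.add(parseLine(line));
            }
        }
        return ratings;
    }
}
```

For example, parseLine("7::42::3.5::1000") yields user 7, item 42, and a rating of 3.5, with the timestamp dropped.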



Group Lens Runner -- evaluate with recommendation builder

• Evaluates using GroupLensRecommender


  • double evaluation = evaluator.evaluate(new
    GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
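
The last two arguments of the call above appear to be a training percentage (0.9) and an evaluation percentage (0.3). Assuming that reading, the hold-out idea can be sketched in plain Java: roughly 30% of users take part in the evaluation, and for each of them about 90% of preferences train the recommender while the rest are held out and compared with its estimates. The class below is a hypothetical stand-in for that procedure, not Mahout code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative hold-out split of one user's preferences; not Mahout's implementation. */
class HoldOutSplitSketch {
    final List<Double> training = new ArrayList<>();
    final List<Double> heldOut = new ArrayList<>();

    /** Decides whether a user takes part in the evaluation at all. */
    static boolean includeUser(double evaluationPercentage, Random random) {
        return random.nextDouble() < evaluationPercentage;
    }

    /** Sends each preference to training with probability trainingPercentage, else holds it out. */
    HoldOutSplitSketch(List<Double> userPrefs, double trainingPercentage, Random random) {
        for (Double pref : userPrefs) {
            if (random.nextDouble() < trainingPercentage) {
                training.add(pref);
            } else {
                heldOut.add(pref);
            }
        }
    }
}
```

Because the split is random, repeated runs can give slightly different scores unless the random generator is seeded.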




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com
