The what, the why and the how
Speaker: Robin Anil, Apache Mahout PMC Member

          SDEC, Seoul, Korea, June 2011
about:me
● Apache Mahout PMC member
● An ML believer
● Author of Mahout in Action
● Software Engineer @ Google

 *in no particular order


● Previous Life: Google Summer of Code student for 2 years.
about:agenda
● Introducing Mahout
● Why Mahout?
● Bird's-eye view of Mahout
● Classic machine learning problems
● Short overview of Mahout clustering
about:mission




To build a scalable machine learning library
Scale!
● Scale to large datasets
   ○ Hadoop MapReduce implementations that scale linearly
     with data
   ○ Fast sequential algorithms whose runtime doesn’t depend
     on the size of the data
   ○ Goal: To be as fast as possible for any algorithm
● Scalable to support your business case
   ○ Apache License, Version 2.0
● Scalable community
   ○ Vibrant, responsive and diverse
   ○ Come to the mailing list and find out more
about:why
Existing machine learning libraries often:
● Lack community
● Lack scalability
● Lack documentation and examples
● Lack Apache licensing
● Are not well tested
● Are research oriented
● Are not built on existing production-quality libraries
Bird's-eye view of Mahout
● If you want to:
   ○ Encode
   ○ Analyze
   ○ Predict
   ○ Get the top N
about:encode
● Process data and convert to vectors
● Dictionary-based vs. randomizer-based
● Get best signals for generating vectors
   ○ Collocation information (ngrams)
   ○ Lp Normalization



  Data      Engineering   Camp
  1:1.0     2:1.0         3:1.0
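The encoding step above can be sketched in a few lines of plain Python. This is an illustration only, not Mahout's vectorizer: `build_dictionary`, `encode` and `hashed_encode` are hypothetical names, and treating "randomizer-based" as feature hashing is an assumption made for this sketch.

```python
import zlib
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct token an integer index, starting at 1."""
    dictionary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in dictionary:
                dictionary[token] = len(dictionary) + 1
    return dictionary


def encode(doc, dictionary):
    """Dictionary-based encoding: sparse vector {index: term count}."""
    counts = Counter(doc.lower().split())
    return {dictionary[t]: float(c) for t, c in counts.items() if t in dictionary}


def hashed_encode(doc, num_buckets):
    """Hashing-based encoding (assumed meaning of 'randomizer-based'):
    no dictionary is stored, tokens are hashed straight to a bucket index."""
    vector = {}
    for token in doc.lower().split():
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


docs = ["Data Engineering Camp"]
dictionary = build_dictionary(docs)
vector = encode(docs[0], dictionary)
print(vector)  # {1: 1.0, 2: 1.0, 3: 1.0}
```

The dictionary variant keeps vectors interpretable at the cost of storing the dictionary; the hashed variant needs no dictionary but may map two tokens to the same index.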
about:analyze
● Cluster and group data
   ○ K-Means
   ○ Fuzzy K-Means
   ○ Canopy
   ○ Mean Shift
   ○ Dirichlet process clustering
   ○ Spectral Clustering
● Co-cluster features / dimensionality reduction
   ○ Latent Dirichlet Allocation (LDA)
   ○ Singular Value Decomposition
about:clustering
News clusters
about:lda
● Grouping similar or co-occurring features into a topic
   ○ Topic “Lol Cat”:
       ■ Cat
       ■ Meow
       ■ Purr
       ■ Haz
       ■ Cheeseburger
       ■ Lol
about:predict
● Classification and Recommendation


● Classification:
   ○ Use features to learn a model
   ○ Apply the model to unknown objects
● Recommendation
   ○ Use pairwise (user-item) information to learn a model
   ○ For a given user, return highly likely items
about:classify
● Predicting the type of a new object based on its features
● The types are predetermined




                      Dog           Cat
about:classify
● Plenty of algorithms
   ○ Naïve Bayes
   ○ Complementary Naïve Bayes
   ○ Random Forests
   ○ Stochastic Gradient Descent (regression)
● Learn a model from manually classified data
● Predict the class of a new object based on its
  features and the learned model
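The learn-then-predict loop described above can be sketched with a tiny multinomial Naïve Bayes classifier. This is a plain-Python illustration with Laplace smoothing, not Mahout's implementation; the function names and the toy Dog/Cat training data are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Learn per-class document counts and token counts from (label, tokens) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for label, tokens in labelled_docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_tokens = sum(token_counts[label].values())
        for t in tokens:
            # Laplace (add-one) smoothing so unseen tokens do not zero out a class
            score += math.log((token_counts[label][t] + 1) / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("dog", ["bark", "fetch", "tail"]),
    ("dog", ["bark", "bone"]),
    ("cat", ["meow", "purr", "tail"]),
    ("cat", ["meow", "whiskers"]),
]
model = train(training)
print(classify(["bark", "tail"], model))  # dog
print(classify(["purr", "meow"], model))  # cat
```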
about:recommend
● Predict what the user likes based on
   ○ Their own historical behavior
   ○ The aggregate behavior of similar users
about:recommend
● Different types of recommenders
   ○ User based
   ○ Item based
   ○ Co-occurrence based
● Full framework for storage and for online and offline
  computation of recommendations
● Like clustering, there is a notion of similarity between users or items
   ○ Cosine, Tanimoto, Pearson and log-likelihood ratio (LLR)
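Three of the similarity measures named above are short enough to write out. The following is a minimal plain-Python sketch of the standard definitions (LLR is left out for brevity); it does not use Mahout's similarity classes, and the function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items over all items seen."""
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b)


def pearson(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)


print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Tanimoto only needs to know which items a user touched, so it suits data without ratings; cosine and Pearson use the rating values themselves.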
about:top-n
● Frequent Pattern Mining:
   ○ Identify top-K patterns
about:frequent-pattern-mining
● Find interesting groups of items based on how they co-occur in
  a dataset
about:parallel-fp-growth
● Identify the most commonly
  occurring patterns from
   ○ Sales Transactions
     buy “Milk, eggs and bread”
   ○ Query Logs
     ipad -> apple, tablet, iphone
   ○ Spam Detection
     Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
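To make "most commonly occurring patterns" concrete, here is a brute-force sketch that counts co-occurring item sets and returns the top K. It is deliberately not FP-Growth (and not parallel): it enumerates every small combination, which only works for toy data. The function name and the transactions are invented for the example.

```python
from collections import Counter
from itertools import combinations


def top_k_patterns(transactions, k, max_size=3):
    """Count every item set of size 2..max_size and return the k most frequent."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order so (a, b) == (b, a)
        for size in range(2, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return counts.most_common(k)


transactions = [
    ["milk", "eggs", "bread"],
    ["milk", "eggs", "bread", "butter"],
    ["milk", "bread"],
    ["eggs", "jam"],
]
print(top_k_patterns(transactions, 1))  # [(('bread', 'milk'), 3)]
```

FP-Growth reaches the same counts without enumerating all combinations, which is what makes it practical on large transaction logs.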
summary:in-short
● Plenty of overlap. There is no one algorithm to fit all problems.
● Analyze and iterate fast
● MapReduce implementations make these fly!
Did you know?
● Apache Mahout uses Colt high-performance collections
   ○ Open HashMaps instead of Chained HashMaps
   ○ Arrays of Primitive types
   ○ Available as Mahout Math library
● Mahout Vector uses integer encoding techniques to reduce
  space.
● The fastest classifier in Mahout doesn’t use MapReduce!
   ○ And it learns online
   ○ And it doesn’t look at all the data.
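What "learns online" means can be shown with a small logistic regression trained by stochastic gradient descent, one example at a time. This is a generic sketch, not Mahout's classifier: the class name `TinyOnlineClassifier`, the learning rate and the toy stream are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyOnlineClassifier:
    """Logistic regression updated one example at a time (hypothetical class)."""

    def __init__(self, num_features, learning_rate=0.5):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def update(self, x, y):
        """One SGD step on a single (features, label) pair; no data is stored."""
        error = y - self.predict_proba(x)
        self.bias += self.learning_rate * error
        self.weights = [w + self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]


clf = TinyOnlineClassifier(num_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50  # examples arrive one by one
for features, label in stream:
    clf.update(features, label)

print(clf.predict_proba([1.0, 0.0]) > 0.5)  # True
print(clf.predict_proba([0.0, 1.0]) < 0.5)  # True
```

Because each update touches only the current example, training can stop as soon as the model is good enough, without a pass over the full dataset.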
How to use Mahout
● Command line launcher bin/mahout
● See the list of tools and algorithms by running bin/mahout
● Run any algorithm by its shortname:
   ○ bin/mahout kmeans --help
● By default runs locally
● export HADOOP_HOME=/pathto/hadoop-0.20.2/
   ○ Runs on the cluster configured as per the conf files in the
     hadoop directory
● Use driver classes to launch jobs:
   ○ KMeansDriver.runjob(Path input, Path output …)
Clustering Walkthrough (tiny example)
● Input: set of text files in a directory
● Download Mahout and unzip
   ○ mvn install
   ○ bin/mahout seqdirectory -i <input> -o <seq-output>
   ○ bin/mahout seq2sparse -i <seq-output> -o <vector-output>
   ○ bin/mahout kmeans -i <vector-output>
     -c <cluster-temp> -o <cluster-output> -k 10 -cd 0.01 -x 20
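The roles of the convergence delta (-cd) and the iteration cap (-x) in the k-means command above can be seen in a small sequential sketch. This is plain Python with Euclidean distance, not Mahout's MapReduce job; the initial centroids are passed in explicitly here to keep the run deterministic, and all names and data points are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centroids, convergence_delta=0.01, max_iterations=20):
    """Iterate assign/update until no centroid moves more than convergence_delta
    or max_iterations is reached. Returns (centroids, clusters from the last pass)."""
    centroids = [list(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        shift = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            break
    return centroids, clusters


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, initial_centroids=[(1, 1), (9, 9.5)])
print(len(clusters[0]), len(clusters[1]))  # 3 3
```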
Clustering Walkthrough (a bit more)

● Use bigrams: -ng 2
● Prune low frequency: -s 10
● Normalize: -n 2


● Use a distance measure:
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
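What the bigram, pruning and normalization options do to the data can be sketched as three small functions. This is an illustration in plain Python, not Mahout's code: the function names are hypothetical, and reading the pruning threshold as a minimum total count across the corpus is an assumption of this sketch.

```python
from collections import Counter


def ngrams(tokens, max_size=2):
    """Unigrams plus all n-grams up to max_size, joined with a space."""
    grams = []
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams


def prune(doc_term_lists, min_support):
    """Drop terms whose total count across all documents is below min_support."""
    totals = Counter(t for terms in doc_term_lists for t in terms)
    return [[t for t in terms if totals[t] >= min_support] for terms in doc_term_lists]


def lp_normalize(vector, p=2):
    """Scale a sparse vector so that its Lp norm is 1."""
    norm = sum(abs(v) ** p for v in vector.values()) ** (1.0 / p)
    if norm == 0:
        return dict(vector)
    return {k: v / norm for k, v in vector.items()}


print(ngrams(["data", "engineering", "camp"]))
# ['data', 'engineering', 'camp', 'data engineering', 'engineering camp']
print(prune([["a", "b"], ["a", "c"]], min_support=2))  # [['a'], ['a']]
```

Normalizing to unit L2 length pairs naturally with a cosine distance measure, since cosine distance ignores vector length anyway.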
Clustering Walkthrough (viewing results)
● bin/mahout clusterdump
   -s cluster-output/clusters-9/part-00000
   -d vector-output/dictionary.file-*
   -dt sequencefile -n 5 -b 100
● Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997
Road to Mahout v1.0
● Guiding Principles
   ○ Use the stable Hadoop API
   ○ Make vector the de facto input format for all parts of the code
   ○ Provide a stable API for developers
Get Started
● http://mahout.apache.org
● dev@mahout.apache.org - Developer mailing list
● user@mahout.apache.org - User mailing list
● Check out the documentation and wiki for a quick start
● http://svn.apache.org/repos/asf/mahout/trunk/ - Browse the code
Resources
● “Mahout in Action” Owen, Anil, Dunning, Friedman
  http://www.manning.com/owen
● “Taming Text” Ingersoll, Morton, Farris
  http://www.manning.com/ingersoll
● “Introducing Apache Mahout”
  http://www.ibm.com/developerworks/java/library/j-mahout/
Thanks to
● Apache Software Foundation
● Mahout Committers
● Google Summer of Code Organizers
● And Students
● Open source!


● And NHN for hosting this in Seoul!
● and the wonderful engineers (present and future) in the room.
References
● news.google.com
● Cat http://www.flickr.com/photos/gattou/3178745634/
● Dog http://www.flickr.com/photos/30800139@N04/3879737638/
● Milk Eggs Bread http://www.flickr.com/photos/nauright/4792775946/
● Amazon Recommendations
● twitter
