Introduction to Scalable Machine Learning with Apache Mahout
Grant Ingersoll
February 15, 2010

Introduction
  • You
    • Machine learning experience?
    • Business Intelligence?
    • Natural Lang. Processing?
    • Apache Hadoop?
  • Me
    • Co-founder, Apache Mahout
    • Apache Lucene/Solr committer
    • Co-founder, Lucid Imagination

Topics
  • What is Machine Learning?
  • ML Use Cases
  • What is Mahout?
  • What can I do with it right now?
  • Where’s Mahout headed?

What is Machine Learning?
  • Amazon.com
  • Google News

Really it’s…
  • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
    • Intro. To Machine Learning by E. Alpaydin
  • Subset of Artificial Intelligence
  • Lots of related fields:
    • Information Retrieval
    • Stats
    • Biology
    • Linear algebra
    • Many more

Common Use Cases
  • Recommend friends/dates/products
  • Classify content into predefined groups
  • Find similar content based on object properties
  • Find associations/patterns in actions/behaviors
  • Identify key topics in large collections of text
  • Detect anomalies in machine output
  • Ranking search results
  • Others?

Useful Terminology
  • Vectors/Matrices
  • Weights
  • Sparse
  • Dense
  • Norms
  • Features
  • Feature reduction
  • Occurrences and Cooccurrences

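To make a few of these terms concrete, here is a minimal Python sketch (not Mahout code) of a dense vector, its sparse form, and a p-norm. The helper names `to_sparse`, `norm`, and `normalize` are made up for illustration.

```python
def to_sparse(dense):
    """Sparse form: keep only the non-zero weights as {index: weight}."""
    return {i: w for i, w in enumerate(dense) if w != 0}


def norm(sparse, p=2):
    """p-norm of a sparse vector (p=1 is Manhattan length, p=2 is Euclidean length)."""
    return sum(abs(w) ** p for w in sparse.values()) ** (1.0 / p)


def normalize(sparse, p=2):
    """Scale the vector so that its p-norm is 1; an all-zero vector is returned unchanged."""
    n = norm(sparse, p)
    return {i: w / n for i, w in sparse.items()} if n else dict(sparse)


dense = [0, 3, 0, 0, 4, 0]   # dense: every position stored, zeros included
sparse = to_sparse(dense)    # sparse: only positions 1 and 4 are stored
unit = normalize(sparse)     # same direction, length 1
```

Text data usually produces vectors with far more zeros than non-zeros, which is why the sparse form matters in practice.
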
Getting Started with ML
  • Get your data
  • Decide on your features per your algorithm
  • Prep the data
    • Different approaches for different algorithms
  • Run your algorithm(s)
    • Lather, rinse, repeat
  • Validate your results
    • Smell test, A/B testing, more formal methods

Apache Mahout
  • http://dictionary.reference.com/browse/mahout
  • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  • Why Mahout?
    • Many Open Source ML libraries either:
      • Lack Community
      • Lack Documentation and Examples
      • Lack Scalability
      • Lack the Apache License ;-)
      • Or are research-oriented

Focus: Machine Learning
  • Applications
  • Examples
  • Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic
  • Math: Vectors/Matrices/SVD
  • Utilities: Lucene/Vectorizer
  • Collections (primitives)
  • Apache Hadoop
  • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Focus: Scalable
  • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
    • Some algorithms won’t scale to massive machine clusters
    • Others fit logically on a Map Reduce framework like Apache Hadoop
    • Still others will need other distributed programming models
    • Be pragmatic
  • Most Mahout implementations are Map Reduce enabled
  • Work in Progress

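For readers new to the Map Reduce model mentioned here, the following is a single-process Python sketch of its three steps (map, shuffle, reduce) using word counting. It only illustrates the programming model; it does not use Hadoop or Mahout, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(doc):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (a real framework does this across machines)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(docs):
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(["mahout rides hadoop", "hadoop scales"])
```

An algorithm "fits logically" on this model when its work can be split into independent per-record map steps followed by per-key aggregation.
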
Prepare Data from Raw Content
  • Data sources:
    • Lucene integration
        bin/mahout lucenevector …
    • Document Vectorizer
        bin/mahout seqdirectory …
        bin/mahout seq2sparse …
    • Programmatically
      • See the Utils module in Mahout
    • Database
    • File system

Recommendations
  • Extensive framework for collaborative filtering
  • Recommenders
    • User based
    • Item based
  • Online and Offline support
    • Offline can utilize Hadoop
  • Many different Similarity measures
    • Cosine, LLR, Tanimoto, Pearson, others

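Three of the similarity measures named above can be written in a few lines each. This is a plain-Python sketch of the textbook definitions, assuming each user is a dict of `{item: rating}`. It is not Mahout's implementation, and LLR (log-likelihood ratio) is left out for brevity.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two rating vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared items / all items; ignores rating values."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(a, b):
    """Pearson correlation over the items both users rated; 0.0 if undefined."""
    common = sorted(a.keys() & b.keys())
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[k] for k in common]
    ys = [b[k] for k in common]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


alice = {"A": 5, "B": 3, "C": 1}
same_taste = {"A": 10, "B": 6, "C": 2}     # same ordering, different scale
opposite_taste = {"A": 1, "B": 3, "C": 5}  # reversed ordering
```

Tanimoto is the usual choice when there are no ratings at all, only the fact that a user touched an item. Pearson and cosine need preference values.
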
Clustering
  • Document level
    • Group documents based on a notion of similarity
    • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    • Distance Measures
      • Manhattan, Euclidean, other
  • Topic Modeling
    • Cluster words across documents to identify topics
    • Latent Dirichlet Allocation

Categorization
  • Place new items into predefined categories:
    • Sports, politics, entertainment
  • Mahout has several implementations
    • Naïve Bayes
    • Complementary Naïve Bayes
    • Decision Forests

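As an illustration of the first technique listed, here is a tiny multinomial Naïve Bayes classifier with add-one (Laplace) smoothing in plain Python. It is a textbook sketch on made-up training sentences, not Mahout's implementation, and the `train` and `classify` names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("sports", "the team won the match"),
    ("sports", "a late goal decided the game"),
    ("politics", "the senate passed the bill"),
    ("politics", "voters elected a new senate"),
])
```

With this toy model, `classify(model, "goal in the match")` picks `sports` and `classify(model, "senate bill")` picks `politics`.
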
Freq. Pattern Mining
  • Identify frequently co-occurrent items
  • Useful for:
    • Query Recommendations
      • Apple -> iPhone, orange, OS X
    • Related product placement
      • “Beer and Diapers”
  • http://www.amazon.com

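The idea of "frequently co-occurrent items" can be sketched by counting item pairs across transactions and keeping those that reach a minimum support. This brute-force pair count is a simplification chosen for readability. The demos in this deck use Mahout's `fpg` command, which is a different and more scalable algorithm. The transactions are invented.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=2):
    """Return {(item_a, item_b): count} for pairs seen together in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pair counting grows quadratically with basket size and says nothing about larger item sets, which is the gap that dedicated frequent-pattern algorithms fill.
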
Evolutionary
  • Map-Reduce ready fitness functions for genetic programming
  • Integration with Watchmaker
    • http://watchmaker.uncommons.org/index.php
  • Problems solved:
    • Traveling salesman
    • Class discovery
    • Many others

How To: Recommenders
  • Data:
    • Users (abstract)
    • Items (abstract)
    • Ratings (optional)
  • Load the data model
  • Ask for Recommendations:
    • User-User
    • Item-Item

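The steps above (load a data model, then ask for recommendations) can be sketched for the user-user case in plain Python. This is an illustrative stand-in rather than Mahout's recommender API: the in-memory `ratings` dict plays the role of the data model, and `recommend` is a hypothetical helper that scores unseen items by a similarity-weighted average of other users' ratings.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(ratings, user, how_many=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores = defaultdict(float)
    weights = defaultdict(float)
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] += sim * rating
                weights[item] += sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:how_many]]


# "Data model": users, items, and ratings held in memory (made-up values).
ratings = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}
top = recommend(ratings, "alice")
```

An item-item recommender follows the same pattern with the roles swapped: it compares items by the users who rated them instead of comparing users by the items they rated.
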
Ugly Demo I
  • Group Lens Data: http://www.grouplens.org
  • http://lucene.apache.org/mahout/taste.html#demo
  • http://localhost:8080/RecommenderServlet?userID=1&debug=true
  • In other words: the reason why I work on servers, not UIs!

How to: Command Line
  • Most algorithms have a Driver program
    • Shell script in $MAHOUT_HOME/bin helps with most tasks
  • Prepare the Data
    • Different algorithms require different setup
  • Run the algorithm
    • Single Node
    • Hadoop
  • Print out the results
    • Several helper classes: LDAPrintTopics, ClusterDumper, etc.

Ugly Demo II - Prep
  • Data Set: Reuters
    • http://www.daviddlewis.com/resources/testcollections/reuters21578/
    • Convert to Text via http://www.lucenebootcamp.com/lucene-boot-camp-preclass-training/
  • Convert to Sequence File:
      bin/mahout seqdirectory --input <PATH> --output <PATH> --charset UTF-8
  • Convert to Sparse Vector:
      bin/mahout seq2sparse --input <PATH>/content/reuters/seqfiles/ --norm 2 --weight TF --output <PATH>/content/reuters/seqfiles-TF/ --minDF 5 --maxDFPercent 90

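To give a feel for what the sparse-vector step produces, the sketch below mimics, in plain Python, the kind of processing the options above suggest: term-frequency weights, pruning of terms by document frequency, and L2 normalization. It is a conceptual approximation under those assumptions, not Mahout's code. The parameter names `min_df` and `max_df_percent` are chosen to echo the flags and the documents are invented.

```python
import math
from collections import Counter


def vectorize(docs, min_df=1, max_df_percent=100):
    """Return (dictionary, vectors): term -> index, and one L2-normalized sparse TF vector per doc."""
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    kept = sorted(
        term for term, count in df.items()
        if count >= min_df and 100.0 * count / n_docs <= max_df_percent
    )
    dictionary = {term: index for index, term in enumerate(kept)}

    vectors = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in dictionary)
        length = math.sqrt(sum(c * c for c in tf.values()))
        vectors.append({dictionary[t]: c / length for t, c in tf.items()} if length else {})
    return dictionary, vectors


docs = [
    "the oil price rose",
    "the oil market fell",
    "the grain market rose",
    "the gold price fell",
]
dictionary, vectors = vectorize(docs, min_df=2, max_df_percent=90)
```

Here "the" is dropped because it appears in every document, and "grain" and "gold" are dropped because they appear in only one, which is a crude form of the feature reduction the deck refers to.
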
Ugly Demo II: Topic Modeling
  • Latent Dirichlet Allocation
      ./mahout lda --input <PATH>/content/reuters/seqfiles-TF/vectors/ --output <PATH>/content/reuters/seqfiles-TF/lda-output --numWords 34000 --numTopics 10
      ./mahout org.apache.mahout.clustering.lda.LDAPrintTopics --input <PATH>/content/reuters/seqfiles-TF/lda-output/state-19 --dict <PATH>/content/reuters/seqfiles-TF/dictionary.file-0 --words 10 --output <PATH>/content/reuters/seqfiles-TF/lda-output/topics --dictionaryType sequencefile
  • Good feature reduction (stopword removal) required

Ugly Demo III: Clustering
  • K-Means
    • Same Prep as UD II, except use TFIDF weight
      ./mahout kmeans --input <PATH>/content/reuters/seqfiles-TFIDF/vectors/part-00000 --k 15 --output <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans --clusters <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters
  • Print out the clusters:
      ./mahout clusterdump --seqFileDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/clusters-15/ --pointsDir <PATH>/content/reuters/seqfiles-TFIDF/output-kmeans/points/ --dictionary <PATH>/content/reuters/seqfiles-TFIDF/dictionary.file-0 --dictionaryType sequencefile --substring 20

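For intuition about what a k-means run does, here is a minimal single-machine sketch in Python with a pluggable distance measure. It is the textbook algorithm, not Mahout's distributed implementation. The initial centroids are passed in explicitly so the run is deterministic, and the sample points are invented.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, centroids, distance=euclidean, max_iter=20):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its points (empty clusters stay put).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Passing `distance=manhattan` swaps the distance measure without touching the rest of the loop, which is the same separation of concerns the deck's "Distance Measures" bullet points at.
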
Ugly Demo IV: Frequent Pattern Mining
  • Data: http://fimi.cs.helsinki.fi/data/
      ./mahout fpg -i <PATH>/content/freqitemset/accidents.dat -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
      ./mahout seqdump --seqFile patterns/fpgrowth/part-r-00000

What’s Next?
  • 0.3 release very soon
  • Parallel Singular Value Decomposition (Lanczos)
  • Stabilize APIs for 1.0 release
  • Benchmarking
  • Google Summer of Code?
  • More Algorithms
  • http://cwiki.apache.org/MAHOUT/howtocontribute.html

Resources
  • Slides and Full Details of Demos at:
    • http://lucene.grantingersoll.com/2010/02/13/intro-to-mahout-slides-and-demo-examples/
  • More Examples in Mahout SVN in the examples directory

Resources
  • http://lucene.apache.org/mahout
  • http://cwiki.apache.org/MAHOUT
  • mahout-{user|dev}@lucene.apache.org
  • http://svn.apache.org/repos/asf/lucene/mahout/trunk
  • http://hadoop.apache.org

Resources
  • “Mahout in Action” by Owen and Anil
  • “Introducing Apache Mahout”
    • http://www.ibm.com/developerworks/java/library/j-mahout/
  • “Programming Collective Intelligence” by Toby Segaran
  • “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

References
  • HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg
  • Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg
  • Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg
  • Google News: http://news.google.com
  • Amazon.com: http://www.amazon.com
  • Facebook: http://www.facebook.com
  • Mahout: http://lucene.apache.org/mahout
  • Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/
    • http://www.theregister.co.uk/2006/08/15/beer_diapers/
  • DMOZ: http://www.dmoz.org



Editor's Notes

  • #3: Hadoop experience? ML experience?
  • #5: A few things come to mind
  • #9: Think about data differently than traditional DB