SDEC2011 Mahout - the what, the how and the why

The what, the why and the how
Speaker: Robin Anil, Apache Mahout PMC Member

SDEC, Seoul, Korea, June 2011

about:me
● Apache Mahout PMC member
● An ML believer
● Author of Mahout in Action
● Software Engineer @ Google

*in no particular order

● Previous Life: Google Summer of Code student for 2 years.

about:agenda
● Introducing Mahout
● Why Mahout?
● Bird's-eye view of Mahout
● Classic Machine learning problems
● Short overview of Mahout clustering

about:mission

To build a scalable machine learning library

Scale!
● Scale to large datasets
○ Hadoop MapReduce implementations that scale linearly with data
○ Fast sequential algorithms whose runtime doesn’t depend on the size of the data
○ Goal: To be as fast as possible for any algorithm
● Scalable to support your business case
○ Apache License, Version 2.0
● Scalable community
○ Vibrant, responsive and diverse
○ Come to the mailing list and find out more

about:why
● Other machine learning libraries often:
○ Lack a community
○ Lack scalability
○ Lack documentation and examples
○ Lack Apache licensing
○ Are not well tested
○ Are research oriented
○ Are not built on existing production-quality libraries

Bird's-eye view of Mahout
● If you want to:
○ Encode
○ Analyze
○ Predict
○ Get the top N

about:encode
● Process data and convert to vectors
● Dictionary-based vs. randomizer-based
● Get best signals for generating vectors
○ Collocation information (ngrams)
○ Lp Normalization

Data Engineering Camp

1:1.0 2:1.0 3:1.0
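The two encoding styles above can be sketched in plain Java. This is a minimal illustration, not Mahout's encoder classes: the class name `EncodeSketch`, its methods, the whitespace tokenizer, and the 1-based indices (chosen to match the slide's `1:1.0 2:1.0 3:1.0`) are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the two encoding styles; not Mahout's API. */
final class EncodeSketch {

    /** Dictionary-based: every unseen term is assigned the next free index (1-based here). */
    static Map<Integer, Double> dictionaryEncode(String text, Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer index = dictionary.get(token);
            if (index == null) {
                index = dictionary.size() + 1;
                dictionary.put(token, index);
            }
            vector.merge(index, 1.0, Double::sum); // term frequency as the weight
        }
        return vector;
    }

    /** Randomizer-based (hashing): no dictionary is kept, so different terms may collide. */
    static double[] hashedEncode(String text, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            vector[Math.floorMod(token.hashCode(), numFeatures)] += 1.0;
        }
        return vector;
    }

    /** Formats a sparse vector as "index:value" pairs. */
    static String format(Map<Integer, Double> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Double> e : vector.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        // prints "1:1.0 2:1.0 3:1.0"
        System.out.println(format(dictionaryEncode("Data Engineering Camp", dictionary)));
    }
}
```

The dictionary variant needs a pass over the data (or shared state) to assign indices, while the hashed variant trades possible collisions for a fixed vector size known up front.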

about:analyze
● Cluster data into groups
○ K-Means
○ Fuzzy K-Means
○ Canopy
○ Mean Shift
○ Dirichlet process clustering
○ Spectral Clustering
● Co-cluster features / dimensionality reduction
○ Latent Dirichlet Allocation (LDA)
○ Singular Value Decomposition
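As a rough picture of what the first algorithm in the list does, here is a small sequential K-Means (Lloyd's algorithm) in plain Java. It is a sketch under assumptions, not Mahout's implementation: the class `KMeansSketch`, the parameter names `convergenceDelta` and `maxIterations`, and the choice of squared Euclidean distance are illustrative. In a MapReduce formulation, the assignment step would plausibly run on the map side and the averaging step on the reduce side.

```java
import java.util.Arrays;

/** Hypothetical sequential K-Means sketch; not Mahout's KMeansDriver. */
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Iterates until no centroid moves more than convergenceDelta, or maxIterations is hit. */
    static double[][] cluster(double[][] points, double[][] initial,
                              double convergenceDelta, int maxIterations) {
        double[][] centroids = new double[initial.length][];
        for (int c = 0; c < initial.length; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            for (double[] p : points) {            // assignment step
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < p.length; i++) {
                    sums[c][i] += p[i];
                }
            }
            double maxShift = 0.0;
            for (int c = 0; c < centroids.length; c++) {   // update step
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[sums[c].length];
                for (int i = 0; i < updated.length; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(updated, centroids[c])));
                centroids[c] = updated;
            }
            if (maxShift < convergenceDelta) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] initial = {{1, 1}, {10, 9}};
        // prints "[[1.0, 1.5], [9.5, 9.0]]"
        System.out.println(Arrays.deepToString(cluster(points, initial, 0.01, 20)));
    }
}
```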

about:clustering
News clusters

about:lda
● Grouping similar or co-occurring features into a topic
○ Topic “Lol Cat”:
■ Cat
■ Meow
■ Purr
■ Haz
■ Cheeseburger
■ Lol

about:predict
● Classification and Recommendation

● Classification:
○ Use features to learn a model
○ Apply the model to unknown items
● Recommendation:
○ Use pairwise (user-item) information to learn a model
○ For a given user, return the most likely items

about:classify
● Predicting the type of a new object based on its features
● The types are predetermined

Dog Cat

about:classify
● Plenty of algorithms
○ Naïve Bayes
○ Complementary Naïve Bayes
○ Random Forests
○ Stochastic Gradient Descent (regression)
● Learn a model from manually classified data
● Predict the class of a new object based on its
features and the learned model
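The learn-then-predict loop can be sketched with one of the listed techniques, logistic regression trained by stochastic gradient descent. This is a minimal plain-Java illustration, not Mahout's classifier code: the class `OnlineLogisticSketch`, the fixed learning rate, and the two toy features are assumptions made for the example.

```java
/** Hypothetical binary logistic regression trained one example at a time; not Mahout's API. */
final class OnlineLogisticSketch {
    private final double[] weights; // one weight per feature, plus a bias in the last slot
    private final double learningRate;

    OnlineLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures + 1];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the example belongs to class 1. */
    double probability(double[] features) {
        double z = weights[weights.length - 1];
        for (int i = 0; i < features.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step on a single labeled example (label is 0 or 1). */
    void train(double[] features, int label) {
        double error = label - probability(features);
        for (int i = 0; i < features.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
        weights[weights.length - 1] += learningRate * error;
    }

    int classify(double[] features) {
        return probability(features) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Toy, made-up features: {meows, barks}; label 1 = cat, label 0 = dog.
        OnlineLogisticSketch model = new OnlineLogisticSketch(2, 0.5);
        for (int pass = 0; pass < 200; pass++) {
            model.train(new double[] {1, 0}, 1);
            model.train(new double[] {0, 1}, 0);
        }
        System.out.println("meows -> class " + model.classify(new double[] {1, 0}));
        System.out.println("barks -> class " + model.classify(new double[] {0, 1}));
    }
}
```

Because each update touches one example only, a model of this kind can be trained on a stream without holding the whole dataset in memory.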

about:recommend
● Predict what the user likes based on
○ Their historical behavior
○ The aggregate behavior of similar users

about:recommend
● Different types of recommenders
○ User based
○ Item based
○ Co-occurrence based
● Full framework for storage and for online and offline computation of recommendations
● Like clustering, there is a notion of similarity in users or items
○ Cosine, Tanimoto, Pearson and LLR (log-likelihood ratio)
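Three of the similarity measures named above are short enough to sketch in plain Java. The class `SimilaritySketch` and its method names are hypothetical and are not Mahout's similarity classes; the log-likelihood ratio is left out to keep the example small. Vectors are dense arrays of equal length, and Tanimoto is computed over item sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical similarity helpers; not Mahout's recommender API. */
final class SimilaritySketch {

    /** Cosine of the angle between two rating vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = Math.sqrt(normA * normB);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    /** Tanimoto (Jaccard) coefficient: size of the intersection over size of the union. */
    static <T> double tanimoto(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    /** Pearson correlation: cosine similarity of the mean-centered vectors. */
    static double pearson(double[] a, double[] b) {
        return cosine(center(a), center(b));
    }

    private static double[] center(double[] v) {
        double mean = Arrays.stream(v).average().orElse(0.0);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] - mean;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> alice = new HashSet<>(Arrays.asList("milk", "eggs", "bread"));
        Set<String> bob = new HashSet<>(Arrays.asList("eggs", "bread", "butter"));
        System.out.println("tanimoto = " + tanimoto(alice, bob));
        System.out.println("cosine   = " + cosine(new double[] {5, 3, 0}, new double[] {4, 0, 2}));
        System.out.println("pearson  = " + pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```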

about:top-n
● Frequent Pattern Mining:
○ Identify top-K patterns

about:frequent-pattern-mining
● Find interesting groups of items based on how they co-occur in
a dataset

about:parallel-fp-growth
● Identify the most commonly occurring patterns from
○ Sales transactions: buy “Milk, eggs and bread”
○ Query logs: ipad -> apple, tablet, iphone
○ Spam detection: Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
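To make "commonly occurring patterns" concrete, the sketch below counts item pairs that appear together in at least a minimum number of transactions. It is deliberately brute force and is not the parallel FP-Growth algorithm the slide refers to: the class `PairCountSketch`, the `minSupport` parameter name, and the sample baskets are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical brute-force pair counter; a stand-in for, not an implementation of, FP-Growth. */
final class PairCountSketch {

    /** Returns every item pair that co-occurs in at least minSupport transactions. */
    static Map<String, Integer> frequentPairs(List<List<String>> transactions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> transaction : transactions) {
            // Deduplicate and sort so that each pair has one canonical "a,b" key.
            List<String> items = new ArrayList<>(new TreeSet<>(transaction));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "," + items.get(j), 1, Integer::sum);
                }
            }
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("milk", "eggs", "bread"),
                Arrays.asList("milk", "bread"),
                Arrays.asList("eggs", "milk"),
                Arrays.asList("bread", "butter"));
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

Pair counting grows quadratically with basket size and does not extend to larger itemsets, which is the gap that tree-based methods such as FP-Growth are designed to close.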

summary:in-short
● Plenty of overlap. There is no one algorithm to fit all problems.
● Analyze and iterate fast
● MapReduce implementations make these fly!

Did you know?
● Apache Mahout uses Colt high-performance collections
○ Open HashMaps instead of Chained HashMaps
○ Arrays of Primitive types
○ Available as Mahout Math library
● Mahout Vector uses integer encoding techniques to reduce space.
● Fastest classifier in Mahout doesn’t use MapReduce!
○ And it learns online
○ And it doesn’t look at all the data.
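The "open hash map over primitive arrays" idea can be illustrated with a small int-to-double map. This sketch is neither Colt's nor Mahout's code: the class `OpenIntDoubleMapSketch`, the linear probing strategy, the 0.5 load factor, and the reserved sentinel key are simplifying assumptions. The point is that keys and values live in two flat arrays, with no per-entry node objects or boxed numbers as in a chained `HashMap<Integer, Double>`.

```java
import java.util.Arrays;

/** Hypothetical open-addressing map from int to double; not Colt's or Mahout's class. */
final class OpenIntDoubleMapSketch {
    private static final int FREE = Integer.MIN_VALUE; // sentinel marking an empty slot

    private int[] keys = new int[8];
    private double[] values = new double[8];
    private int size;

    OpenIntDoubleMapSketch() {
        Arrays.fill(keys, FREE);
    }

    /** Linear probing: walk forward from the home slot until the key or a free slot is found. */
    private int findSlot(int key) {
        int i = Math.floorMod(key, keys.length);
        while (keys[i] != FREE && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        return i;
    }

    void put(int key, double value) {
        if (key == FREE) {
            throw new IllegalArgumentException("key is reserved as the empty-slot marker");
        }
        if ((size + 1) * 2 > keys.length) {
            resize(); // keep the table at most half full so probing always terminates
        }
        int i = findSlot(key);
        if (keys[i] == FREE) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    double get(int key, double defaultValue) {
        if (key == FREE) {
            return defaultValue;
        }
        int i = findSlot(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    int size() {
        return size;
    }

    private void resize() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        Arrays.fill(keys, FREE);
        values = new double[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != FREE) {
                put(oldKeys[i], oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        OpenIntDoubleMapSketch map = new OpenIntDoubleMapSketch();
        map.put(42, 1.5);
        map.put(-7, 2.5);
        System.out.println(map.get(42, 0.0) + " " + map.get(-7, 0.0) + " " + map.get(99, -1.0));
    }
}
```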

How to use Mahout
● Command line launcher bin/mahout
● See the list of tools and algorithms by running bin/mahout
● Run any algorithm by its shortname:
○ bin/mahout kmeans --help
● Runs locally by default
● export HADOOP_HOME=/pathto/hadoop-0.20.2/
○ Runs on the cluster configured as per the conf files in the Hadoop directory
● Use driver classes to launch jobs:
○ KMeansDriver.runJob(Path input, Path output …)

Clustering Walkthrough (tiny example)
● Input: set of text files in a directory
● Download Mahout and unzip
○ mvn install
○ bin/mahout seqdirectory -i <input> -o <seq-output>
○ bin/mahout seq2sparse -i <seq-output> -o <vector-output>
○ bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 10 -cd 0.01 -x 20

Clustering Walkthrough (a bit more)

● Use bigrams: -ng 2
● Prune low-frequency terms: -s 10
● Normalize: -n 2

● Use a distance measure: -dm org.apache.mahout.common.distance.CosineDistanceMeasure
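What the first three options ask for can be sketched in plain Java. This is an illustration under assumptions, not the code behind `seq2sparse`: the class `VectorizeOptionsSketch` and its methods are hypothetical, n-grams are assumed to include unigrams up to the requested size, and "low frequency" is assumed to mean a total term count below the threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helpers mirroring -ng, -s and -n; not Mahout's vectorizer. */
final class VectorizeOptionsSketch {

    /** All n-grams of size 1..maxNGram, with tokens joined by a single space. */
    static List<String> ngrams(List<String> tokens, int maxNGram) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxNGram; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    /** Keeps only the terms whose count reaches minSupport. */
    static Map<String, Integer> prune(Map<String, Integer> termCounts, int minSupport) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            if (e.getValue() >= minSupport) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Lp normalization for finite p >= 1: divides every component by the vector's p-norm. */
    static double[] normalize(double[] vector, double p) {
        double sum = 0.0;
        for (double x : vector) {
            sum += Math.pow(Math.abs(x), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        double[] out = new double[vector.length];
        if (norm == 0.0) {
            return out; // an all-zero vector stays all zeros
        }
        for (int i = 0; i < vector.length; i++) {
            out[i] = vector[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(Arrays.asList("data", "engineering", "camp"), 2));
        System.out.println(Arrays.toString(normalize(new double[] {3, 4}, 2)));
    }
}
```

With `p = 2` every document vector ends up with unit length, which keeps long documents from dominating distance computations.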

Clustering Walkthrough (viewing results)
● bin/mahout clusterdump
-s cluster-output/clusters-9/part-00000
-d vector-output/dictionary.file-*
-dt sequencefile -n 5 -b 100
● Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997

Road to Mahout v1.0
● Guiding Principles
○ Use the stable Hadoop API
○ Make vector the de facto input format for all parts of the code
○ Provide stable API for developers

Get Started
● http://mahout.apache.org
● dev@mahout.apache.org - Developer mailing list
● user@mahout.apache.org - User mailing list
● Check out the documentation and wiki for a quickstart
● http://svn.apache.org/repos/asf/mahout/trunk/ - Browse the code

Resources
● “Mahout in Action” Owen, Anil, Dunning, Friedman
http://www.manning.com/owen

● “Taming Text” Ingersoll, Morton, Farris
http://www.manning.com/ingersoll

● “Introducing Apache Mahout”
http://www.ibm.com/developerworks/java/library/j-mahout/

Thanks to
● Apache Software Foundation
● Mahout Committers
● Google Summer of Code Organizers
● And Students
● Open source!

● And NHN for hosting this in Seoul!
● And the wonderful engineers (present and future) in the room.

References
● news.google.com
● Cat http://www.flickr.com/photos/gattou/3178745634/
● Dog http://www.flickr.com/photos/30800139@N04/3879737638/
● Milk Eggs Bread http://www.flickr.com/photos/nauright/4792775946/
● Amazon Recommendations
● twitter
