1©MapR Technologies 2013- Confidential
Introduction to Mahout
And How To Build a Recommender
Me, Us
 Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
 MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry-standard APIs
 Tonight
Hash tag - #tchug
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
Sidebar on Drill
 Apache Drill
– SQL on Hadoop (and other things)
– Intended to solve problems for 1-5 years from now
Not the problems from 1-10 years ago
– Multiple levels of API supported
• SQL-2003
• Logical plan language (DAG in JSON)
• Physical plan language (DAG with push-down, exchange markers)
• Execution plan language (many DAGs)
 Current state
– SQL 2003 support in place
– Logical plan interpreter useful for testing
– Value vectors near completion
– High performance RPC working
More on Drill
 Just completed OSCON workshop
 Workshop materials available shortly
– Extracted technology demonstrators
– Sample queries
 Send me email or tweet for more info
What’s Up?
 What is Mahout?
– Math library
– Clustering, classifiers, other stuff
 Recommendation
– Generalities
– Algorithm Specifics
– System Design
– Important things never mentioned
 Final thoughts
What is Mahout?
 “Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
 Components
– math library
– clustering
– classification
– decompositions
– recommendations
Mahout Math
 Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data
 But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
 At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
 Highly composable API
 Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
Assign? View?
 Why assign?
– Copying is the major cost for naïve matrix packages
– In-place operations critical to reasonable performance
– Many kinds of updates required, so functional style very helpful
 Why view?
– In-place operations often required for blocks, rows, columns or diagonals
– With views, we need #assign + #views methods
– Without views, we need #assign x #views methods
 Synergies
– With both views and assign, many loops become single line
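To make the "#assign + #views" point concrete, here is a minimal sketch in plain Java. It does not use Mahout: `VectorView`, `viewDiagonal` and `viewRow` are hypothetical stand-ins written for this illustration, built on a bare `double[][]`. The idea it tries to show is that `assign` and `zSum` are written once and every view gets them for free, with all writes going straight through to the backing matrix.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Toy stand-ins for a matrix and its views; these are not Mahout classes. */
final class ViewAssignSketch {

    /** A writable window onto some elements of a backing array. */
    interface VectorView {
        int size();
        double get(int i);
        void set(int i, double value);

        /** In-place update: every element x becomes f(x). */
        default VectorView assign(DoubleUnaryOperator f) {
            for (int i = 0; i < size(); i++) {
                set(i, f.applyAsDouble(get(i)));
            }
            return this;
        }

        /** Sum of all elements seen through the view. */
        default double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += get(i);
            }
            return sum;
        }
    }

    /** View of the main diagonal; writes go straight through to m. */
    static VectorView viewDiagonal(double[][] m) {
        return new VectorView() {
            public int size() { return Math.min(m.length, m[0].length); }
            public double get(int i) { return m[i][i]; }
            public void set(int i, double value) { m[i][i] = value; }
        };
    }

    /** View of one row; it reuses the same assign and zSum code. */
    static VectorView viewRow(double[][] m, int row) {
        return new VectorView() {
            public int size() { return m[row].length; }
            public double get(int i) { return m[row][i]; }
            public void set(int i, double value) { m[row][i] = value; }
        };
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}};
        System.out.println(viewDiagonal(m).zSum());   // trace of m
        viewDiagonal(m).assign(x -> 0.0);             // zero the diagonal in place
        viewRow(m, 0).assign(x -> 2 * x);             // scale row 0 in place
        System.out.println(Arrays.deepToString(m));
    }
}
```

Adding a new kind of view costs three small methods, and adding a new kind of update costs one method on the interface, which is the additive (rather than multiplicative) growth the slide describes.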
Assign
 Matrices
    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);
 Vectors
    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);
Views
 Matrices
    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();
 Vectors
    Vector viewPart(int offset, int length);
Aggregates
 Matrices
    double zSum();
    Vector aggregateRows(VectorFunction f);
    Vector aggregateColumns(VectorFunction f);
    double aggregate(DoubleDoubleFunction combiner,
                     DoubleFunction mapper);
 Vectors
    double zSum();
    double aggregate(
        DoubleDoubleFunction reduce, DoubleFunction map);
    double aggregate(Vector other,
                     DoubleDoubleFunction aggregator,
                     DoubleDoubleFunction combiner);
Predefined Functions
 Many handy functions
ABS LOG2
ACOS NEGATE
ASIN RINT
ATAN SIGN
CEIL SIN
COS SQRT
EXP SQUARE
FLOOR SIGMOID
IDENTITY SIGMOIDGRADIENT
INV TAN
LOGARITHM
Examples
A = α
    double alpha; a.assign(alpha);
A = αB + β
    a.assign(b, Functions.chain(
        Functions.plus(beta),
        Functions.times(alpha)));
Sparse Optimizations
 DoubleDoubleFunction abstract properties
    public boolean isLikeRightPlus();
    public boolean isLikeLeftMult();
    public boolean isLikeRightMult();
    public boolean isLikeMult();
    public boolean isCommutative();
    public boolean isAssociative();
    public boolean isAssociativeAndCommutative();
    public boolean isDensifying();
 And Vector properties
    public boolean isDense();
    public boolean isSequentialAccess();
    public double getLookupCost();
    public double getIteratorAdvanceCost();
    public boolean isAddConstantTime();
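Why do these function properties matter? A rough sketch, in plain Java rather than Mahout: if a binary function declares that f(x, 0) = x (the idea behind a "like right plus" property), then `a.assign(b, f)` only has to visit the non-zero entries of `b`. The `BinaryFn` and `SparseVec` types below are hypothetical and exist only for this illustration; the real library chooses among more strategies using the vector cost properties as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

/** Toy illustration of property-driven sparse assignment; not Mahout code. */
final class SparseAssignSketch {

    /** A binary function plus one declared algebraic property. */
    static final class BinaryFn {
        final DoubleBinaryOperator op;
        final boolean likeRightPlus;   // true iff f(x, 0) == x for every x

        BinaryFn(DoubleBinaryOperator op, boolean likeRightPlus) {
            this.op = op;
            this.likeRightPlus = likeRightPlus;
        }
    }

    static final BinaryFn PLUS = new BinaryFn((x, y) -> x + y, true);
    static final BinaryFn TIMES = new BinaryFn((x, y) -> x * y, false);

    /** Sparse vector stored as index -> non-zero value. */
    static final class SparseVec {
        final int size;
        final Map<Integer, Double> nonZeros = new HashMap<>();

        SparseVec(int size) { this.size = size; }

        double get(int i) { return nonZeros.getOrDefault(i, 0.0); }

        void set(int i, double v) {
            if (v == 0.0) nonZeros.remove(i); else nonZeros.put(i, v);
        }

        /**
         * this[i] = f(this[i], other[i]) for all i; returns how many positions
         * were actually visited. Assumes other is a different object than this.
         */
        int assign(SparseVec other, BinaryFn f) {
            int visited = 0;
            if (f.likeRightPlus) {
                // f(x, 0) == x, so positions where other is zero cannot change.
                for (Map.Entry<Integer, Double> e : other.nonZeros.entrySet()) {
                    int i = e.getKey();
                    set(i, f.op.applyAsDouble(get(i), e.getValue()));
                    visited++;
                }
            } else {
                // No usable property declared: fall back to a full dense sweep.
                for (int i = 0; i < size; i++) {
                    set(i, f.op.applyAsDouble(get(i), other.get(i)));
                    visited++;
                }
            }
            return visited;
        }
    }

    public static void main(String[] args) {
        SparseVec a = new SparseVec(1000);
        SparseVec b = new SparseVec(1000);
        a.set(3, 1.0);
        b.set(3, 2.0);
        b.set(7, 5.0);
        System.out.println("visited with PLUS:  " + a.assign(b, PLUS));
        System.out.println("visited with TIMES: " + a.assign(b, TIMES));
    }
}
```

The sketch deliberately handles only one property; a multiplication-like property would allow a similar shortcut by visiting only the non-zeros of the left-hand vector.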
Examples
 The trace of a matrix
    m.viewDiagonal().zSum()
 Set diagonal to zero
    m.viewDiagonal().assign(0)
 Set diagonal to negative of row sums excluding the diagonal
    Vector diag = m.viewDiagonal().assign(0);
    diag.assign(m.rowSums().assign(Functions.NEGATE));
Iteration
 Matrices are Iterable in Mahout
    // compute both row and column sums in one pass
    for (MatrixSlice row : m) {
        rSums.set(row.index(), row.zSum());
        cSums.assign(row, Functions.PLUS);
    }
 Vectors are densely or sparsely iterable
    // entropy of a probability vector, visiting only the non-zero entries
    double entropy = 0;
    for (Vector.Element e : v.nonZeroes()) {
        entropy -= e.get() * Math.log(e.get());
    }
Random Sampling
 Samples from some type
    public interface Sampler<T> {
        T sample();
    }
    public abstract class AbstractSamplerFunction
        extends DoubleFunction
        implements Sampler<Double>
 Lots of kinds
    ChineseRestaurant   Missing       Normal
    Empirical           Multinomial   PoissonSampler
    IndianBuffet        MultiNormal   Sampler
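As a rough illustration of what sits behind the `Sampler<T>` interface, here is a self-contained sketch of a weighted index sampler. Only the interface shape comes from the slide; `WeightedIndexSampler` is a hypothetical class written for this example and is not Mahout's Multinomial implementation.

```java
import java.util.Random;

/** Toy sampler built on the Sampler<T> shape from the slide; not Mahout code. */
final class SamplerSketch {

    interface Sampler<T> {
        T sample();
    }

    /** Draws index i with probability weights[i] / sum(weights). */
    static final class WeightedIndexSampler implements Sampler<Integer> {
        private final double[] cumulative;   // running totals of the weights
        private final Random rand;

        WeightedIndexSampler(double[] weights, Random rand) {
            this.cumulative = new double[weights.length];
            double total = 0;
            for (int i = 0; i < weights.length; i++) {
                total += weights[i];
                cumulative[i] = total;
            }
            this.rand = rand;
        }

        @Override
        public Integer sample() {
            // Pick a point in [0, total) and find the first bucket that contains it.
            double u = rand.nextDouble() * cumulative[cumulative.length - 1];
            for (int i = 0; i < cumulative.length; i++) {
                if (u < cumulative[i]) {
                    return i;
                }
            }
            return cumulative.length - 1;   // guard against rounding at the top end
        }
    }

    public static void main(String[] args) {
        Sampler<Integer> s = new WeightedIndexSampler(new double[] {1, 0, 3}, new Random(42));
        int[] counts = new int[3];
        for (int i = 0; i < 10000; i++) {
            counts[s.sample()]++;
        }
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```

A linear scan is fine for a handful of outcomes; a binary search over the cumulative array would be the natural change for many outcomes.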
Clustering and Such
 Streaming k-means and ball k-means
– streaming reduces very large data to a cluster sketch
– ball k-means is a high quality k-means implementation
– the cluster sketch is also usable for other applications
– single machine threaded and map-reduce versions available
 SVD and friends
– stochastic SVD has in-memory, single machine out-of-core and map-reduce
versions
– good for reducing very large sparse matrices to tall skinny dense ones
 Spectral clustering
– based on SVD, allows massive dimensional clustering
Mahout Math Summary
 Matrices, Vectors
– views
– in-place assignment
– aggregations
– iterations
 Functions
– lots built-in
– cooperate with sparse vector optimizations
 Sampling
– abstract samplers
– samplers as functions
 Other stuff … clustering, SVD
Recommenders
Recommendations
 Often known as collaborative filtering
 Actors interact with items
– observe successful interaction
 We want to suggest additional successful interactions
 Observations inherently very sparse
The Big Ideas
 Cooccurrence is the core operation (and it is pretty simple)
 Cooccurrence can be extended to handle important new
capabilities
 Recommendation systems can be deployed ideally using search
technology
Examples of Recommendations
 Customers buying books (Linden et al)
 Web visitors rating music (Shardanand and Maes) or movies (Riedl,
et al), (Netflix)
 Internet radio listeners not skipping songs (Musicmatch)
 Internet video watchers watching >30 s (Veoh)
 Visibility in a map UI (new Google maps)
A simple recommendation architecture
 Look at the history of interactions
 Find significant item cooccurrence in user histories
 Use these cooccurring items as “indicators”
 For all indicators in user history, accumulate scores for related
items
Recommendation Basics
 History:
User Thing
1 3
2 4
3 4
2 3
3 2
1 1
2 1
Recommendation Basics
 History as matrix:
         t1  t2  t3  t4
    u1    1   0   1   0
    u2    1   0   1   1
    u3    0   1   0   1
 (t1, t3) cooccur 2 times,
 (t1, t4) once,
 (t2, t4) once,
 (t3, t4) once
A Quick Simplification
 Users who do h: A h
 Also do r:
    User-centric recommendations: r = A^T (A h)
    Item-centric recommendations: r = (A^T A) h
Recommendation Basics
 Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2
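The cooccurrence step is small enough to write out. The sketch below is plain Java with no Mahout dependency, and the class and method names are made up for this illustration. It builds the user × thing matrix A from (user, thing) pairs like those on the history slide, forms A^T A, and scores items for a history vector h using the item-centric form r = (A^T A) h.

```java
import java.util.Arrays;

/** Toy cooccurrence computation on dense int arrays; not Mahout code. */
final class CooccurrenceSketch {

    /** Builds the binary user x item matrix A from (user, item) pairs with 1-based ids. */
    static int[][] historyMatrix(int[][] pairs, int users, int items) {
        int[][] a = new int[users][items];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    /** Returns A^T A: entry (i, j) counts users who interacted with both item i and item j. */
    static int[][] cooccurrence(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] userRow : a) {
            for (int i = 0; i < items; i++) {
                if (userRow[i] == 0) {
                    continue;   // this user contributes nothing to row i
                }
                for (int j = 0; j < items; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    /** Item-centric scores r = (A^T A) h for a history vector h over items. */
    static int[] recommend(int[][] c, int[] h) {
        int[] r = new int[c.length];
        for (int i = 0; i < c.length; i++) {
            for (int j = 0; j < h.length; j++) {
                r[i] += c[i][j] * h[j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] a = historyMatrix(pairs, 3, 4);
        int[][] c = cooccurrence(a);
        System.out.println(Arrays.deepToString(c));
        // Scores for someone whose only history is item t4.
        System.out.println(Arrays.toString(recommend(c, new int[] {0, 0, 0, 1})));
    }
}
```

At real scale A is far too sparse and too large for dense arrays, which is why the deck points at a map-reduce job for this step; the arithmetic is the same.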
Problems with Raw Cooccurrence
 Very popular items co-occur with everything
– Welcome document
– Elevator music
 That isn’t interesting
– We want anomalous cooccurrence
Recommendation Basics
 Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2
 Contingency table for t1 and t3
              t3   not t3
    t1         2        1
    not t1     1        1
Spot the Anomaly
 Root LLR is roughly like standard deviations

                A     not A
    B          13      1000
    not B    1000   100,000      root LLR = 0.90

                A     not A
    B           1         0
    not B       0         2      root LLR = 1.95

                A     not A
    B           1         0
    not B       0    10,000      root LLR = 4.52

                A     not A
    B          10         0
    not B       0   100,000      root LLR = 14.3
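For readers who want to score their own 2×2 tables, here is a minimal sketch of the log-likelihood ratio (G²) test in plain Java, written from the standard entropy formulation. The class name is made up for this example, and the sketch returns an unsigned root; a production implementation would typically also attach a sign to show whether the cooccurrence is above or below what independence predicts.

```java
/** Toy log-likelihood ratio scorer for 2x2 contingency tables; not Mahout code. */
final class RootLlrSketch {

    private static double xLogX(double x) {
        return x == 0 ? 0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N ln N - sum(k ln k). */
    private static double entropy(double... counts) {
        double sum = 0;
        double sumXLogX = 0;
        for (double k : counts) {
            sum += k;
            sumXLogX += xLogX(k);
        }
        return xLogX(sum) - sumXLogX;
    }

    /**
     * LLR for the table [[k11, k12], [k21, k22]], where k11 counts
     * "A and B together" and k22 counts "neither A nor B".
     */
    static double llr(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0, 2 * (rowEntropy + colEntropy - matrixEntropy));
    }

    /** Square root of the LLR, which behaves roughly like a count of standard deviations. */
    static double rootLlr(double k11, double k12, double k21, double k22) {
        return Math.sqrt(llr(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        System.out.println(rootLlr(13, 1000, 1000, 100000));
        System.out.println(rootLlr(1, 0, 0, 2));
        System.out.println(rootLlr(1, 0, 0, 10000));
        System.out.println(rootLlr(10, 0, 0, 100000));
    }
}
```

The useful property is visible in the inputs: a perfect overlap of one event in three observations and a perfect overlap of ten events in a hundred thousand are both "perfect", but the score grows with the amount of evidence.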
Threshold by Score
 Cooccurrence
         t1  t2  t3  t4
    t1    2   0   2   1
    t2    0   1   0   1
    t3    2   0   2   1
    t4    1   1   1   2
Threshold by Score
 Significant cooccurrence => Indicators
t1 t2 t3 t4
t1 1 0 0 1
t2 0 1 0 1
t3 0 0 1 1
t4 1 0 0 1
So Far, So Good
 Classic recommendation systems based on these approaches
– Musicmatch (ca 2000)
– Veoh Networks (ca 2005)
 Currently available in Mahout
– See RowSimilarityJob
 Very simple to deploy
– Compute indicators
– Store in search engine
– Works very well with enough data
What’s right
about this?
Virtues of Current State of the Art
 Lots of well publicized history
– Musicmatch, Veoh, Netflix, Amazon, Overstock
 Lots of support
– Mahout, commercial offerings like Myrrix
 Lots of existing code
– Mahout, commercial codes
 Proven track record
 Well socialized solution
What’s wrong
about this?
Problems for Recommenders
 Cold start
 Disjoint populations
 Long tail
 Multiple kinds of evidence (multi-modal recommendations)
– unstructured add-on data
– other transaction streams
– textual descriptions
What is this multi-modal stuff?
 But people don’t just do one thing
 One kind of behavior is useful for predicting other kinds
 Having a complete picture is important for accuracy
 What has the user said, viewed, clicked, closed, bought lately?
Example Multi-modal Inputs
 Overlap in restaurant visits is useful
 Big spender cues
 Cuisine as an indicator
 Review text as an indicator
Too Limited
 People do more than one kind of thing
 Different kinds of behaviors give different quality, quantity and
kind of information
 We don’t have to do co-occurrence
 We can do cross-occurrence
 Result is cross-recommendation
Heh?
For example
 Users enter queries (A)
– (actor = user, item=query)
 Users view videos (B)
– (actor = user, item=video)
 A^T A gives query recommendation
– “did you mean to ask for”
 B^T B gives video recommendation
– “you might like these videos”
The punch-line
 B^T A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
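A tiny sketch may help show how little changes when moving from cooccurrence to cross-occurrence. In the plain-Java example below (hypothetical class, toy data invented for the illustration), A is users × queries, B is users × videos, and B^T A counts how often a video was viewed by users who issued a query. Multiplying by a query vector h then scores videos without ever looking at video content or meta-data.

```java
import java.util.Arrays;

/** Toy cross-occurrence computation on dense int arrays; not Mahout code. */
final class CrossOccurrenceSketch {

    /** Returns B^T A: entry (v, q) counts users who issued query q and viewed video v. */
    static int[][] crossOccurrence(int[][] b, int[][] a) {
        int users = a.length;
        int videos = b[0].length;
        int queries = a[0].length;
        int[][] c = new int[videos][queries];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < videos; v++) {
                for (int q = 0; q < queries; q++) {
                    c[v][q] += b[u][v] * a[u][q];
                }
            }
        }
        return c;
    }

    /** Scores each video for a query vector h: r = (B^T A) h. */
    static int[] recommendVideos(int[][] bta, int[] h) {
        int[] r = new int[bta.length];
        for (int v = 0; v < bta.length; v++) {
            for (int q = 0; q < h.length; q++) {
                r[v] += bta[v][q] * h[q];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Made-up data: 3 users, 2 queries, 3 videos.
        int[][] a = {{1, 0}, {1, 0}, {0, 1}};            // users x queries
        int[][] b = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}};   // users x videos
        int[][] bta = crossOccurrence(b, a);
        System.out.println(Arrays.deepToString(bta));
        // Video scores for someone who has just issued query 1.
        System.out.println(Arrays.toString(recommendVideos(bta, new int[] {1, 0})));
    }
}
```

In practice the raw counts in B^T A would be filtered for anomalous cross-occurrence before use, just as with plain cooccurrence.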
Real-life example
 Query: “Paco de Lucia”
 Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
 Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Hypothetical Example
 Want a navigational ontology?
 Just put labels on a web page with traffic
– This gives A = users x label clicks
 Remember viewing history
– This gives B = users x items
 Cross recommend
– B^T A = label to item mapping
 After several users click, results are whatever users think they
should be
Nice. But we
can do better?
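 A: rows are users, columns are things
 [A1 A2]: rows are users; columns are things of type 1 (A1) followed by things of type 2 (A2)

\[
\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T
\begin{bmatrix} A_1 & A_2 \end{bmatrix}
=
\begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix}
\begin{bmatrix} A_1 & A_2 \end{bmatrix}
=
\begin{bmatrix}
A_1^T A_1 & A_1^T A_2 \\
A_2^T A_1 & A_2^T A_2
\end{bmatrix}
\]

\[
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
=
\begin{bmatrix}
A_1^T A_1 & A_1^T A_2 \\
A_2^T A_1 & A_2^T A_2
\end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]

\[
r_1 =
\begin{bmatrix} A_1^T A_1 & A_1^T A_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}
\]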
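Multiplying out the block product gives one way to read the result, with the same symbols as above: each recommendation vector is the sum of an ordinary cooccurrence term and a cross-occurrence term.

```latex
\begin{align*}
r_1 &= A_1^T A_1 \, h_1 + A_1^T A_2 \, h_2 \\
r_2 &= A_2^T A_1 \, h_1 + A_2^T A_2 \, h_2
\end{align*}
```

In the first line, the term with A_1^T A_1 uses history of the same kind as the items being recommended, while the term with A_1^T A_2 lets type 2 history contribute to type 1 recommendations.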
Summary
 Input: Multiple kinds of behavior on one set of things
 Output: Recommendations for one kind of behavior with a
different set of things
 Cross recommendation is a special case
Now again, without
the scary math
Input Data
 User transactions
– user id, merchant id
– SIC code, amount
– Descriptions, cuisine, …
 Offer transactions
– user id, offer id
– vendor id, merchant id’s,
– offers, views, accepts
 Derived user data
– merchant id’s
– anomalous descriptor terms
– offer & vendor id’s
 Derived merchant data
– local top40
– SIC code
– vendor code
– amount distribution
Cross-recommendation
 Per merchant indicators
– merchant id’s
– chain id’s
– SIC codes
– indicator terms from text
– offer vendor id’s
 Computed by finding anomalous (indicator => merchant) rates
How can we deploy
this?
Search-based Recommendations
 Sample document
Original data and meta-data:
– Merchant Id
– Field for text description
– Phone
– Address
– Location
Derived from cooccurrence and cross-occurrence analysis:
– Indicator merchant id’s
– Indicator industry (SIC) id’s
– Indicator offers
– Indicator text
– Local top40
 Sample query (the recommendation query)
– Current location
– Recent merchant descriptions
– Recent merchant id’s
– Recent SIC codes
– Recent accepted offers
– Local top40
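To show the shape of a search-based recommendation without a search engine, here is a deliberately naive in-memory sketch in plain Java. Everything in it is an assumption made for the illustration: the field names (`indicator_merchants`, `indicator_sic`), the ids, and the scoring, which just counts matching indicator ids. A real engine such as Solr would weight matches by term statistics and handle location and text fields, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy "documents with indicator fields" search; not Solr or Mahout code. */
final class IndicatorSearchSketch {

    /** A merchant document: an id plus indicator fields filled in by offline analysis. */
    static final class MerchantDoc {
        final String merchantId;
        final Map<String, Set<String>> indicators = new HashMap<>();   // field -> indicator ids

        MerchantDoc(String merchantId) {
            this.merchantId = merchantId;
        }

        MerchantDoc with(String field, String... ids) {
            indicators.put(field, new HashSet<>(Arrays.asList(ids)));
            return this;
        }
    }

    /** Counts indicator ids shared by the document and the query, field by field. */
    static int score(MerchantDoc doc, Map<String, Set<String>> query) {
        int hits = 0;
        for (Map.Entry<String, Set<String>> field : query.entrySet()) {
            Set<String> docValues =
                doc.indicators.getOrDefault(field.getKey(), Collections.<String>emptySet());
            for (String id : field.getValue()) {
                if (docValues.contains(id)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    /** Returns merchant ids by descending score, dropping documents with no hits. */
    static List<String> search(List<MerchantDoc> docs, Map<String, Set<String>> query) {
        List<MerchantDoc> ranked = new ArrayList<>();
        for (MerchantDoc d : docs) {
            if (score(d, query) > 0) {
                ranked.add(d);
            }
        }
        ranked.sort((x, y) -> Integer.compare(score(y, query), score(x, query)));
        List<String> ids = new ArrayList<>();
        for (MerchantDoc d : ranked) {
            ids.add(d.merchantId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<MerchantDoc> docs = Arrays.asList(
            new MerchantDoc("m1").with("indicator_merchants", "m7", "m9").with("indicator_sic", "5812"),
            new MerchantDoc("m2").with("indicator_merchants", "m9"),
            new MerchantDoc("m3").with("indicator_merchants", "m4"));
        // The query is the user's recent history, expressed in the same fields.
        Map<String, Set<String>> query = new HashMap<>();
        query.put("indicator_merchants", new HashSet<>(Arrays.asList("m7", "m9")));
        query.put("indicator_sic", new HashSet<>(Arrays.asList("5812")));
        System.out.println(search(docs, query));
    }
}
```

The design point carried over from the slides is that the user's recent behavior is the query and the offline-computed indicators are ordinary document fields, so recommendation becomes a plain search request.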
Analyze with Map-Reduce
[Diagram: complete history, cooccurrence analysis (Mahout), item meta-data, Solr indexers, Solr indexing, index shards]
Deploy with Conventional Search System
[Diagram: user history, web tier, Solr search, Solr indexers, item meta-data, index shards]
Objective Results
 At a very large credit card company
 History is all transactions
 Development time to minimal viable product about 4 months
 General release 2-3 months later
 Search-based recs at least equal in quality to other techniques
 Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org
 Slides and such
http://www.slideshare.net/tdunning
 Hash tags: #mapr #apachemahout #recommendations

Introduction to Mahout given at Twin Cities HUG

  • 1. 1©MapR Technologies 2013- Confidential Introduction to Mahout And How To Build a Recommender
  • 2. 2©MapR Technologies 2013- Confidential Me, Us  Ted Dunning, Chief Application Architect, MapR Committer PMC member, Mahout, Zookeeper, Drill Bought the beer at the first HUG  MapR Distributes more open source components for Hadoop Adds major technology for performance, HA, industry standard API’s  Tonight Hash tag - #tchug See also - @ApacheMahout @ApacheDrill @ted_dunning and @mapR
  • 3. 3©MapR Technologies 2013- Confidential Sidebar on Drill  Apache Drill – SQL on Hadoop (and other things) – Intended to solve problems for 1-5 years from now Not the problems from 1-10 years ago – Multiple levels of API supported • SQL-2003 • Logical plan language (DAG in JSON) • Physical plan language (DAG with push-down, exchange markers) • Execution plan language (many DAG’s)  Current state – SQL 2003 support in place – Logical plan interpreter useful for testing – Value vectors near completion – High performance RPC working
  • 4. 4©MapR Technologies 2013- Confidential More on Drill  Just completed OSCON workshop  Workshop materials available shortly – Extracted technology demonstrators – Sample queries  Send me email or tweet for more info
  • 5. 5©MapR Technologies 2013- Confidential What’s Up?  What is Mahout? – Math library – Clustering, classifiers, other stuff  Recommendation – Generalities – Algorithm Specifics – System Design – Important things never mentioned  Final thoughts
  • 6. 6©MapR Technologies 2013- Confidential What is Mahout?  “Scalable machine learning” – not just Hadoop-oriented machine learning – not entirely, that is. Just mostly.  Components – math library – clustering – classification – decompositions – recommendations
  • 7. 7©MapR Technologies 2013- Confidential What is Mahout?  “Scalable machine learning” – not just Hadoop-oriented machine learning – not entirely, that is. Just mostly.  Components – math library – clustering – classification – decompositions – recommendations
  • 8. 8©MapR Technologies 2013- Confidential Mahout Math
  • 9. 9©MapR Technologies 2013- Confidential Mahout Math  Goals are – basic linear algebra, – and statistical sampling, – and good clustering, – decent speed, – extensibility, – especially for sparse data  But not – totally badass speed – comprehensive set of algorithms – optimization, root finders, quadrature
  • 10. 10©MapR Technologies 2013- Confidential Matrices and Vectors  At the core: – DenseVector, RandomAccessSparseVector – DenseMatrix, SparseRowMatrix  Highly composable API  Important ideas: – view*, assign and aggregate – iteration m.viewDiagonal().assign(v)
  • 11. 11©MapR Technologies 2013- Confidential Assign? View?  Why assign? – Copying is the major cost for naïve matrix packages – In-place operations critical to reasonable performance – Many kinds of updates required, so functional style very helpful  Why view? – In-place operations often required for blocks, rows, columns or diagonals – With views, we need #assign + #views methods – Without views, we need #assign x #views methods  Synergies – With both views and assign, many loops become single line
  • 12. 12©MapR Technologies 2013- Confidential Assign  Matrices  Vectors Matrix assign(double value); Matrix assign(double[][] values); Matrix assign(Matrix other); Matrix assign(DoubleFunction f); Matrix assign(Matrix other, DoubleDoubleFunction f); Vector assign(double value); Vector assign(double[] values); Vector assign(Vector other); Vector assign(DoubleFunction f); Vector assign(Vector other, DoubleDoubleFunction f); Vector assign(DoubleDoubleFunction f, double y);
  • 13. 13©MapR Technologies 2013- Confidential Views  Matrices  Vectors Matrix viewPart(int[] offset, int[] size); Matrix viewPart(int row, int rlen, int col, int clen); Vector viewRow(int row); Vector viewColumn(int column); Vector viewDiagonal(); Vector viewPart(int offset, int length);
  • 14. 14©MapR Technologies 2013- Confidential Aggregates  Matrices  Vectors double zSum(); double aggregate( DoubleDoubleFunction reduce, DoubleFunction map); double aggregate(Vector other, DoubleDoubleFunction aggregator, DoubleDoubleFunction combiner); double zSum(); Vector aggregateRows(VectorFunction f); Vector aggregateColumns(VectorFunction f); double aggregate(DoubleDoubleFunction combiner, DoubleFunction mapper);
  • 15. 15©MapR Technologies 2013- Confidential Predefined Functions  Many handy functions ABS LOG2 ACOS NEGATE ASIN RINT ATAN SIGN CEIL SIN COS SQRT EXP SQUARE FLOOR SIGMOID IDENTITY SIGMOIDGRADIENT INV TAN LOGARITHM
  • 16. 16©MapR Technologies 2013- Confidential Examples double alpha; a.assign(alpha); a.assign(b, Functions.chain( Functions.plus(beta), Functions.times(alpha)); A =a A =aB+ b
  • 17. 17©MapR Technologies 2013- Confidential Sparse Optimizations  DoubleDoubleFunction abstract properties  And Vector properties public boolean isLikeRightPlus(); public boolean isLikeLeftMult(); public boolean isLikeRightMult(); public boolean isLikeMult(); public boolean isCommutative(); public boolean isAssociative(); public boolean isAssociativeAndCommutative(); public boolean isDensifying(); public boolean isDense(); public boolean isSequentialAccess(); public double getLookupCost(); public double getIteratorAdvanceCost(); public boolean isAddConstantTime();
  • 18. 18©MapR Technologies 2013- Confidential More Examples  The trace of a matrix  Set diagonal to zero  Set diagonal to negative of row sums
  • 19. 19©MapR Technologies 2013- Confidential Examples  The trace of a matrix  Set diagonal to zero  Set diagonal to negative of row sums m.viewDiagonal().zSum()
  • 20. 20©MapR Technologies 2013- Confidential Examples  The trace of a matrix  Set diagonal to zero  Set diagonal to negative of row sums m.viewDiagonal().zSum() m.viewDiagonal().assign(0)
  • 21. 21©MapR Technologies 2013- Confidential Examples
       The trace of a matrix
          m.viewDiagonal().zSum()
       Set diagonal to zero
          m.viewDiagonal().assign(0)
       Set diagonal to negative of row sums excluding the diagonal
          Vector diag = m.viewDiagonal().assign(0);
          diag.assign(m.rowSums().assign(Functions.NEGATE));
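The three diagonal operations above can also be written out longhand. The following is a minimal plain-Java sketch on `double[][]` arrays, assuming a square matrix; it deliberately avoids the Mahout API, and the class and method names (`DiagonalSketch`, `trace`, `zeroDiagonal`, `negativeRowSumDiagonal`) are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Plain-Java illustration of the three diagonal examples (not Mahout code).
class DiagonalSketch {
    // Trace: the sum of the diagonal entries of a square matrix.
    static double trace(double[][] m) {
        double sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Set every diagonal entry to zero, in place.
    static void zeroDiagonal(double[][] m) {
        for (int i = 0; i < m.length; i++) {
            m[i][i] = 0;
        }
    }

    // Set each diagonal entry to the negative of the sum of the
    // off-diagonal entries in its row. Zeroing the diagonal first means
    // the row sum excludes the old diagonal value.
    static void negativeRowSumDiagonal(double[][] m) {
        zeroDiagonal(m);
        for (int i = 0; i < m.length; i++) {
            double rowSum = 0;
            for (double x : m[i]) {
                rowSum += x;
            }
            m[i][i] = -rowSum;
        }
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        System.out.println("trace = " + trace(m));
        negativeRowSumDiagonal(m);
        System.out.println(Arrays.deepToString(m));
    }
}
```

After `negativeRowSumDiagonal`, every row of the matrix sums to zero, which is the property the third example is after.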
  • 22. 22©MapR Technologies 2013- Confidential Iteration  Matrices are Iterable in Mahout  Vectors are densely or sparsely iterable
      // compute both row and column sums in one pass
      for (MatrixSlice row: m) {
        rSums.set(row.index(), row.zSum());
        cSums.assign(row, Functions.PLUS);
      }
      // entropy is the negated sum of p * log(p) over the non-zero elements
      double entropy = 0;
      for (Vector.Element e: v.nonZeroes()) {
        entropy -= e.get() * Math.log(e.get());
      }
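The two iteration patterns above carry over directly to plain arrays. This is a hedged sketch without Mahout: `IterationSketch` and its methods are hypothetical names, the matrix is assumed rectangular, and zero entries are skipped explicitly where Mahout would iterate only over non-zeros.

```java
// Plain-Java illustration of the row/column-sum and entropy loops (not Mahout code).
class IterationSketch {
    // Compute row sums and column sums in a single pass over the rows.
    // Returns a two-element array: {rowSums, columnSums}.
    static double[][] rowAndColumnSums(double[][] m) {
        int cols = m.length == 0 ? 0 : m[0].length;
        double[] rSums = new double[m.length];
        double[] cSums = new double[cols];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < cols; j++) {
                rSums[i] += m[i][j];
                cSums[j] += m[i][j];
            }
        }
        return new double[][] {rSums, cSums};
    }

    // Shannon entropy (natural log) of a probability vector.
    // Zero entries contribute nothing, so they are skipped.
    static double entropy(double[] p) {
        double h = 0;
        for (double x : p) {
            if (x != 0) {
                h -= x * Math.log(x);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[][] sums = rowAndColumnSums(new double[][] {{1, 2}, {3, 4}});
        System.out.println(java.util.Arrays.toString(sums[0]));
        System.out.println(java.util.Arrays.toString(sums[1]));
        System.out.println(entropy(new double[] {0.5, 0.5, 0}));
    }
}
```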
  • 23. 23©MapR Technologies 2013- Confidential Random Sampling  Samples from some type  Lots of kinds ChineseRestaurant Missing Normal Empirical Multinomial PoissonSampler IndianBuffet MultiNormal Sampler public interface Sampler<T> { T sample(); } public abstract class AbstractSamplerFunction extends DoubleFunction implements Sampler<Double>
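To make the "samplers as functions" idea concrete, here is a small self-contained sketch. The `Sampler<T>` interface has the same shape as the one on the slide; everything else is an assumption for illustration: `NormalSamplerSketch` is a hypothetical class, and the standard `DoubleUnaryOperator` stands in for Mahout's `DoubleFunction`.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Same shape as the interface shown on the slide.
interface Sampler<T> {
    T sample();
}

// Hypothetical normal sampler that can also be used wherever a
// double -> double function is expected: it ignores its argument
// and returns a fresh sample each time.
class NormalSamplerSketch implements Sampler<Double>, DoubleUnaryOperator {
    private final double mean;
    private final double sd;
    private final Random random;

    NormalSamplerSketch(double mean, double sd, long seed) {
        this.mean = mean;
        this.sd = sd;
        this.random = new Random(seed);
    }

    @Override
    public Double sample() {
        return mean + sd * random.nextGaussian();
    }

    @Override
    public double applyAsDouble(double ignored) {
        return sample();
    }

    public static void main(String[] args) {
        NormalSamplerSketch s = new NormalSamplerSketch(0, 1, 1L);
        double[] v = new double[5];
        // Fill a vector with random values by applying the sampler as a function.
        for (int i = 0; i < v.length; i++) {
            v[i] = s.applyAsDouble(v[i]);
        }
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

Because the sampler is also a function, filling a vector with random values becomes a single function-application step, which is presumably why the slide's abstract class extends `DoubleFunction`.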
  • 24. 24©MapR Technologies 2013- Confidential Clustering and Such  Streaming k-means and ball k-means – streaming reduces very large data to a cluster sketch – ball k-means is a high quality k-means implementation – the cluster sketch is also usable for other applications – single machine threaded and map-reduce versions available  SVD and friends – stochastic SVD has in-memory, single machine out-of-core and map-reduce versions – good for reducing very large sparse matrices to tall skinny dense ones  Spectral clustering – based on SVD, allows massive dimensional clustering
  • 25. 25©MapR Technologies 2013- Confidential Mahout Math Summary  Matrices, Vectors – views – in-place assignment – aggregations – iterations  Functions – lots built-in – cooperate with sparse vector optimizations  Sampling – abstract samplers – samplers as functions  Other stuff … clustering, SVD
  • 26. 26©MapR Technologies 2013- Confidential Recommenders
  • 27. 27©MapR Technologies 2013- Confidential Recommendations  Often known as collaborative filtering  Actors interact with items – observe successful interaction  We want to suggest additional successful interactions  Observations inherently very sparse
  • 28. 28©MapR Technologies 2013- Confidential The Big Ideas  Cooccurrence is the core operation (and it is pretty simple)  Cooccurrence can be extended to handle important new capabilities  Recommendation systems can be deployed ideally using search technology
  • 29. 29©MapR Technologies 2013- Confidential Examples of Recommendations  Customers buying books (Linden et al)  Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)  Internet radio listeners not skipping songs (Musicmatch)  Internet video watchers watching >30 s (Veoh)  Visibility in a map UI (new Google maps)
  • 30. 30©MapR Technologies 2013- Confidential A simple recommendation architecture  Look at the history of interactions  Find significant item cooccurrence in user histories  Use these cooccurring items as “indicators”  For all indicators in user history, accumulate scores for related items
  • 31. 31©MapR Technologies 2013- Confidential Recommendation Basics  History:
      User  Thing
      1     3
      2     4
      3     4
      2     3
      3     2
      1     1
      2     1
  • 32. 32©MapR Technologies 2013- Confidential Recommendation Basics  History as matrix:
            t1  t2  t3  t4
      u1     1   0   1   0
      u2     1   0   1   1
      u3     0   1   0   1
       (t1, t3) cooccur 2 times,  (t1, t4) once,  (t2, t4) once,  (t3, t4) once
  • 33. 33©MapR Technologies 2013- Confidential A Quick Simplification  Users who do h: Ah  Also do r: Aᵀ(Ah)  User-centric recommendations: Aᵀ(Ah)  Item-centric recommendations: (AᵀA)h
  • 34. 34©MapR Technologies 2013- Confidential Recommendation Basics  Cooccurrence (AᵀA; the t3 diagonal entry is 2 because both u1 and u2 have t3)
            t1  t2  t3  t4
      t1     2   0   2   1
      t2     0   1   0   1
      t3     2   0   2   1
      t4     1   1   1   2
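The cooccurrence computation can be checked by hand with a few lines of code. The sketch below is plain Java with no Mahout dependency. It builds the user-by-thing matrix A from the (user, thing) history pairs shown earlier, forms AᵀA, and applies it to a history vector h to get item-centric scores (AᵀA)h. The class and method names are hypothetical, and dense `int` arrays are an assumption that only makes sense at toy scale.

```java
import java.util.Arrays;

// Toy-scale illustration of cooccurrence-based recommendation (not Mahout code).
class CooccurrenceSketch {
    // Build the binary users x things matrix from 1-based (user, thing) pairs.
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    // Cooccurrence matrix A^T A: entry (i, j) counts users who have both thing i and thing j.
    static int[][] cooccurrence(int[][] a) {
        int things = a.length == 0 ? 0 : a[0].length;
        int[][] c = new int[things][things];
        for (int[] userRow : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return c;
    }

    // Item-centric scores (A^T A) h for a user history vector h.
    static int[] recommend(int[][] cooc, int[] h) {
        int[] r = new int[cooc.length];
        for (int i = 0; i < cooc.length; i++) {
            for (int j = 0; j < h.length; j++) {
                r[i] += cooc[i][j] * h[j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] pairs = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};
        int[][] a = historyMatrix(pairs, 3, 4);
        int[][] c = cooccurrence(a);
        System.out.println(Arrays.deepToString(c));
        // Scores for a new user who has only interacted with t1.
        System.out.println(Arrays.toString(recommend(c, new int[] {1, 0, 0, 0})));
    }
}
```

Note that the raw scores include the items already in the history (here t1 itself); a real system would filter those out, and would threshold the raw counts as the following slides describe.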
  • 35. 35©MapR Technologies 2013- Confidential Problems with Raw Cooccurrence  Very popular items co-occur with everything – Welcome document – Elevator music  That isn’t interesting – We want anomalous cooccurrence
  • 36. 36©MapR Technologies 2013- Confidential Recommendation Basics  Cooccurrence
            t1  t2  t3  t4
      t1     2   0   2   1
      t2     0   1   0   1
      t3     2   0   2   1
      t4     1   1   1   2
     Contingency table for (t1, t3):
                t3  not t3
      t1         2     1
      not t1     1     1
  • 37. 37©MapR Technologies 2013- Confidential Spot the Anomaly  Root LLR is roughly like standard deviations
                A     not A
      B         13    1000
      not B     1000  100,000       root LLR 0.44

                A     not A
      B         1     0
      not B     0     2             root LLR 0.98

                A     not A
      B         1     0
      not B     0     10,000        root LLR 2.26

                A     not A
      B         10    0
      not B     0     100,000       root LLR 7.15
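A score of this kind can be computed from the four cells of a 2×2 contingency table. The sketch below uses the standard entropy form of the log-likelihood ratio (G²) test, LLR = 2 · (H(rows) + H(columns) − H(cells)) with unnormalized entropies, and takes its square root. The class and method names are hypothetical. The absolute scale depends on convention: the values printed on the slide appear to be about half of what this formula gives, so only the ordering of the four tables should be compared.

```java
// Log-likelihood ratio score for a 2x2 contingency table (illustrative sketch).
class LlrSketch {
    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(double x) {
        return x == 0 ? 0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts:
    // N ln N - sum(k ln k), where N is the total count.
    static double entropy(double... counts) {
        double total = 0;
        double parts = 0;
        for (double k : counts) {
            total += k;
            parts += xLogX(k);
        }
        return xLogX(total) - parts;
    }

    // k11: both A and B, k12: B without A, k21: A without B, k22: neither.
    static double llr(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double cellEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by round-off.
        return Math.max(0, 2 * (rowEntropy + colEntropy - cellEntropy));
    }

    static double rootLlr(double k11, double k12, double k21, double k22) {
        return Math.sqrt(llr(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        System.out.println(rootLlr(13, 1000, 1000, 100000));
        System.out.println(rootLlr(1, 0, 0, 2));
        System.out.println(rootLlr(1, 0, 0, 10000));
        System.out.println(rootLlr(10, 0, 0, 100000));
    }
}
```

The point of the four tables survives any rescaling: 13 cooccurrences among two very common items score lower than a single cooccurrence of two rare items, and ten cooccurrences of rare items score highest of all.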
  • 38. 39©MapR Technologies 2013- Confidential Threshold by Score  Cooccurrence
            t1  t2  t3  t4
      t1     2   0   2   1
      t2     0   1   0   1
      t3     2   0   2   1
      t4     1   1   1   2
  • 39. 40©MapR Technologies 2013- Confidential Threshold by Score  Significant cooccurrence => Indicators
            t1  t2  t3  t4
      t1     1   0   0   1
      t2     0   1   0   1
      t3     0   0   1   1
      t4     1   0   0   1
  • 40. 41©MapR Technologies 2013- Confidential So Far, So Good  Classic recommendation systems based on these approaches – Musicmatch (ca 2000) – Veoh Networks (ca 2005)  Currently available in Mahout – See RowSimilarityJob  Very simple to deploy – Compute indicators – Store in search engine – Works very well with enough data
  • 41. 42©MapR Technologies 2013- Confidential What’s right about this?
  • 42. 43©MapR Technologies 2013- Confidential Virtues of Current State of the Art  Lots of well publicized history – Musicmatch, Veoh, Netflix, Amazon, Overstock  Lots of support – Mahout, commercial offerings like Myrrix  Lots of existing code – Mahout, commercial codes  Proven track record  Well socialized solution
  • 43. 44©MapR Technologies 2013- Confidential What’s wrong about this?
  • 44. 45©MapR Technologies 2013- Confidential Problems for Recommenders  Cold start  Disjoint populations  Long tail  Multiple kinds of evidence (multi-modal recommendations) – unstructured add-on data – other transaction streams – textual descriptions
  • 45. 46©MapR Technologies 2013- Confidential What is this multi-modal stuff?  But people don’t just do one thing  One kind of behavior is useful for predicting other kinds  Having a complete picture is important for accuracy  What has the user said, viewed, clicked, closed, bought lately?
  • 46. 47©MapR Technologies 2013- Confidential Example Multi-modal Inputs  Overlap in restaurant visits is useful  Big spender cues  Cuisine as an indicator  Review text as an indicator
  • 47. 48©MapR Technologies 2013- Confidential Too Limited  People do more than one kind of thing  Different kinds of behaviors give different quality, quantity and kind of information  We don’t have to do co-occurrence  We can do cross-occurrence  Result is cross-recommendation
  • 48. 49©MapR Technologies 2013- Confidential Heh?
  • 49. 51©MapR Technologies 2013- Confidential For example  Users enter queries (A) – (actor = user, item=query)  Users view videos (B) – (actor = user, item=video)  AᵀA gives query recommendation – “did you mean to ask for”  BᵀB gives video recommendation – “you might like these videos”
  • 50. 52©MapR Technologies 2013- Confidential The punch-line  BᵀA recommends videos in response to a query – (isn’t that a search engine?) – (not quite, it doesn’t look at content or meta-data)
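Cross-occurrence is the same matrix product as cooccurrence, only with two different behavior matrices. The following plain-Java sketch uses made-up toy data (three users, two queries, three videos): A is users × queries, B is users × videos, and BᵀA is videos × queries. Multiplying BᵀA by a query-history vector scores videos for that query history. All names are hypothetical, and a production system would threshold the raw counts with an anomaly score rather than use them directly.

```java
import java.util.Arrays;

// Toy illustration of cross-occurrence B^T A (not Mahout code).
class CrossOccurrenceSketch {
    // Computes B^T A. Entry (v, q) counts users who both viewed video v
    // and entered query q. A and B must have the same number of rows (users).
    static int[][] crossOccurrence(int[][] b, int[][] a) {
        int videos = b.length == 0 ? 0 : b[0].length;
        int queries = a.length == 0 ? 0 : a[0].length;
        int[][] c = new int[videos][queries];
        for (int u = 0; u < b.length; u++) {
            for (int v = 0; v < videos; v++) {
                for (int q = 0; q < queries; q++) {
                    c[v][q] += b[u][v] * a[u][q];
                }
            }
        }
        return c;
    }

    // Scores videos for a query-history vector h: (B^T A) h.
    static int[] scoreVideos(int[][] cross, int[] h) {
        int[] r = new int[cross.length];
        for (int v = 0; v < cross.length; v++) {
            for (int q = 0; q < h.length; q++) {
                r[v] += cross[v][q] * h[q];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0}, {1, 0}, {0, 1}};          // users x queries (toy data)
        int[][] b = {{1, 1, 0}, {1, 0, 0}, {0, 0, 1}}; // users x videos (toy data)
        int[][] cross = crossOccurrence(b, a);
        System.out.println(Arrays.deepToString(cross));
        // A user who has just entered query 0:
        System.out.println(Arrays.toString(scoreVideos(cross, new int[] {1, 0})));
    }
}
```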
  • 51. 53©MapR Technologies 2013- Confidential Real-life example  Query: “Paco de Lucia”  Conventional meta-data search results: – “hombres del paco” times 400 – not much else  Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 52. 54©MapR Technologies 2013- Confidential Real-life example
  • 53. 55©MapR Technologies 2013- Confidential Hypothetical Example  Want a navigational ontology?  Just put labels on a web page with traffic – This gives A = users x label clicks  Remember viewing history – This gives B = users x items  Cross recommend – BᵀA = label to item mapping  After several users click, results are whatever users think they should be
  • 55. 57©MapR Technologies 2013- Confidential Nice. But can we do better?
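  • 56. 58©MapR Technologies 2013- Confidential A: rows are users, columns are things
  • 57. 59©MapR Technologies 2013- Confidential [A₁ A₂]: rows are users; column blocks are thing type 1 and thing type 2
  • 58. 60©MapR Technologies 2013- Confidential
      $$\begin{bmatrix} A_1 & A_2 \end{bmatrix}^T \begin{bmatrix} A_1 & A_2 \end{bmatrix} = \begin{bmatrix} A_1^T \\ A_2^T \end{bmatrix} \begin{bmatrix} A_1 & A_2 \end{bmatrix} = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix}$$
      $$\begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \\ A_2^T A_1 & A_2^T A_2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$$
      $$r_1 = \begin{bmatrix} A_1^T A_1 & A_1^T A_2 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$$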
  • 59. 61©MapR Technologies 2013- Confidential Summary  Input: Multiple kinds of behavior on one set of things  Output: Recommendations for one kind of behavior with a different set of things  Cross recommendation is a special case
  • 60. 62©MapR Technologies 2013- Confidential Now again, without the scary math
  • 61. 63©MapR Technologies 2013- Confidential Input Data  User transactions – user id, merchant id – SIC code, amount – Descriptions, cuisine, …  Offer transactions – user id, offer id – vendor id, merchant id’s, – offers, views, accepts
  • 62. 64©MapR Technologies 2013- Confidential Input Data  User transactions – user id, merchant id – SIC code, amount – Descriptions, cuisine, …  Offer transactions – user id, offer id – vendor id, merchant id’s, – offers, views, accepts  Derived user data – merchant id’s – anomalous descriptor terms – offer & vendor id’s  Derived merchant data – local top40 – SIC code – vendor code – amount distribution
  • 63. 65©MapR Technologies 2013- Confidential Cross-recommendation  Per merchant indicators – merchant id’s – chain id’s – SIC codes – indicator terms from text – offer vendor id’s  Computed by finding anomalous (indicator => merchant) rates
  • 64. 66©MapR Technologies 2013- Confidential How can we deploy this?
  • 65. 67©MapR Technologies 2013- Confidential Search-based Recommendations  Sample document – Merchant Id – Field for text description – Phone – Address – Location
  • 66. 68©MapR Technologies 2013- Confidential Search-based Recommendations  Sample document – Merchant Id – Field for text description – Phone – Address – Location – Indicator merchant id’s – Indicator industry (SIC) id’s – Indicator offers – Indicator text – Local top40
  • 67. 69©MapR Technologies 2013- Confidential Search-based Recommendations  Sample document – Merchant Id – Field for text description – Phone – Address – Location – Indicator merchant id’s – Indicator industry (SIC) id’s – Indicator offers – Indicator text – Local top40  Sample query – Current location – Recent merchant descriptions – Recent merchant id’s – Recent SIC codes – Recent accepted offers – Local top40
  • 68. 70©MapR Technologies 2013- Confidential Search-based Recommendations  Sample document – Merchant Id – Field for text description – Phone – Address – Location – Indicator merchant id’s – Indicator industry (SIC) id’s – Indicator offers – Indicator text – Local top40  Sample query – Current location – Recent merchant descriptions – Recent merchant id’s – Recent SIC codes – Recent accepted offers – Local top40 Original data and meta-data Derived from cooccurrence and cross-occurrence analysis Recommendation query
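The document-and-query pairing above amounts to this: each item is indexed with its indicators, and the user's recent history is the query. The toy sketch below mimics that with in-memory maps. It is only an illustration under stated assumptions: a real deployment would use a search engine with its own relevance weighting, whereas this sketch simply counts matching indicators, and all names (`IndicatorSearchSketch`, `recommend`, the merchant ids) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy stand-in for "store indicators in a search engine, query with user history".
class IndicatorSearchSketch {
    // indicatorsByItem: item id -> indicator ids stored in that item's document.
    // recentHistory: the user's recent item ids, used as the query.
    // Returns up to 'limit' item ids, ranked by the number of matching
    // indicators, with ties broken by id.
    static List<String> recommend(Map<String, Set<String>> indicatorsByItem,
                                  Set<String> recentHistory, int limit) {
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> doc : indicatorsByItem.entrySet()) {
            int hits = 0;
            for (String indicator : doc.getValue()) {
                if (recentHistory.contains(indicator)) {
                    hits++;
                }
            }
            if (hits > 0) {
                scores.put(doc.getKey(), hits);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> {
            int byScore = Integer.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("merchant-1", Set.of("merchant-2", "merchant-3"));
        index.put("merchant-2", Set.of("merchant-3"));
        index.put("merchant-4", Set.of("merchant-9"));
        System.out.println(recommend(index, Set.of("merchant-2", "merchant-3"), 10));
    }
}
```

The design point is that no recommendation-specific serving code is needed: once indicators are stored as ordinary fields, ranking falls out of ordinary text retrieval, and other fields such as location can be combined in the same query.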
  • 69. 71©MapR Technologies 2013- Confidential Analyze with Map-Reduce [Diagram labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexing (two Solr indexers), index shards]
  • 70. 72©MapR Technologies 2013- Confidential Deploy with Conventional Search System [Diagram labels: user history, web tier, Solr search (two Solr indexers), item meta-data, index shards]
  • 71. 73©MapR Technologies 2013- Confidential Objective Results  At a very large credit card company  History is all transactions  Development time to minimal viable product about 4 months  General release 2-3 months later  Search-based recs at or equal in quality to other techniques
  • 72. 74©MapR Technologies 2013- Confidential  Contact: – tdunning@maprtech.com – @ted_dunning – @apachemahout – user-subscribe@mahout.apache.org  Slides and such http://www.slideshare.net/tdunning  Hash tags: #mapr #apachemahout #recommendations