Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill
(we’re hiring)
• Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more
complex than these
• These are harder than they look
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
bought y”
• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
(Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
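To make the matrix view concrete, here is a minimal Python sketch (toy data, invented names) that turns an interaction log into the actor-by-item count matrix A. It is illustrative only, not Mahout code.

import numpy as np

# Toy interaction log: (actor, item) pairs, e.g. (user, purchased book). All synthetic.
log = [("alice", "book1"), ("alice", "book2"),
       ("bob", "book2"), ("bob", "book3"),
       ("carol", "book1"), ("carol", "book3"), ("carol", "book3")]

# Index the actors and items that appear in the log.
users = {u: i for i, u in enumerate(sorted({u for u, _ in log}))}
items = {v: j for j, v in enumerate(sorted({v for _, v in log}))}

# A[u, i] = count of interactions between user u and item i
# (at real scale this would be a sparse matrix, not a dense array).
A = np.zeros((len(users), len(items)))
for u, v in log:
    A[users[u], items[v]] += 1
print(A)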
Recommendations Analysis
• R(x,y) = # people who bought x also bought y
select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
Recommendations Analysis
R_ij = Σ_u A_ui B_uj, i.e. R = AᵀB
Fundamental Algorithmic Structure
• Cooccurrence
• Matrix approximation by factoring
• LLR
K = AᵀA
A ≈ U S Vᵀ
K ≈ V S² Vᵀ
r = V S² Vᵀ h
r = sparsify(AᵀA) h
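The last line, r = sparsify(AᵀA) h, is the heart of the item-based approach: count cooccurrences, keep only the anomalously frequent pairs (scored with the log-likelihood ratio, LLR), and multiply by the user's history h. The sketch below is a simplified, self-contained illustration of that pipeline on synthetic data; the dense loops and the fixed LLR cutoff of 2.0 are my own simplifications, not Mahout's implementation.

import numpy as np

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table."""
    k = np.array([[k11, k12], [k21, k22]], dtype=float)
    total = k.sum()
    expected = np.outer(k.sum(axis=1), k.sum(axis=0)) / total
    terms = np.where(k > 0, k * np.log(np.where(k > 0, k, 1.0) / expected), 0.0)
    return 2.0 * float(terms.sum())

# Toy 0/1 user x item interaction matrix (synthetic stand-in for real data).
rng = np.random.default_rng(0)
n_users, n_items = 200, 12
A = (rng.random((n_users, n_items)) < 0.15).astype(float)

# Cooccurrence: K[i, j] = number of users who interacted with both items i and j.
K = A.T @ A
item_counts = np.diag(K)

# Sparsify: keep only item pairs whose cooccurrence is anomalous under LLR.
indicator = np.zeros_like(K)
for i in range(n_items):
    for j in range(n_items):
        if i == j:
            continue
        k11 = K[i, j]                           # both i and j
        k12 = item_counts[i] - k11              # i but not j
        k21 = item_counts[j] - k11              # j but not i
        k22 = n_users - k11 - k12 - k21         # neither
        if llr_2x2(k11, k12, k21, k22) > 2.0:   # threshold chosen arbitrarily here
            indicator[i, j] = 1.0

# Recommend for one user: r = sparsify(A^T A) h, where h is the user's history vector.
h = A[0]
r = indicator @ h
print("top items for user 0:", np.argsort(-r)[:3])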
But Wait!
• Cooccurrence: K = AᵀA
• Cross occurrence: K = BᵀA
For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”
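A hypothetical sketch of cross-occurrence (all data here is synthetic): A holds user-by-query indicators, B holds user-by-video indicators, and a column of BᵀA scores videos for a given query. A real system would sparsify with LLR exactly as for AᵀA; that step is omitted here for brevity.

import numpy as np

rng = np.random.default_rng(1)
n_users, n_queries, n_videos = 500, 8, 20

# A: users x queries, B: users x videos (0/1 indicators, synthetic).
A = (rng.random((n_users, n_queries)) < 0.1).astype(float)
B = (rng.random((n_users, n_videos)) < 0.1).astype(float)

# Cross-occurrence: K[v, q] = number of users who watched video v and issued query q.
K = B.T @ A

query = 3                              # hypothetical query id
scores = K[:, query]
top_videos = np.argsort(-scores)[:5]
print("videos to show for query", query, ":", top_videos)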
The punch-line
• B’A recommends videos in response to a
query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or metadata)
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Real-life example
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever
users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering is not a goal
– we don’t care whether exactly the same clustering is replicated
• Generalization is critical
• Agreement with a “gold standard” is a non-issue
An Example
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Clusters as Distribution Surrogate
THEORY
For Example
Grouping these two clusters seriously hurts squared distance:
D₄²(X) > (1/σ²) D₅²(X)
ALGORITHMS
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s algorithm
The result is that these two clusters get glued together
Ball k-means
• Provably better for highly clusterable data
• Tries to find the initial centroids in the “core” of each real cluster
• Avoids outliers in the centroid computation
initialize centroids randomly, with a distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign the point to the nearest cluster
    recompute each centroid using only the points much closer to it than to the next-closest cluster
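Below is a compact Python sketch of the recipe on this slide. It is an illustration, not Mahout's BallKMeans: the seeding uses a k-means++-style distance-weighted draw, and the “much closer” trimming rule is approximated by keeping only points within half the distance to the nearest other centroid.

import numpy as np

def ball_kmeans(X, k, iterations=3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Seed with a k-means++-style, distance-maximizing initialization.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([np.square(X - c).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)

        # Recompute each centroid using only points well inside its "ball":
        # closer to this centroid than half the distance to the nearest other centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            mask = assign == j
            if not mask.any():
                continue
            others = np.delete(np.arange(k), j)
            cutoff = 0.5 * np.sqrt(d2[mask][:, others].min(axis=1))
            core = mask.copy()
            core[mask] = np.sqrt(d2[mask, j]) <= cutoff
            if core.any():
                new_centroids[j] = X[core].mean(axis=0)
        centroids = new_centroids
    return centroids

# Tiny synthetic demo: three well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [5, 5], [0, 5])])
print(ball_kmeans(X, k=3))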
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time
• But for big data, k gets large
Surrogate Method
• Start with a sloppy clustering into lots of clusters: κ = k log n of them
• Use this sketch as a weighted surrogate for the data
• Results are provably good for highly clusterable data
Algorithm Costs
• Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of the κ weighted centroids:
O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 10⁸ (so log₂ n ≈ 26)
– k d log n = 2000 × 10 × 26 ≈ 500,000
– d (log k + log log n) ≈ 10 × (11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
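For the record, the quoted ratio is just the two per-point costs divided out (same numbers as the slide, logs in base 2):

import math

# Per-point cost of brute-force assignment vs. the sketch-based search.
k, d, n = 2000, 10, 100_000_000
brute = k * d * math.log2(n)
sketch = d * (math.log2(k) + math.log2(math.log2(n)))
print(round(brute), round(sketch), round(brute / sketch))   # roughly 531508 157 3386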
How It Works
• For each point
– Find the approximately nearest centroid (call the distance d)
– If d > threshold, make the point a new centroid
– Else, with probability d/threshold (a uniform random draw u < d/threshold), make it a new centroid anyway
– Else add it to the nearest centroid
• If the number of centroids exceeds κ ≈ C log N
– Recursively cluster the centroids with a higher threshold
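A self-contained sketch of this streaming pass (simplified relative to Mahout's StreamingKMeans): u is treated as a uniform random draw, the initial threshold and its growth factor are arbitrary choices of mine, and the recursive re-clustering step is approximated by a single merge pass over the centroids.

import numpy as np

def streaming_sketch(X, k, seed=0):
    """One-pass sketch clustering: returns weighted centroids (the surrogate)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    kappa = max(1, int(k * np.log(n)))   # target sketch size: kappa = k log n
    threshold = 1e-3                      # initial distance threshold (assumption)

    centroids = [X[0].copy()]
    weights = [1.0]

    for x in X[1:]:
        C = np.array(centroids)
        d2 = ((C - x) ** 2).sum(axis=1)
        j = int(d2.argmin())
        d = float(np.sqrt(d2[j]))

        # New centroid if far away, or (with probability ~ d/threshold) even if not.
        if d > threshold or rng.random() < d / threshold:
            centroids.append(x.copy())
            weights.append(1.0)
        else:
            # Fold the point into the nearest centroid (weighted mean update).
            w = weights[j]
            centroids[j] = (w * centroids[j] + x) / (w + 1.0)
            weights[j] = w + 1.0

        # Too many centroids: collapse the sketch with a larger threshold
        # (a single merge pass standing in for the recursive step on the slide).
        if len(centroids) > kappa:
            threshold *= 1.5
            keep_c, keep_w = [centroids[0]], [weights[0]]
            for c, w in zip(centroids[1:], weights[1:]):
                d2 = ((np.array(keep_c) - c) ** 2).sum(axis=1)
                j = int(d2.argmin())
                if np.sqrt(d2[j]) > threshold:
                    keep_c.append(c)
                    keep_w.append(w)
                else:
                    total = keep_w[j] + w
                    keep_c[j] = (keep_w[j] * keep_c[j] + w * c) / total
                    keep_w[j] = total
            centroids, weights = list(keep_c), list(keep_w)

    return np.array(centroids), np.array(weights)

# Demo on synthetic blobs; the weighted centroids can then be fed to ball k-means.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.2, size=(300, 2)) for m in ([0, 0], [4, 0], [0, 4])])
sketch, w = streaming_sketch(X, k=3)
print(len(sketch), "weighted centroids in the sketch")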
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop
• This could take O(dκ) per point, and κ can be big
• Happily, approximate nearest centroid works
fine
Projection Search
(figure; annotation: “total ordering!”)
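The projection-search idea behind the figure: project the centroids onto a few random directions, each of which gives a total ordering, and only check exactly the centroids that land near the query point in one of those orderings. A minimal sketch follows; the class name, candidate width, and number of projections are my own choices, not Mahout's.

import bisect
import numpy as np

class ProjectionSearch:
    """Approximate nearest-centroid search via a few random 1-D projections."""

    def __init__(self, centroids, n_projections=4, search_width=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = np.asarray(centroids)
        d = self.centroids.shape[1]
        # Random unit directions; each gives a total ordering of the centroids.
        self.directions = rng.normal(size=(n_projections, d))
        self.directions /= np.linalg.norm(self.directions, axis=1, keepdims=True)
        self.width = search_width
        # For each direction, keep centroid projections sorted with their indices.
        self.sorted_proj = []
        for u in self.directions:
            proj = self.centroids @ u
            order = np.argsort(proj)
            self.sorted_proj.append((proj[order], order))

    def nearest(self, x):
        # Collect candidates near x's projection in each ordering, then check exactly.
        candidates = set()
        for u, (proj, order) in zip(self.directions, self.sorted_proj):
            pos = bisect.bisect_left(proj.tolist(), float(x @ u))
            lo = max(0, pos - self.width)
            hi = min(len(order), pos + self.width)
            candidates.update(order[lo:hi].tolist())
        cand = np.array(sorted(candidates))
        d2 = ((self.centroids[cand] - x) ** 2).sum(axis=1)
        return int(cand[d2.argmin()])

# Demo: compare against the exact nearest centroid on random data.
rng = np.random.default_rng(1)
C = rng.normal(size=(1000, 10))        # 1000 centroids in 10 dimensions
ps = ProjectionSearch(C)
x = rng.normal(size=10)
exact = int(((C - x) ** 2).sum(axis=1).argmin())
print("approximate:", ps.nearest(x), "exact:", exact)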
LSH Bit-match Versus Cosine
(figure: actual cosine similarity, from −1 to 1, plotted against the number of matching bits in a 64-bit LSH signature, from 0 to 64)
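The figure plots the actual cosine between vectors against how many of their 64 LSH (sign-of-random-projection) bits match. A small simulation of that relationship, using the standard estimate cos(π(1 − matches/64)); the data and parameters here are synthetic.

import numpy as np

rng = np.random.default_rng(0)
d, n_bits, n_pairs = 20, 64, 1000

# Random hyperplanes define a 64-bit sign-projection signature per vector.
planes = rng.normal(size=(n_bits, d))

def signature(v):
    return (planes @ v) > 0

cosines, matches = [], []
for _ in range(n_pairs):
    a, b = rng.normal(size=d), rng.normal(size=d)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    m = int((signature(a) == signature(b)).sum())
    cosines.append(cos)
    matches.append(m)

# Expected relation: cos(theta) with theta = pi * (1 - matches / n_bits).
pred = np.cos(np.pi * (1 - np.array(matches) / n_bits))
err = np.abs(pred - np.array(cosines))
print("mean |predicted - actual cosine|:", float(err.mean()))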
RESULTS
Parallel Speedup?
(figure: time per point in μs versus number of threads, 1–16, comparing the threaded and non-threaded versions against perfect scaling)
Quality
• Ball k-means implementation appears significantly
better than simple k-means
• Streaming k-means + ball k-means appears to be about
as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance
to nearest cluster
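The figure of merit is easy to compute once you have centroids and held-out points; a generic sketch (not the code used for the 20 newsgroups experiment):

import numpy as np

def clustering_cost(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    d2 = ((held_out[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(d2.mean()), float(np.median(d2))

# Toy usage with synthetic data standing in for the real document vectors.
rng = np.random.default_rng(3)
centroids = rng.normal(size=(20, 50))
held_out = rng.normal(size=(500, 50))
print(clustering_cost(held_out, centroids))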
Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout