Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
  – particularly Mahout, ZooKeeper and Drill

     (we’re hiring)

• Contact me at
 tdunning@maprtech.com
 tdunning@apache.org
 ted.dunning@gmail.com
 @ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
What Works at Scale
• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more
  complex than these
What Works at Scale
• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more
  complex than these
• These are harder than they look
Recommendations
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
  bought y”

• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
  Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
  (Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
  – Interaction: actor -> item*
• Relational
  – Interaction ⊆ Actors x Items
• Matrix
  – Rows indexed by actor, columns by item
  – Value is count of interactions
• Predict missing observations
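
To make the dyadic structure concrete, here is a minimal sketch (not from the original deck) that builds the actor-by-item count matrix from a toy interaction log; the log contents and names are illustrative assumptions.

import numpy as np

# Hypothetical interaction log of (actor, item) pairs, e.g. (user, purchased book).
log = [("alice", "book1"), ("alice", "book2"),
       ("bob", "book2"), ("bob", "book3"),
       ("carol", "book2")]

actors = sorted({a for a, _ in log})
items = sorted({i for _, i in log})
actor_idx = {a: r for r, a in enumerate(actors)}
item_idx = {i: c for c, i in enumerate(items)}

# A[actor, item] = count of interactions; predicting the missing (zero) cells
# is the recommendation problem.
A = np.zeros((len(actors), len(items)))
for a, i in log:
    A[actor_idx[a], item_idx[i]] += 1
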
Recommendations Analysis
• R(x,y) = # people who bought x also bought y

select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
Recommendations Analysis


   $R_{ij} = \sum_u A_{ui} B_{uj}$

   $R = A^\top B$
Fundamental Algorithmic Structure
• Cooccurrence
    $K = A^\top A$
• Matrix approximation by factoring
    $A \approx U S V^\top$
    $K \approx V S^2 V^\top$
    $r = V S^2 V^\top h$
• LLR
    $r = \mathrm{sparsify}(A^\top A)\, h$
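
As a concrete illustration of this structure, here is a minimal numpy sketch (an illustration only, not Mahout's implementation): it forms the cooccurrence matrix K = AᵀA, keeps only anomalously large cells using the G² log-likelihood ratio (LLR) test, and scores items against one user's history h. The toy matrices and the cutoff value are assumptions.

import numpy as np
from math import log

def llr(k11, k12, k21, k22):
    # G² log-likelihood ratio for a 2x2 cooccurrence contingency table.
    def h(*counts):
        total = sum(counts)
        return sum(c * log(c / total) for c in counts if c > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def sparsify(A, cutoff):
    """Keep only item-item cooccurrences whose LLR passes the cutoff."""
    B = (A > 0).astype(float)          # binarize interactions
    K = B.T @ B                        # item-item cooccurrence counts, K = A'A
    counts = B.sum(axis=0)             # number of users per item
    n = B.shape[0]                     # number of users
    S = np.zeros_like(K)
    for i in range(K.shape[0]):
        for j in range(K.shape[1]):
            if i == j or K[i, j] == 0:
                continue
            k11 = K[i, j]                          # users with both i and j
            k12 = counts[i] - k11                  # i but not j
            k21 = counts[j] - k11                  # j but not i
            k22 = n - counts[i] - counts[j] + k11  # neither
            if llr(k11, k12, k21, k22) > cutoff:
                S[i, j] = 1.0          # indicator of an anomalously strong link
    return S

# Toy user-by-item matrix and one user's history; the cutoff is arbitrary.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
h = np.array([1, 0, 0, 0], dtype=float)
r = sparsify(A, cutoff=1.0) @ h        # r = sparsify(A'A) h

The dense double loop is purely illustrative; at scale the same structure is computed with sparse matrices and parallel jobs.
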
But Wait!
• Cooccurrence
    $K = A^\top A$
• Cross occurrence
    $K = B^\top A$
For example
• Users enter queries (A)
  – (actor = user, item=query)
• Users view videos (B)
  – (actor = user, item=video)
• A’A gives query recommendation
  – “did you mean to ask for”
• B’B gives video recommendation
  – “you might like these videos”
The punch-line
• B’A recommends videos in response to a
  query
  – (isn’t that a search engine?)
  – (not quite, it doesn’t look at content or meta-data)
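
A tiny hypothetical sketch of this cross recommendation, in the same style as the cooccurrence sketch above (in practice BᵀA would be LLR-sparsified in the same way):

import numpy as np

# Hypothetical data: A = users x queries, B = users x videos, same user rows.
A = np.array([[1, 0],      # user 0 issued query 0
              [1, 0],      # user 1 issued query 0
              [0, 1]],     # user 2 issued query 1
             dtype=float)
B = np.array([[1, 0, 0],   # user 0 watched video 0
              [1, 1, 0],   # user 1 watched videos 0 and 1
              [0, 0, 1]],  # user 2 watched video 2
             dtype=float)

K = B.T @ A                          # cross occurrence: videos x queries, K = B'A

q = np.array([1, 0], dtype=float)    # a session that issued query 0
video_scores = K @ q                 # -> [2, 1, 0]: videos 0 and 1 get recommended
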
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres del paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff
Real-life example
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
  – This gives A = users x label clicks
• Remember viewing history
  – This gives B = users x items
• Cross recommend
  – B’A = label to item mapping
• After several users click, results are whatever
  users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue
An Example
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Clusters as Distribution Surrogate
THEORY
For Example
   $D_4^2(X) > \frac{1}{\sigma^2} D_5^2(X)$

   (squared-distance cost with the best 4 clusters versus the best 5 clusters)

   Grouping these two clusters seriously hurts squared distance
ALGORITHMS
Typical k-means Failure

  Selecting two seeds here cannot be fixed with Lloyd's algorithm.

  Result is that these two clusters get glued together.
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

  initialize centroids randomly with a distance-maximizing tendency
  for each of a very few iterations:
      for each data point:
          assign the point to the nearest centroid
      recompute each centroid using only the points much closer to it
      than to the next-closest centroid
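
Here is a minimal, unoptimized Python sketch of that pseudocode (an illustration, not the Mahout implementation); the k-means++-style seeding, the trim factor, and the iteration count are assumptions.

import numpy as np

def ball_kmeans(X, k, iterations=3, trim=0.5, seed=0):
    """Sketch of ball k-means: ordinary assignment, but each centroid is
    recomputed from only the points much closer to it than to the
    next-closest centroid (the "ball"), which suppresses outliers."""
    rng = np.random.default_rng(seed)
    # Seed with a distance-maximizing tendency (k-means++ style sampling).
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(iterations):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        order = np.argsort(d, axis=1)
        nearest, second = order[:, 0], order[:, 1]
        rows = np.arange(len(X))
        # Keep only points well inside the ball of their nearest centroid.
        in_ball = d[rows, nearest] < trim * d[rows, second]
        for j in range(k):
            pts = X[(nearest == j) & in_ball]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids

# Three well-separated Gaussian blobs; ball k-means recovers the three centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(40, 2)) for m in ([0, 0], [4, 0], [0, 4])])
print(ball_kmeans(X, k=3))
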
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k³d) time
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k³d) time

• But for big data, k gets large
Surrogate Method
• Start with a sloppy clustering into lots of clusters,
     κ = k log n
• Use this sketch as a weighted surrogate for the
  data
• Results are provably good for highly
  clusterable data
Algorithm Costs
• Surrogate methods
  – fast, sloppy single-pass clustering with κ = k log n
  – fast, sloppy search for the nearest cluster:
     O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of the κ weighted centroids:
     O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
     O(κ d log k) = O(d log k (log k + log log n)) for larger k, looser quality
  – result is k high-quality centroids
     • For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 × 10 × 26 ≈ 500,000
  – d (log k + log log n) = 10 × (11 + 5) = 160
  – 3,000 times faster is a bona fide big deal
How It Works
• For each point
  – find the approximately nearest centroid (distance = d)
  – if d > threshold, start a new centroid
  – else start a new centroid with probability d / threshold
  – otherwise add the point to the nearest centroid
• If the number of centroids exceeds κ ≈ C log N
  – recursively cluster the centroids with a higher threshold
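
A single-pass Python sketch of this loop (an illustration of the idea, not Mahout's StreamingKMeans); the threshold growth factor, the probabilistic rule written as a probability, and the toy data are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sketch_cluster(points, weights, kappa, threshold):
    """Single-pass sketch clustering: far-away points probabilistically become
    new centroids; when the centroid count passes kappa, the centroids are
    themselves re-clustered with a larger threshold."""
    cents = [np.array(points[0], dtype=float)]
    wts = [float(weights[0])]
    for x, w in zip(points[1:], weights[1:]):
        d = np.linalg.norm(np.array(cents) - x, axis=1)
        j = int(np.argmin(d))                    # (approximately) nearest centroid
        if rng.random() < min(1.0, d[j] / threshold):
            cents.append(np.array(x, dtype=float))   # far away: start a new centroid
            wts.append(float(w))
        else:                                    # fold into the nearest centroid
            cents[j] = (wts[j] * cents[j] + w * x) / (wts[j] + w)
            wts[j] += w
        if len(cents) > kappa:                   # collapse: recurse on the centroids
            threshold *= 1.5
            cents, wts, threshold = sketch_cluster(np.array(cents), np.array(wts),
                                                   kappa, threshold)
    return cents, wts, threshold

# The roughly κ = k log n weighted centroids then act as a surrogate for the data.
X = rng.normal(size=(2000, 5))
centroids, weights, _ = sketch_cluster(X, np.ones(len(X)), kappa=200, threshold=1.0)
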
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop

• This could take O( d κ ) per point and κ can be
  big

• Happily, approximate nearest centroid works
  fine
Projection Search
               total ordering!
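
A minimal sketch of projection search for the approximate nearest-centroid lookup; the single random projection and the fixed candidate window are simplifying assumptions (real implementations typically combine several projections).

import numpy as np

class ProjectionSearch:
    """Approximate nearest-centroid search: project everything onto a random
    direction to get a total ordering, then examine only centroids that land
    near the query in that ordering."""

    def __init__(self, centroids, window=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = np.asarray(centroids, dtype=float)
        u = rng.normal(size=self.centroids.shape[1])
        self.u = u / np.linalg.norm(u)
        self.order = np.argsort(self.centroids @ self.u)   # total ordering along u
        self.keys = (self.centroids @ self.u)[self.order]  # sorted projection values
        self.window = window

    def nearest(self, x):
        # Locate x in the ordering, then scan a small window of candidates.
        pos = int(np.searchsorted(self.keys, x @ self.u))
        lo, hi = max(0, pos - self.window), min(len(self.keys), pos + self.window)
        candidates = self.order[lo:hi]
        d = np.linalg.norm(self.centroids[candidates] - x, axis=1)
        return int(candidates[np.argmin(d)])

# Usage: find the (approximately) nearest of 1,000 centroids in 10 dimensions.
cents = np.random.default_rng(1).normal(size=(1000, 10))
print(ProjectionSearch(cents).nearest(np.zeros(10)))
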
LSH Bit-match Versus Cosine

[Scatter plot: cosine similarity (y axis, −1 to 1) versus number of matching LSH bits (x axis, 0 to 64)]
RESULTS
Parallel Speedup?
[Chart: time per point (μs) versus number of threads (1–20), comparing the threaded and non-threaded versions against perfect scaling]
Quality
• Ball k-means implementation appears significantly
  better than simple k-means

• Streaming k-means + ball k-means appears to be about
  as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance
  to nearest cluster
Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at tdunning@maprtech.com or @ted_dunning


• Share news with @apachemahout
