Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill
(we’re hiring)
• Contact me at
tdunning@maprtech.com
tdunning@apache.org
ted.dunning@gmail.com
@ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more
complex than these
• These are harder than they look
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
bought y”
• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
(Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
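To make the matrix view concrete, here is a minimal Python sketch (toy data, invented names) that turns an interaction log into the actor-by-item count matrix A. It is illustrative only, not Mahout code.

import numpy as np

# Toy interaction log: (actor, item) pairs, e.g. (user, purchased book). All synthetic.
log = [("alice", "book1"), ("alice", "book2"),
       ("bob", "book2"), ("bob", "book3"),
       ("carol", "book1"), ("carol", "book3"), ("carol", "book3")]

# Index the actors and items that appear in the log.
users = {u: i for i, u in enumerate(sorted({u for u, _ in log}))}
items = {v: j for j, v in enumerate(sorted({v for _, v in log}))}

# A[u, i] = count of interactions between user u and item i
# (at real scale this would be a sparse matrix, not a dense array).
A = np.zeros((len(users), len(items)))
for u, v in log:
    A[users[u], items[v]] += 1
print(A)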
Recommendations Analysis
• R(x,y) = # people who bought x also bought y
select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
Recommendations Analysis
R_ij = Σ_u A_ui B_uj, i.e. R = AᵀB
Fundamental Algorithmic Structure
• Cooccurrence
• Matrix approximation by factoring
• LLR
K = AᵀA
A ≈ U S Vᵀ
K ≈ V S² Vᵀ
r = V S² Vᵀ h
r = sparsify(AᵀA) h
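The last line, r = sparsify(AᵀA) h, is the heart of the item-based approach: count cooccurrences, keep only the anomalously frequent pairs (scored with the log-likelihood ratio, LLR), and multiply by the user's history h. The sketch below is a simplified, self-contained illustration of that pipeline on synthetic data; the dense loops and the fixed LLR cutoff of 2.0 are my own simplifications, not Mahout's implementation.

import numpy as np

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table."""
    k = np.array([[k11, k12], [k21, k22]], dtype=float)
    total = k.sum()
    expected = np.outer(k.sum(axis=1), k.sum(axis=0)) / total
    terms = np.where(k > 0, k * np.log(np.where(k > 0, k, 1.0) / expected), 0.0)
    return 2.0 * float(terms.sum())

# Toy 0/1 user x item interaction matrix (synthetic stand-in for real data).
rng = np.random.default_rng(0)
n_users, n_items = 200, 12
A = (rng.random((n_users, n_items)) < 0.15).astype(float)

# Cooccurrence: K[i, j] = number of users who interacted with both items i and j.
K = A.T @ A
item_counts = np.diag(K)

# Sparsify: keep only item pairs whose cooccurrence is anomalous under LLR.
indicator = np.zeros_like(K)
for i in range(n_items):
    for j in range(n_items):
        if i == j:
            continue
        k11 = K[i, j]                           # both i and j
        k12 = item_counts[i] - k11              # i but not j
        k21 = item_counts[j] - k11              # j but not i
        k22 = n_users - k11 - k12 - k21         # neither
        if llr_2x2(k11, k12, k21, k22) > 2.0:   # threshold chosen arbitrarily here
            indicator[i, j] = 1.0

# Recommend for one user: r = sparsify(A^T A) h, where h is the user's history vector.
h = A[0]
r = indicator @ h
print("top items for user 0:", np.argsort(-r)[:3])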
But Wait!
• Cooccurrence: K = AᵀA
• Cross occurrence: K = BᵀA
For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”
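A hypothetical sketch of cross-occurrence (all data here is synthetic): A holds user-by-query indicators, B holds user-by-video indicators, and a column of BᵀA scores videos for a given query. A real system would sparsify with LLR exactly as for AᵀA; that step is omitted here for brevity.

import numpy as np

rng = np.random.default_rng(1)
n_users, n_queries, n_videos = 500, 8, 20

# A: users x queries, B: users x videos (0/1 indicators, synthetic).
A = (rng.random((n_users, n_queries)) < 0.1).astype(float)
B = (rng.random((n_users, n_videos)) < 0.1).astype(float)

# Cross-occurrence: K[v, q] = number of users who watched video v and issued query q.
K = B.T @ A

query = 3                              # hypothetical query id
scores = K[:, query]
top_videos = np.argsort(-scores)[:5]
print("videos to show for query", query, ":", top_videos)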
The punch-line
• B’A recommends videos in response to a
query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or metadata)
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
Real-life example
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever
users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering is not a goal
– we don’t care whether exactly the same clustering is replicated
• Generalization is critical
• Agreement with a “gold standard” is a non-issue
An Example
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Clusters as Distribution Surrogate
THEORY
For Example
Grouping these two clusters seriously hurts squared distance:
D₄²(X) > (1/σ²) D₅²(X)
ALGORITHMS
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s algorithm
The result is that these two clusters get glued together
Ball k-means
• Provably better for highly clusterable data
• Tries to find the initial centroids in the “core” of each real cluster
• Avoids outliers in the centroid computation
initialize centroids randomly, with a distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign the point to the nearest cluster
    recompute each centroid using only the points much closer to it than to the next-closest cluster
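Below is a compact Python sketch of the recipe on this slide. It is an illustration, not Mahout's BallKMeans: the seeding uses a k-means++-style distance-weighted draw, and the “much closer” trimming rule is approximated by keeping only points within half the distance to the nearest other centroid.

import numpy as np

def ball_kmeans(X, k, iterations=3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Seed with a k-means++-style, distance-maximizing initialization.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([np.square(X - c).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)

        # Recompute each centroid using only points well inside its "ball":
        # closer to this centroid than half the distance to the nearest other centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            mask = assign == j
            if not mask.any():
                continue
            others = np.delete(np.arange(k), j)
            cutoff = 0.5 * np.sqrt(d2[mask][:, others].min(axis=1))
            core = mask.copy()
            core[mask] = np.sqrt(d2[mask, j]) <= cutoff
            if core.any():
                new_centroids[j] = X[core].mean(axis=0)
        centroids = new_centroids
    return centroids

# Tiny synthetic demo: three well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ([0, 0], [5, 5], [0, 5])])
print(ball_kmeans(X, k=3))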
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time
• But for big data, k gets large
Surrogate Method
• Start with a sloppy clustering into lots of clusters: κ = k log n of them
• Use this sketch as a weighted surrogate for the data
• Results are provably good for highly clusterable data
Algorithm Costs
• Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for the nearest cluster: O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of the κ weighted centroids:
O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 10⁸ (so log₂ n ≈ 26)
– k d log n = 2000 × 10 × 26 ≈ 500,000
– d (log k + log log n) ≈ 10 × (11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
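For the record, the quoted ratio is just the two per-point costs divided out (same numbers as the slide, logs in base 2):

import math

# Per-point cost of brute-force assignment vs. the sketch-based search.
k, d, n = 2000, 10, 100_000_000
brute = k * d * math.log2(n)
sketch = d * (math.log2(k) + math.log2(math.log2(n)))
print(round(brute), round(sketch), round(brute / sketch))   # roughly 531508 157 3386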
How It Works
• For each point
– Find the approximately nearest centroid (call the distance d)
– If d > threshold, make the point a new centroid
– Else, with probability d/threshold (a uniform random draw u < d/threshold), make it a new centroid anyway
– Else add it to the nearest centroid
• If the number of centroids exceeds κ ≈ C log N
– Recursively cluster the centroids with a higher threshold
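A self-contained sketch of this streaming pass (simplified relative to Mahout's StreamingKMeans): u is treated as a uniform random draw, the initial threshold and its growth factor are arbitrary choices of mine, and the recursive re-clustering step is approximated by a single merge pass over the centroids.

import numpy as np

def streaming_sketch(X, k, seed=0):
    """One-pass sketch clustering: returns weighted centroids (the surrogate)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    kappa = max(1, int(k * np.log(n)))   # target sketch size: kappa = k log n
    threshold = 1e-3                      # initial distance threshold (assumption)

    centroids = [X[0].copy()]
    weights = [1.0]

    for x in X[1:]:
        C = np.array(centroids)
        d2 = ((C - x) ** 2).sum(axis=1)
        j = int(d2.argmin())
        d = float(np.sqrt(d2[j]))

        # New centroid if far away, or (with probability ~ d/threshold) even if not.
        if d > threshold or rng.random() < d / threshold:
            centroids.append(x.copy())
            weights.append(1.0)
        else:
            # Fold the point into the nearest centroid (weighted mean update).
            w = weights[j]
            centroids[j] = (w * centroids[j] + x) / (w + 1.0)
            weights[j] = w + 1.0

        # Too many centroids: collapse the sketch with a larger threshold
        # (a single merge pass standing in for the recursive step on the slide).
        if len(centroids) > kappa:
            threshold *= 1.5
            keep_c, keep_w = [centroids[0]], [weights[0]]
            for c, w in zip(centroids[1:], weights[1:]):
                d2 = ((np.array(keep_c) - c) ** 2).sum(axis=1)
                j = int(d2.argmin())
                if np.sqrt(d2[j]) > threshold:
                    keep_c.append(c)
                    keep_w.append(w)
                else:
                    total = keep_w[j] + w
                    keep_c[j] = (keep_w[j] * keep_c[j] + w * c) / total
                    keep_w[j] = total
            centroids, weights = list(keep_c), list(keep_w)

    return np.array(centroids), np.array(weights)

# Demo on synthetic blobs; the weighted centroids can then be fed to ball k-means.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 0.2, size=(300, 2)) for m in ([0, 0], [4, 0], [0, 4])])
sketch, w = streaming_sketch(X, k=3)
print(len(sketch), "weighted centroids in the sketch")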
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop
• This could take O(dκ) per point, and κ can be big
• Happily, approximate nearest centroid works
fine
Projection Search
(figure; annotation: “total ordering!”)
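The projection-search idea behind the figure: project the centroids onto a few random directions, each of which gives a total ordering, and only check exactly the centroids that land near the query point in one of those orderings. A minimal sketch follows; the class name, candidate width, and number of projections are my own choices, not Mahout's.

import bisect
import numpy as np

class ProjectionSearch:
    """Approximate nearest-centroid search via a few random 1-D projections."""

    def __init__(self, centroids, n_projections=4, search_width=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = np.asarray(centroids)
        d = self.centroids.shape[1]
        # Random unit directions; each gives a total ordering of the centroids.
        self.directions = rng.normal(size=(n_projections, d))
        self.directions /= np.linalg.norm(self.directions, axis=1, keepdims=True)
        self.width = search_width
        # For each direction, keep centroid projections sorted with their indices.
        self.sorted_proj = []
        for u in self.directions:
            proj = self.centroids @ u
            order = np.argsort(proj)
            self.sorted_proj.append((proj[order], order))

    def nearest(self, x):
        # Collect candidates near x's projection in each ordering, then check exactly.
        candidates = set()
        for u, (proj, order) in zip(self.directions, self.sorted_proj):
            pos = bisect.bisect_left(proj.tolist(), float(x @ u))
            lo = max(0, pos - self.width)
            hi = min(len(order), pos + self.width)
            candidates.update(order[lo:hi].tolist())
        cand = np.array(sorted(candidates))
        d2 = ((self.centroids[cand] - x) ** 2).sum(axis=1)
        return int(cand[d2.argmin()])

# Demo: compare against the exact nearest centroid on random data.
rng = np.random.default_rng(1)
C = rng.normal(size=(1000, 10))        # 1000 centroids in 10 dimensions
ps = ProjectionSearch(C)
x = rng.normal(size=10)
exact = int(((C - x) ** 2).sum(axis=1).argmin())
print("approximate:", ps.nearest(x), "exact:", exact)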
LSH Bit-match Versus Cosine
(figure: actual cosine similarity, from −1 to 1, plotted against the number of matching bits in a 64-bit LSH signature, from 0 to 64)
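The figure plots the actual cosine between vectors against how many of their 64 LSH (sign-of-random-projection) bits match. A small simulation of that relationship, using the standard estimate cos(π(1 − matches/64)); the data and parameters here are synthetic.

import numpy as np

rng = np.random.default_rng(0)
d, n_bits, n_pairs = 20, 64, 1000

# Random hyperplanes define a 64-bit sign-projection signature per vector.
planes = rng.normal(size=(n_bits, d))

def signature(v):
    return (planes @ v) > 0

cosines, matches = [], []
for _ in range(n_pairs):
    a, b = rng.normal(size=d), rng.normal(size=d)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    m = int((signature(a) == signature(b)).sum())
    cosines.append(cos)
    matches.append(m)

# Expected relation: cos(theta) with theta = pi * (1 - matches / n_bits).
pred = np.cos(np.pi * (1 - np.array(matches) / n_bits))
err = np.abs(pred - np.array(cosines))
print("mean |predicted - actual cosine|:", float(err.mean()))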
RESULTS
Parallel Speedup?
(figure: time per point in μs versus number of threads, 1–16, comparing the threaded and non-threaded versions against perfect scaling)
Quality
• Ball k-means implementation appears significantly
better than simple k-means
• Streaming k-means + ball k-means appears to be about
as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance
to nearest cluster
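The figure of merit is easy to compute once you have centroids and held-out points; a generic sketch (not the code used for the 20 newsgroups experiment):

import numpy as np

def clustering_cost(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    d2 = ((held_out[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return float(d2.mean()), float(np.median(d2))

# Toy usage with synthetic data standing in for the real document vectors.
rng = np.random.default_rng(3)
centroids = rng.normal(size=(20, 50))
held_out = rng.normal(size=(500, 50))
print(clustering_cost(held_out, centroids))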
Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout