Practical Machine Learning with Mahout
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
  – particularly Mahout, ZooKeeper and Drill

     (we’re hiring)

• Contact me at
 tdunning@maprtech.com
 tdunning@apache.org
 ted.dunning@gmail.com
 @ted_dunning
Agenda
• What works at scale
• Recommendation
• Unsupervised - Clustering
What Works at Scale
• Logging
• Counting
• Session grouping
What Works at Scale
• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more
  complex than these
What Works at Scale
• Logging
• Counting
• Session grouping

• Really. Don’t bet on anything much more
  complex than these
• These are harder than they look
Recommendations
Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also
  bought y”

• But soooo much more is possible
Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and
  Maes) or movies (Riedl, et al), (Netflix)
• Internet radio listeners not skipping songs
  (Musicmatch)
• Internet video watchers watching >30 s
Dyadic Structure
• Functional
  – Interaction: actor -> item*
• Relational
  – Interaction ⊆ Actors x Items
• Matrix
  – Rows indexed by actor, columns by item
  – Value is count of interactions
• Predict missing observations
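
To make the dyadic structure concrete, here is a minimal sketch (not from the original deck) that builds the actor-by-item count matrix from a toy interaction log; the log contents and names are illustrative assumptions.

import numpy as np

# Hypothetical interaction log of (actor, item) pairs, e.g. (user, purchased book).
log = [("alice", "book1"), ("alice", "book2"),
       ("bob", "book2"), ("bob", "book3"),
       ("carol", "book2")]

actors = sorted({a for a, _ in log})
items = sorted({i for _, i in log})
actor_idx = {a: r for r, a in enumerate(actors)}
item_idx = {i: c for c, i in enumerate(items)}

# A[actor, item] = count of interactions; predicting the missing (zero) cells
# is the recommendation problem.
A = np.zeros((len(actors), len(items)))
for a, i in log:
    A[actor_idx[a], item_idx[i]] += 1
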
Recommendations Analysis
• R(x,y) = # people who bought x also bought y

select A.item_id as x, B.item_id as y, count(*)
from (select distinct user_id, item_id from log) A
join (select distinct user_id, item_id from log) B
  on A.user_id = B.user_id
group by A.item_id, B.item_id
Recommendations Analysis


   $R_{ij} = \sum_u A_{ui} B_{uj}$

   $R = A^\top B$
Fundamental Algorithmic Structure
• Cooccurrence
    $K = A^\top A$
• Matrix approximation by factoring
    $A \approx U S V^\top$
    $K \approx V S^2 V^\top$
    $r = V S^2 V^\top h$
• LLR
    $r = \mathrm{sparsify}(A^\top A)\, h$
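
As a concrete illustration of this structure, here is a minimal numpy sketch (an illustration only, not Mahout's implementation): it forms the cooccurrence matrix K = AᵀA, keeps only anomalously large cells using the G² log-likelihood ratio (LLR) test, and scores items against one user's history h. The toy matrices and the cutoff value are assumptions.

import numpy as np
from math import log

def llr(k11, k12, k21, k22):
    # G² log-likelihood ratio for a 2x2 cooccurrence contingency table.
    def h(*counts):
        total = sum(counts)
        return sum(c * log(c / total) for c in counts if c > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

def sparsify(A, cutoff):
    """Keep only item-item cooccurrences whose LLR passes the cutoff."""
    B = (A > 0).astype(float)          # binarize interactions
    K = B.T @ B                        # item-item cooccurrence counts, K = A'A
    counts = B.sum(axis=0)             # number of users per item
    n = B.shape[0]                     # number of users
    S = np.zeros_like(K)
    for i in range(K.shape[0]):
        for j in range(K.shape[1]):
            if i == j or K[i, j] == 0:
                continue
            k11 = K[i, j]                          # users with both i and j
            k12 = counts[i] - k11                  # i but not j
            k21 = counts[j] - k11                  # j but not i
            k22 = n - counts[i] - counts[j] + k11  # neither
            if llr(k11, k12, k21, k22) > cutoff:
                S[i, j] = 1.0          # indicator of an anomalously strong link
    return S

# Toy user-by-item matrix and one user's history; the cutoff is arbitrary.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
h = np.array([1, 0, 0, 0], dtype=float)
r = sparsify(A, cutoff=1.0) @ h        # r = sparsify(A'A) h

The dense double loop is purely illustrative; at scale the same structure is computed with sparse matrices and parallel jobs.
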
But Wait!
• Cooccurrence
    $K = A^\top A$
• Cross occurrence
    $K = B^\top A$
For example
• Users enter queries (A)
  – (actor = user, item=query)
• Users view videos (B)
  – (actor = user, item=video)
• A’A gives query recommendation
  – “did you mean to ask for”
• B’B gives video recommendation
  – “you might like these videos”
The punch-line
• B’A recommends videos in response to a
  query
  – (isn’t that a search engine?)
  – (not quite, it doesn’t look at content or meta-data)
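
A tiny hypothetical sketch of this cross recommendation, in the same style as the cooccurrence sketch above (in practice BᵀA would be LLR-sparsified in the same way):

import numpy as np

# Hypothetical data: A = users x queries, B = users x videos, same user rows.
A = np.array([[1, 0],      # user 0 issued query 0
              [1, 0],      # user 1 issued query 0
              [0, 1]],     # user 2 issued query 1
             dtype=float)
B = np.array([[1, 0, 0],   # user 0 watched video 0
              [1, 1, 0],   # user 1 watched videos 0 and 1
              [0, 0, 1]],  # user 2 watched video 2
             dtype=float)

K = B.T @ A                          # cross occurrence: videos x queries, K = B'A

q = np.array([1, 0], dtype=float)    # a session that issued query 0
video_scores = K @ q                 # -> [2, 1, 0]: videos 0 and 1 get recommended
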
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres del paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff
Real-life example
Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
  – This gives A = users x label clicks
• Remember viewing history
  – This gives B = users x items
• Cross recommend
  – B’A = label to item mapping
• After several users click, results are whatever
  users think they should be
Super-fast k-means Clustering
RATIONALE
What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue
An Example
An Example
Diagonalized Cluster Proximity
Clusters as Distribution Surrogate
Clusters as Distribution Surrogate
THEORY
For Example
   $D_4^2(X) > \frac{1}{\sigma^2} D_5^2(X)$

   (squared-distance cost with the best 4 clusters versus the best 5 clusters)

   Grouping these two clusters seriously hurts squared distance
ALGORITHMS
Typical k-means Failure

  Selecting two seeds here cannot be fixed with Lloyd's algorithm.

  Result is that these two clusters get glued together.
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

  initialize centroids randomly with a distance-maximizing tendency
  for each of a very few iterations:
      for each data point:
          assign the point to the nearest centroid
      recompute each centroid using only the points much closer to it
      than to the next-closest centroid
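
Here is a minimal, unoptimized Python sketch of that pseudocode (an illustration, not the Mahout implementation); the k-means++-style seeding, the trim factor, and the iteration count are assumptions.

import numpy as np

def ball_kmeans(X, k, iterations=3, trim=0.5, seed=0):
    """Sketch of ball k-means: ordinary assignment, but each centroid is
    recomputed from only the points much closer to it than to the
    next-closest centroid (the "ball"), which suppresses outliers."""
    rng = np.random.default_rng(seed)
    # Seed with a distance-maximizing tendency (k-means++ style sampling).
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centroids = np.array(centroids)

    for _ in range(iterations):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        order = np.argsort(d, axis=1)
        nearest, second = order[:, 0], order[:, 1]
        rows = np.arange(len(X))
        # Keep only points well inside the ball of their nearest centroid.
        in_ball = d[rows, nearest] < trim * d[rows, second]
        for j in range(k):
            pts = X[(nearest == j) & in_ball]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids

# Three well-separated Gaussian blobs; ball k-means recovers the three centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(40, 2)) for m in ([0, 0], [4, 0], [0, 4])])
print(ball_kmeans(X, k=3))
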
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k³d) time
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k³d) time

• But for big data, k gets large
Surrogate Method
• Start with a sloppy clustering into lots of clusters,
     κ = k log n
• Use this sketch as a weighted surrogate for the
  data
• Results are provably good for highly
  clusterable data
Algorithm Costs
• Surrogate methods
  – fast, sloppy single-pass clustering with κ = k log n
  – fast, sloppy search for the nearest cluster:
     O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of the κ weighted centroids:
     O(κ k d + k³ d) = O(k² d log n + k³ d) for small k, high quality
     O(κ d log k) = O(d log k (log k + log log n)) for larger k, looser quality
  – result is k high-quality centroids
     • For many purposes, even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 × 10 × 26 ≈ 500,000
  – d (log k + log log n) = 10 × (11 + 5) = 160
  – 3,000 times faster is a bona fide big deal
How It Works
• For each point
  – find the approximately nearest centroid (distance = d)
  – if d > threshold, start a new centroid
  – else start a new centroid with probability d / threshold
  – otherwise add the point to the nearest centroid
• If the number of centroids exceeds κ ≈ C log N
  – recursively cluster the centroids with a higher threshold
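
A single-pass Python sketch of this loop (an illustration of the idea, not Mahout's StreamingKMeans); the threshold growth factor, the probabilistic rule written as a probability, and the toy data are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sketch_cluster(points, weights, kappa, threshold):
    """Single-pass sketch clustering: far-away points probabilistically become
    new centroids; when the centroid count passes kappa, the centroids are
    themselves re-clustered with a larger threshold."""
    cents = [np.array(points[0], dtype=float)]
    wts = [float(weights[0])]
    for x, w in zip(points[1:], weights[1:]):
        d = np.linalg.norm(np.array(cents) - x, axis=1)
        j = int(np.argmin(d))                    # (approximately) nearest centroid
        if rng.random() < min(1.0, d[j] / threshold):
            cents.append(np.array(x, dtype=float))   # far away: start a new centroid
            wts.append(float(w))
        else:                                    # fold into the nearest centroid
            cents[j] = (wts[j] * cents[j] + w * x) / (wts[j] + w)
            wts[j] += w
        if len(cents) > kappa:                   # collapse: recurse on the centroids
            threshold *= 1.5
            cents, wts, threshold = sketch_cluster(np.array(cents), np.array(wts),
                                                   kappa, threshold)
    return cents, wts, threshold

# The roughly κ = k log n weighted centroids then act as a surrogate for the data.
X = rng.normal(size=(2000, 5))
centroids, weights, _ = sketch_cluster(X, np.ones(len(X)), kappa=200, threshold=1.0)
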
IMPLEMENTATION
But Wait, …
• Finding nearest centroid is inner loop

• This could take O( d κ ) per point and κ can be
  big

• Happily, approximate nearest centroid works
  fine
Projection Search
               total ordering!
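
A minimal sketch of projection search for the approximate nearest-centroid lookup; the single random projection and the fixed candidate window are simplifying assumptions (real implementations typically combine several projections).

import numpy as np

class ProjectionSearch:
    """Approximate nearest-centroid search: project everything onto a random
    direction to get a total ordering, then examine only centroids that land
    near the query in that ordering."""

    def __init__(self, centroids, window=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = np.asarray(centroids, dtype=float)
        u = rng.normal(size=self.centroids.shape[1])
        self.u = u / np.linalg.norm(u)
        self.order = np.argsort(self.centroids @ self.u)   # total ordering along u
        self.keys = (self.centroids @ self.u)[self.order]  # sorted projection values
        self.window = window

    def nearest(self, x):
        # Locate x in the ordering, then scan a small window of candidates.
        pos = int(np.searchsorted(self.keys, x @ self.u))
        lo, hi = max(0, pos - self.window), min(len(self.keys), pos + self.window)
        candidates = self.order[lo:hi]
        d = np.linalg.norm(self.centroids[candidates] - x, axis=1)
        return int(candidates[np.argmin(d)])

# Usage: find the (approximately) nearest of 1,000 centroids in 10 dimensions.
cents = np.random.default_rng(1).normal(size=(1000, 10))
print(ProjectionSearch(cents).nearest(np.zeros(10)))
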
LSH Bit-match Versus Cosine

[Scatter plot: cosine similarity (y axis, −1 to 1) versus number of matching LSH bits (x axis, 0 to 64)]
RESULTS
Parallel Speedup?
[Chart: time per point (μs) versus number of threads (1–20), comparing the threaded and non-threaded versions against perfect scaling]
Quality
• Ball k-means implementation appears significantly
  better than simple k-means

• Streaming k-means + ball k-means appears to be about
  as good as ball k-means alone

• All evaluations on 20 newsgroups with held-out data

• Figure of merit is mean and median squared distance
  to nearest cluster
Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Get the code as part of Mahout trunk (or 0.8 very soon)

• Contact me at tdunning@maprtech.com or @ted_dunning


• Share news with @apachemahout
