Collective Intelligence Engineering. Not the bigger research picture (Tom). Progress. Findings. Proposal.
Progress Reading.
Progress Investigate Compute Problem.
Progress Investigate Compute Problem. Pearson example.
Progress Investigate Compute Problem. Pearson example. Store all comparisons = 1/2 N^2
Progress Investigate Scale. N films, M people. M[N(N-1)/2] times the algorithm cost. Pearson, per term: numerator 2 subtractions, 1 multiplication, 1 addition; denominator 2 subtractions, 2 squares, 2 additions. M(N) time to compute averages, which can be done on ingest in M(N) time.
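As a rough illustration of where those operation counts come from, here is a minimal Python sketch of the pair count and a textbook Pearson correlation. The function names `pair_count` and `pearson` are illustrative assumptions, not taken from the original work, which was written in C++.

```python
from math import sqrt

def pair_count(n):
    """Number of distinct item pairs to compare: N(N-1)/2."""
    return n * (n - 1) // 2

def pearson(xs, ys):
    """Textbook Pearson correlation of two equal-length rating vectors."""
    m = len(xs)
    mean_x = sum(xs) / m  # the averages could be precomputed on ingest
    mean_y = sum(ys) / m
    # Numerator, per term: 2 subtractions, 1 multiplication, 1 addition.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator, per term: 2 subtractions, 2 squares, 2 additions.
    den = sqrt(sum((x - mean_x) ** 2 for x in xs)) * sqrt(
        sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den if den else 0.0
```

Each of the N(N-1)/2 pairs walks up to M ratings, which is where the M[N(N-1)/2] factor comes from.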
Progress Tractable? A typical P4 has a theoretical max of 20-40 GFLOPS. With L2 cache bandwidth, supporting instructions, etc., a max of 3-7 GFLOPS is more realistic (my further benchmarking shows 7 GFLOPS on a dual-core Centrino). What could we expect from various technologies? Matrix multiplication is a good estimate...
Progress 41 mins 17 mins 8 mins
Progress 40 Seconds
Progress Genial MFlops, which correlate well with: http://www.ient.rwth-aachen.de/~laurent/genial/benchmark_gemm_4T.html More investigation details at: http://jira.talis.com/browse/COL-5
Progress Computation. A 1M x 1M dense matrix multiply requires at least 1M^3 floating-point operations = 1E18 (one exa-scale count of operations). On a single P4 CPU at 7 GFLOPS this would take 1E18 / 7E9 = 142E6 seconds, or 1653 days. Even a 100,000 x 100,000 matrix has a theoretical time of 1.65 days. Of course, comparisons are ½ this.
Progress Realisation. Huge compute problem. 1M matrix: 1,650 days. Parallelise? 16.5 days on 100 nodes, 1.65 days on 1,000 nodes. 10M matrix: 1,650,000 days. Parallelise? 16,500 days on 100 nodes, 1,650 days on 1,000 nodes, 165 days on 10,000 nodes.
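The estimates above follow from a single formula: N^3 operations divided by the sustained rate. A minimal sketch, assuming perfectly linear scaling across nodes (an optimistic assumption) and using a hypothetical helper name `dense_multiply_days`:

```python
SECONDS_PER_DAY = 86_400

def dense_multiply_days(n, flops_per_second=7e9, nodes=1):
    """Estimated days for an n x n dense multiply at N^3 operations.

    Assumes 7 GFLOPS per node (the benchmarked figure) and ideal
    linear speed-up across nodes, with no communication overhead.
    """
    operations = float(n) ** 3
    seconds = operations / (flops_per_second * nodes)
    return seconds / SECONDS_PER_DAY

print(dense_multiply_days(1_000_000))               # about 1653 days
print(dense_multiply_days(10_000_000, nodes=10_000))  # about 165 days
```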
Progress Brute force. IBM's US$133M Roadrunner, sustaining over 1 petaFLOPS: 12,960 IBM PowerXCell 8i CPUs and 6,480 AMD Opteron dual-core processors. PowerXCell: 32 GFLOPS (similar to GPUs). At 1 PFLOPS the N^3 estimate gives 1M matrix = 1,000 seconds (10M = 11.6 days).
Progress Brute force. Folding@Home (free!) has reached over 4.1 PFLOPS. By the N^3 estimate, 1M matrix = about 244 seconds (1E18 / 4.1E15). ATI Radeon™ HD 4870 X2: 2.4 teraFLOPS. 500 * $500 + 250 * $1,000 (each backplane) = $500,000 (£331,665) for 1 PFLOPS.
Progress Optimisations. Intuitively sparse: ignore nulls? How sparse? True Pearson for linear algebra requires zeros, but nulls? Depends on the data; generally yes. E.g. three people A, B, C, where A has seen no films in common with B or C. A has seen 10 films, B 5 and C 15. The Pearson numerator for B would be -15 and for C -25, so C is less similar to A than B is. So can ignore nulls - tfft!
Progress 600 Elsevier Full Text Articles. Single core running C++ processes 20 articles / 80,000 terms per second. Computations way faster than dense matrix. Only 600 articles 150,000 unique terms. A little distraction...
Progress
Progress (from van Rijsbergen, 1979) The most frequent words are not the most descriptive. More Optimisation. Word Count (LSA) Characteristics. Be careful, the lower discriminatory words can provide good information... (and serendipity)
Progress
Progress How Sparse? Term Document Count from 2.18 million DBPedia Abstracts
Progress Distraction.... Some Least Popular (stemmed) – 3 docs each Accretionari - an increase in a beneficiary's share in an estate Accordiana - a musical radio series which was heard on CBS in 1934 Accokeek -  Located in the southwest corner of Prince George's County Nazarbayev - President of Kazakhstan The Most Popular  15938 – year 12476 – season 11410 – state 10758 – world 10722 – name Serendipity
Progress Very Sparse. Turns out to be a Zipf-Mandelbrot distribution. [1] G. K. Zipf, Human Behavior and the Principle of Least Effort (Cambridge, Mass., 1949; Addison-Wesley, 1965). [2] B. Mandelbrot, “An informational theory of the statistical structure of language”, in Communication Theory, ed. Willis Jackson (Butterworths, 1953). Word count is .0025% dense. Ignore nulls for a huge optimisation: 40,000x less compute (using uniform density assumption). Zipf-Mandelbrot has the form: y = P1/(x+P2)^P3.
Progress DBPedia words follow Zipf-Mandelbrot Zoomed in chunk of DBPedia word count y = P1/(x+P2)^P3 best fit regression (red curve)  with factors  P1 = 874150  P2 = 60.0000  P3 = 1.01000
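For reference, the fitted curve can be evaluated directly. This is a minimal sketch that assumes x is the term's rank (the slide does not say so explicitly); the function name `zipf_mandelbrot` is illustrative, and the defaults are the fitted factors quoted above.

```python
def zipf_mandelbrot(rank, p1=874150.0, p2=60.0, p3=1.01):
    """Zipf-Mandelbrot curve y = P1 / (x + P2)^P3.

    `rank` is assumed to be the term's popularity rank; the default
    parameters are the DBPedia best-fit factors from the slide.
    """
    return p1 / (rank + p2) ** p3

# The predicted document count falls steadily as rank grows.
for rank in (1, 10, 100, 1000):
    print(rank, zipf_mandelbrot(rank))
```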
Progress Calculate Comparisons. 2.18 million abstracts, 1.3M unique terms. Fits in 2G RAM (1.3M * 2.18M * .000025 * 21 = 1.5G, as it is .0025% dense and each entry is 21 bytes). Uniform density assumption: comparisons computable in a few minutes. Not storable in RAM (3.6 Tbytes!). Big underestimate (stopped the run after 4 hours). Stored a random 100-article sample, comparing with all 2M others (0.2 GB), to allow intuitive QA.
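The RAM figure is a straightforward product, sketched here so the inputs can be varied. The helper name `sparse_matrix_bytes` is an assumption for illustration; the 21 bytes per entry is the figure quoted on the slide.

```python
def sparse_matrix_bytes(rows, cols, density, bytes_per_entry):
    """Approximate memory for a sparse matrix that stores only non-null cells."""
    return rows * cols * density * bytes_per_entry

# 1.3M terms x 2.18M abstracts, .0025% dense, 21 bytes per stored entry.
estimate = sparse_matrix_bytes(1.3e6, 2.18e6, 0.000025, 21)
print(estimate / 1e9)  # roughly 1.5 (GB)
```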
Progress Extrapolating Compute Times. Word count is .0025% dense, but compute is not 40,000x less. For the top 1,000 most commonly occurring terms the density is 0.78%, but compute is not 128x less either. So use the square of the area under the curve = integral of Zipf-Mandelbrot, squared. Roughly a power law: y = (x^P)/N.
Progress DBPedia Abstracts Ops (not including algorithm cost). Regression using simple power law (P = 2.1, N = 1E0.73).
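A power law y = (x^P)/N becomes a straight line in log-log space, log y = P log x - log N, so P and N can be recovered with ordinary least squares. The sketch below is one common way to do such a fit, not necessarily the regression used for the slide; `fit_power_law` is a hypothetical name and the sample data is synthetic.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = x^P / N by least squares on (log x, log y). Returns (P, N)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    p = sxy / sxx                    # slope of the log-log line
    intercept = mean_y - p * mean_x  # equals -log N
    return p, math.exp(-intercept)

# Synthetic check: data generated from y = x^2 / 4.
xs = [1, 2, 4, 8]
ys = [x ** 2 / 4 for x in xs]
p, big_n = fit_power_law(xs, ys)
```

With real operation counts the points will not sit exactly on a line, so the fitted P and N are only as good as the power-law approximation.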
Progress Extrapolating Compute Times. 2 million abstracts: square of integral of Zipf-Mandelbrot predicts calculable in 5.42 hours, assuming all in RAM. But writing gigabytes to disk is a big overhead. (Not run this yet to prove.)
Progress Loans Data Ops (not including algorithm cost) (proving power law linear regression prediction). Regression using simple power law (P = 3.5, N = 1E10.15).
Progress Loans Data. Hereford Libraries. C++ in-memory 'super fast hash' processed 19M loans in 1 min 20 sec, producing 269,000 unique borrowers and 491,000 unique books. 8 million unique loan events.
Progress Loans per Individual – Nice Zipf Mandelbrot curve
Progress Zipf-Mandelbrot: Good Assumption? Most data (all large complex systems?) that we are likely to process will follow a Zipf-Mandelbrot model. Z. K. Silagadze shows [1] that these comply... Clickstreams, Page-rank (linkage/centrality), citations, other long-tail interactions. [1] Z. K. Silagadze, Citations and the Zipf-Mandelbrot's law, [physics.soc-ph], 26 Jan 1999, Budker Institute of Nuclear Physics, 630 090, Novosibirsk, Russia.
Findings Can't store all comparisons: 50 TB for a 10M matrix (½ N^2 for a triangle matrix). Store only meaningful? Threshold = n or f. Can compute all 2M (squared) comparisons in 6 hours (1 core). Can't compute 1 billion comparisons: 287 years (7 days on 20,000 cores; 10 billion?). Zipf-Mandelbrot curve is useful. Can store all(?) raw metrics. n = count (fixed), f = factor (Z-M).
Findings Zipf-Mandelbrot Curve is Useful. Head: big proportion of compute, large M1 M2 intersection, low discrimination. Body: good info, medium compute. Tail: specialist, trivial or no compute.
Findings Zipf-Mandelbrot Curve is Useful. Allows us to make optimisations in reducing the y axis (and x a bit). Head: chop the head off. Body: dimensionality reduction. Tail: chop the tail off, or dimensionality reduction. X axis = N^2, Y axis = M.
Findings What about storing meaningful comparisons? Solves the storage problem. Requires repeated compute. Deltas could affect the whole set; will affect a chunk of the set. Could trade off timely accuracy with batch processing.
Proposal Store raw curve. Sparse storage: Bigtable-like (HBase, Hypertable, etc.). Unloads indexing and lookup to nodes. Calculate on the fly. Two indices: Books -> people and People -> books. Not 1/2 N^2, just 1 * Intersect (* M). Tail: retrieval problem, M~0, Intersect~0. Body: some retrieval and compute. Head: big retrieval and compute, big M, big intersect.
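The two-index idea can be sketched in memory with plain dictionaries standing in for the sparse store. This is an illustrative assumption, not the HBase schema used in the experiments; `build_indexes` and `similar_books` are hypothetical names, and similarity here is just the count of shared borrowers.

```python
from collections import defaultdict

def build_indexes(loans):
    """Build Books -> people and People -> books from (person, book) pairs."""
    books_to_people = defaultdict(set)
    people_to_books = defaultdict(set)
    for person, book in loans:
        books_to_people[book].add(person)
        people_to_books[person].add(book)
    return books_to_people, people_to_books

def similar_books(book, books_to_people, people_to_books):
    """Count shared borrowers between `book` and each co-borrowed book.

    Only books that intersect with `book` are ever touched, so the work
    is proportional to the intersection, not to 1/2 N^2.
    """
    overlap = defaultdict(int)
    for person in books_to_people.get(book, ()):
        for other in people_to_books[person]:
            if other != book:
                overlap[other] += 1
    # Highest overlap first; ties broken by book id for stable output.
    return sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))

loans = [("ann", "b1"), ("ann", "b2"), ("bob", "b1"), ("bob", "b2"), ("bob", "b3")]
b2p, p2b = build_indexes(loans)
print(similar_books("b1", b2p, p2b))  # [('b2', 2), ('b3', 1)]
```

A tail book with few borrowers returns almost immediately, while a head book fans out across many borrowers and many books, which matches the tail/body/head cost split described above.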
Proposal Pre-compute Head. Store top n. Store any that take more than .5 seconds. Zipf-Mandelbrot: retrieval-only problem. Dynamic: finding them is linear. Store as a cache, only when requested? Depends on acceptable delay? This hybrid scales better, better than storing all or computing all.
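One way to read "store any that take more than .5 seconds" is a cache that keeps a result only when computing it was slow. The sketch below is a hypothetical in-memory version; the class name `SlowResultCache` and its interface are assumptions, and a real deployment would presumably write to the sparse store instead of a dict.

```python
import time

class SlowResultCache:
    """Compute on the fly, but keep results that were expensive to produce."""

    def __init__(self, compute, threshold_seconds=0.5):
        self.compute = compute                  # function: key -> result
        self.threshold_seconds = threshold_seconds
        self.stored = {}

    def get(self, key):
        if key in self.stored:
            return self.stored[key]             # head item: retrieval only
        start = time.perf_counter()
        result = self.compute(key)
        elapsed = time.perf_counter() - start
        if elapsed >= self.threshold_seconds:
            self.stored[key] = result           # slow enough to be worth storing
        return result
```

Cheap tail and body requests are recomputed every time, so storage grows only with the slow head items.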
Proposal It doesn't scale indefinitely: scale by 10, nodes by 100. Dimensionality reduction will HAVE to kick in. This approach allows that, but at bigger scales than most. Consider severing head and tail as an early approach. Only optimised for individual requests: “Given this article, find the 10 next most similar.” What about... “Given this corpus, find the 100 most similar things”? Then set n = infinity (or f = 0) and the service will tell you how many days to come back for your results.
Proposal Experimentation. Used HBase, HDFS, Hadoop. Not using Hadoop yet, but it is a good fit for data ingest. Hadoop not very efficient for comparison, but doable. Used loan data and binary Pearson, ignoring nulls (and sigma), so counts only. Quick demo.
Proposal Next Steps. Prove this approach. Performance testing: HBase over n nodes (perf lab, possibly then EC2?). Timing retrieval vs compute. Good logging. Configurable variables. Multiple 'stores' (data sets). 8M loans now; 80M? Hadoop the ingest, if just to save time during trials.

End of Sprint 5
