Linear Regression on 1 Terabyte of Data?
Some Crazy Observations and Actions
Hesen Peng
Amazon.com
Big Data Exploration with Amazon
Model building procedure for a major internet company
Planning and Idea Generation
Data collection
Model building and offline evaluation
Implementation for application online
Performance evaluation in real world
Experiment Design, Clinical Trial
Major Machine Learning/Stat research
Interesting weekend project
Unsupervised Machine Learning, Survival analysis
Power Point
Linear regression with 1TB of data
Wanna try it out?
• Use Amazon Web Services! (with free tier)
– http://aws.amazon.com/education/
• Write a simple distributed algorithm:
– Python: MRJob (https://github.com/Yelp/mrjob)
– R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
– Launch your own Sun/Oracle Grid Engine environment for parallel computing (http://star.mit.edu/cluster/)
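A minimal sketch of what a "simple distributed algorithm" for linear regression could look like, written in plain Python rather than against the MRJob or RHadoop APIs. It assumes the usual least-squares approach: each chunk of data is reduced to its partial sums X^T X and X^T y, the partial sums are added together, and the small normal-equation system is solved once at the end. The names map_chunk, combine, solve and fit, and the in-memory chunks, are illustrative assumptions and do not come from the talk.

```python
from functools import reduce


def map_chunk(rows):
    """Mapper: reduce one chunk of (features, target) rows to X^T X and X^T y.

    A constant 1.0 is prepended to every feature vector for the intercept.
    """
    p = len(rows[0][0]) + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for features, target in rows:
        x = [1.0] + list(features)
        for i in range(p):
            xty[i] += x[i] * target
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reducer: partial sums from two chunks simply add element-wise."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination with pivoting."""
    n = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(chunks):
    """Fit ordinary least squares over an iterable of data chunks."""
    xtx, xty = reduce(combine, map(map_chunk, chunks))
    return solve(xtx, xty)


if __name__ == "__main__":
    # Two hypothetical chunks drawn from y = 2 + 3x.
    chunks = [
        [((0.0,), 2.0), ((1.0,), 5.0)],
        [((2.0,), 8.0), ((3.0,), 11.0)],
    ]
    print(fit(chunks))  # intercept and slope, approximately 2 and 3
```

Only the P-by-P sums travel between workers, so the amount of data exchanged does not grow with the number of rows. In a real job the mapper and reducer would be expressed in the chosen framework.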
New Challenges
• Association beyond linear
– Make better use of data: (most) factors are statistically significant in linear models with 1 TB of data
– (Better?) Prediction
• Everything goes to real time
– Build / update model, analytics, data storage in real time
– Faster response to new happenings
– Save engineering overhead
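The remark that most factors become statistically significant at this scale can be illustrated with the standard t statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)). The sketch below holds a tiny correlation fixed and varies the sample size. The correlation value and the row counts are made-up illustrations, not figures from the talk.

```python
import math


def t_statistic(r, n):
    """t statistic for testing that a Pearson correlation r is zero, given n rows."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


if __name__ == "__main__":
    r = 0.001  # hypothetical, practically negligible correlation
    for n in (100, 10**6, 10**10):
        t = t_statistic(r, n)
        verdict = "significant" if abs(t) > 1.96 else "not significant"
        print(f"n={n:>12}  t={t:8.3f}  {verdict} at the 5% level")
```

Because t grows with the square root of n, an effect too small to matter in practice still clears the usual threshold once n is large enough, which is why significance alone stops being a useful filter.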
Real time big data analytics work flow
Real time data input (training + testing data)
Real time analytics front end
Dashboarding / monitoring
Model building / update
Prediction server
Outlier detection and pre-processing
Huge Statistical Challenge
Tree design rather than ring design, enabling parallel construction and update
Where are we?
Axes: timeliness of model build vs. complexity of association
• Offline model building and scheduled updating
– Linear regression / GLM using Mahout etc.
– Random Forest, SVM, Hashing, and beyond
– Mutual information, Brownian distance covariance, Mira score, and density estimation!
• Batch processing and near real time updating
– Batch update to the linear model
– Batch update of random forest, adaptively throw away trees
– ?
• Real time data processing / cleaning and model building
– Linear model built and consumed in real time
– ?
– Real time universal association discovery!
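One way to picture a "linear model built and consumed in real time" is a model that is updated after every observation and can serve predictions at any moment. The sketch below uses plain stochastic gradient descent on squared error. This is a swapped-in, generic technique chosen for illustration, and the talk does not say which update rule is used in practice. The class name, the learning rate and the toy stream are all assumptions.

```python
class OnlineLinearModel:
    """Linear model updated one observation at a time (SGD on squared error)."""

    def __init__(self, n_features, learning_rate=0.1):
        # weights[0] is the intercept; the rest are per-feature coefficients.
        self.weights = [0.0] * (n_features + 1)
        self.learning_rate = learning_rate

    def predict(self, features):
        """Serve a prediction from the current weights."""
        return self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], features)
        )

    def update(self, features, target):
        """Take one gradient step using a single (features, target) pair."""
        error = self.predict(features) - target
        self.weights[0] -= self.learning_rate * error
        for i, x in enumerate(features, start=1):
            self.weights[i] -= self.learning_rate * error * x
        return error


if __name__ == "__main__":
    model = OnlineLinearModel(n_features=1)
    # Hypothetical noiseless stream from y = 2 + 3x, replayed many times.
    stream = [([x / 4], 2 + 3 * x / 4) for x in range(5)]
    for _ in range(2000):
        for features, target in stream:
            model.update(features, target)
    print(model.predict([0.5]))  # close to 3.5
```

Each update costs O(P) and needs no stored history, which is what allows training and prediction to share one process.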
Universal association discovery
• Discover associations between two random vectors
• Regardless of dimension and association form (linear / nonlinear / higher order interaction).
• E.g. Mutual information, Brownian distance covariance, Mira score (1NN edge sum)
Intuition
Hesen Peng, Tianwei Yu. SeMira: Universal Association Discovery and Variable Selection among Continuous Variables using Functions on the Observation Graph
Mira score: another function on the distance graph
• Where d(i) is the distance between observation i and its nearest neighbor.
• O(N^2 P)
• How to adapt to real time analytics?
– Segment data for batch processing
– Keep partial data in memory and change the calculation function
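A minimal sketch of the brute-force computation, assuming the score is the plain sum of each observation's distance to its nearest neighbor (the "1NN edge sum" mentioned earlier in the deck). The function name, the choice of Euclidean distance, and the absence of any normalization or significance test are assumptions. The slide text does not give those details.

```python
import math


def nn_edge_sum(observations):
    """Sum over all observations of the distance to the nearest other observation.

    Every pair of observations is compared, and each comparison touches all
    P coordinates, which gives the O(N^2 P) cost quoted on the slide.
    """
    n = len(observations)
    total = 0.0
    for i in range(n):
        total += min(
            math.dist(observations[i], observations[j])
            for j in range(n)
            if j != i
        )
    return total


if __name__ == "__main__":
    # Three hypothetical 2-D observations; nearest-neighbor distances are 1, 1, 4.
    points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(nn_edge_sum(points))
```

To look at the association between two vectors X and Y, each observation would presumably be the concatenation of its X and Y coordinates, but that construction is likewise an assumption here.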
From O(N^2 P) to O(NP)
A whole distance matrix between observations
Only keep the most up-to-date few in memory and calculate NN distance between observations kept in memory
Yes, loss of power; assuming association is independent of sequence of observation
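The idea of keeping only the most recent observations can be sketched with a fixed-size buffer. The version below is one possible reading of the slide: each arriving observation is compared only with the W observations currently buffered, so the per-observation cost is O(WP), and the total is O(NP) when W is a fixed constant. The class name, the window size and the exact update rule are assumptions, not the author's implementation.

```python
import math
from collections import deque


class StreamingNNSum:
    """Running sum of nearest-neighbor distances over a sliding window."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest observation when full.
        self.buffer = deque(maxlen=window)
        self.total = 0.0
        self.count = 0

    def update(self, observation):
        """Add one observation; compare it only with those still in memory."""
        if self.buffer:
            self.total += min(
                math.dist(observation, old) for old in self.buffer
            )
            self.count += 1
        self.buffer.append(tuple(observation))
        return self.total


if __name__ == "__main__":
    tracker = StreamingNNSum(window=2)
    for obs in [(0.0,), (1.0,), (5.0,), (6.0,)]:
        tracker.update(obs)
    print(tracker.total, tracker.count)
```

A true nearest neighbor that has already left the buffer is missed, which is the loss of power the slide concedes, and the result is only comparable to the full computation if the association does not depend on arrival order.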
We are still at Day 1
• Mira score: only capable of detecting association between continuous variables
– SeMira: variable selection
– No prediction yet
• Functions on the distance graph are a gold mine.
• Real time analytics = $$$
– Fraud detection
– Clustering
– Recommendation systems
Join Us!
• Ask Hesen for referral:
hesepeng@amazon.com
• http://www.amazon.com/gp/jobs
• Jobs of all levels:
– Research Scientist
– Business Intelligence Engineer
– Software Development Engineers
– Machine Learning scientist
– Manager in Machine Learning
