H2O.ai
Machine Intelligence
The joy of Clean Data!
San Francisco Big Data Science Meetup
15 Dec 2015
Matt Dowle
Overview
● For beginners
● Examples from my background
● Tools along the way
● Live demo of “80% munging”
● How H2O fits in
● Q & A
1996 – Lehman Brothers
● Just graduated – Applied Maths & Computing
● Dividend claims
● Cleaning at source; e.g. data entry typos
● Estimate cash flows, alerts etc
● Nothing fancy
● Tools: VB & Sybase
● How I accidentally created messy data
1999 – Salomon Brothers
● Equity risk model
– Multiple time series regression (10 year)
– DEM proxy for EUR prior to 1 Jan 1999
– IPOs get their sector's median; e.g. France Telecom
– Abbey National X0004455
– 90% of the lines of code were not the regression
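As a rough illustration of the sector-median rule for IPOs mentioned above, here is a minimal Python sketch. The function name, the dictionary layout and the idea of applying the rule to a single value per stock are assumptions made for illustration; this is not the original risk model's code.

```python
from statistics import median


def fill_with_sector_median(values, sectors):
    """Replace missing (None) values with the median of the stock's sector.

    values  : dict mapping stock -> number, or None when there is no history
    sectors : dict mapping stock -> sector name
    Both layouts are hypothetical, chosen only to keep the sketch short.
    """
    # Collect the known values per sector.
    known = {}
    for stock, value in values.items():
        if value is not None:
            known.setdefault(sectors[stock], []).append(value)

    # One median per sector that has at least one known value.
    medians = {sector: median(vals) for sector, vals in known.items()}

    # Stocks with no value inherit their sector's median (None if the
    # sector has no known values at all).
    return {
        stock: value if value is not None else medians.get(sectors[stock])
        for stock, value in values.items()
    }
```

A newly listed stock with no history would be passed in with `None` and come back with its sector's median, which is the kind of cleaning code that made up most of the model.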
2002 – Citigroup
● Pairs Trading
● 200 most liquid stocks
● 200 x 199 / 2 = 19,900 pairs
● Stock splits, id changes
● Dickey Fuller test for stationarity
● Bollinger bands => buy/sell signal
● Excel spreadsheet to clients with embedded S-PLUS plot, daily, 50 custom variants
● Rebalance => orphan & surrogate pairs
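The pair count and the band-based signal on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the default window of 20 and width of 2 standard deviations, and the mapping of "above the upper band" to "sell" are all hypothetical choices, not the production logic described in the talk.

```python
from itertools import combinations
from math import comb
from statistics import mean, pstdev


def all_pairs(tickers):
    """Every unordered pair of tickers: n * (n - 1) / 2 of them."""
    return list(combinations(tickers, 2))


def bollinger_signal(spread, window=20, k=2.0):
    """Classify the latest value of a pair's spread against Bollinger bands.

    The bands are the rolling mean plus/minus k population standard
    deviations over the last `window` observations.
    """
    recent = spread[-window:]
    mid = mean(recent)
    band = k * pstdev(recent)
    last = spread[-1]
    if last > mid + band:
        return "sell"   # spread unusually wide (assumed convention)
    if last < mid - band:
        return "buy"    # spread unusually narrow (assumed convention)
    return "hold"


# 200 stocks give 200 * 199 / 2 pairs.
assert comb(200, 2) == 19_900
```

Every one of those 19,900 pairs needs its own clean, split-adjusted price history before a signal like this means anything.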
2004 – moved to fund management
● Bigger data; e.g. 25TB
Over cleaning
1. I queried for intra-day auctions
select from quote where bid>ask
No results; i.e. every quote had bid<=ask.
Asked data provider
Grrrrr
2. Negative prices can be correct
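The bid>ask query in item 1 can be sketched in Python. The record layout and the function name below are assumptions for illustration; the original query ran against a tick database, not a Python list.

```python
def crossed_quotes(quotes):
    """Return the quotes whose bid is strictly above the ask.

    `quotes` is an iterable of dicts with "bid" and "ask" keys
    (a hypothetical layout). Crossed quotes are what one would expect
    to see during an auction, so an empty result on a long history is
    itself a warning sign that someone upstream removed them.
    """
    return [q for q in quotes if q["bid"] > q["ask"]]


def looks_over_cleaned(quotes):
    """True when not a single crossed quote survives in the data."""
    return len(crossed_quotes(quotes)) == 0
```

The point of the slide is the second function: a check that returns nothing can mean the data is clean, or that the interesting rows were deleted before delivery.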
Tools
KDB http://kx.com/
@kxsystems
OneTick https://www.onetick.com/
@OneMarketData
https://www.youtube.com/watch?v=QxpOKbv-KQU
alias
awk
aws
bc
bigmler
body
cat
cols
csvcut
csvgrep
csvjoin
csvlook
csvsort
csvsql
csvstack
csvstat
curl
cut
dseq
find
for
grep
head
header
in2csv
jq
json2csv
less
parallel
paste
pbc
python, R and r
Rio
Rio-scatter
run_experiment
sample
scrape
sed
seq
shuf
sort
split
sql2csv
tail
tapkee
tee
tr
tree
uniq
wc
weka
xml2json
● Can be faster than loading the whole file into R or Python
● Can be a faster workflow
● Pre-processing before loading into R or Python
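The same pre-processing idea, filtering a file row by row so the whole thing never has to sit in memory, can be sketched with Python's standard `csv` module. The function name and the column and value used to filter are assumptions for illustration; the command-line tools listed above do the equivalent job from the shell.

```python
import csv


def filter_csv(src, dst, column, wanted):
    """Copy only the rows where `column` equals `wanted`.

    `src` and `dst` are open text file objects. Rows are streamed one
    at a time, so memory use does not grow with the size of the input.
    Returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == wanted:
            writer.writerow(row)
            kept += 1
    return kept
```

Called with two files opened by `open(..., newline="")`, it produces a smaller CSV that can then be loaded into R or Python for the actual analysis.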
tidyr by Hadley Wickham
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
● Untidy data is defined as:
– Column headers are values, not variable names.
– Multiple variables are stored in one column.
– Variables are stored in both rows and columns.
– Multiple types of observational units stored in the same table.
– A single observational unit is stored in multiple tables.
● Solved by: gathering, separating and spreading
● That's the shape of the data. Yes, good, but not
the kind of messy data I'm talking about in this
presentation.
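To make the "gathering" step concrete, here is a small sketch of the reshaping it performs, written in plain Python rather than with tidyr. The function name `gather` and its arguments are assumptions modelled loosely on the idea, not tidyr's actual interface.

```python
def gather(rows, id_cols, key, value):
    """Turn wide rows into long rows.

    rows    : list of dicts, one per wide row
    id_cols : columns to keep as identifiers on every output row
    key     : name of the new column holding the old column headers
    value   : name of the new column holding the old cell values
    """
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_cols}
        for col, cell in row.items():
            if col not in id_cols:
                # Each non-identifier column becomes one output row.
                long_rows.append({**ids, key: col, value: cell})
    return long_rows
```

For a wide row whose column headers are years, `gather(rows, ["id"], "year", "n")` yields one row per year. That fixes the first kind of untidiness in the list, but, as the slide says, it only changes the shape of the data and not whether the values in it are right.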
To illustrate
● In June 2013, RStudio made available
download logs from their CRAN mirror
http://blog.rstudio.org/2013/06/10/rstudio-cran-mirror/
● R-Bloggers search “CRAN download stats”
154 results; e.g.
http://www.r-bloggers.com/finally-tracking-cran-packages-downloads/
https://github.com/metacran/cranlogs
http://www.r-bloggers.com/working-with-the-rstudio-cran-logs/
http://www.r-bloggers.com/cran-download-statistics-of-any-packages-rstats/
http://www.r-bloggers.com/my-r-packages-worldmap-of-downloads/
March
2015
April
2015
● Comparing the top downloaded packages with the
most discussed packages shows little correlations
between them. For Instance, ggplot2 has the
most questions asked & is the second highest
downloaded package but data.table package
(the second highest ranked R package for
questions asked) is not even in the top 100
packages downloaded. Knitr is another
example which is in the top 5 questions asked,
but is 27th ranked in downloaded packages.
● So- does the R community need to focus on
packages that have the highest questions to
resolve their issues rather than the ones
with the most downloads?
Let's look at the data!
Live demo of munging
Observations and comments on
meetup video recording
https://youtu.be/4VWQEvYIfV8
( ~ 22 mins in )
“Big data”
1. Data > 240GB
needle-in-haystack e.g. fraud
2. Data < 240GB
compute intensive, parallel across 100s of cores
3. Data < 240GB
feature engineering > 240GB
Speed for i) production and ii) interaction
NB: 240GB is currently the largest available on EC2
http://yourdatafitsinram.com/
Dell PowerEdge R920 60 core
( 4 * Intel® Xeon® E7-8880L 2.2GHz, 37.5M Cache, 15 Core )
with 1.5TB RAM $60k ( 96 * 16GB )
with 6TB $150k-$200k? ( 96 * 64GB )
But, still “only” 60 cores
In the office here we already have 2.5TB RAM
and 320 cores on 10 machines.
So do many businesses.
● data.table's radix join
● Now parallel and distributed
● e.g. high cardinality 1bn/1bn/1bn row join
– data.table: 10 min
– H2O, 1 node, 32 cores: 3.5 min
– H2O, 4 nodes, 128 cores: 1.5 min => demo
– H2O, 10 nodes, 320 cores: 2.0 min
● Known improvements to be made
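For readers new to the idea, the general shape of a radix-ordered join can be sketched in a few lines of Python: order both sides by key with a radix sort, then merge the two ordered lists. This is a single-threaded toy under stated assumptions (non-negative integer keys below 2**32, rows as `(key, value)` tuples, hypothetical function names); it is not the data.table or H2O implementation and says nothing about how either parallelises the work.

```python
def radix_sort(rows, key_bits=32, radix_bits=8):
    """Stable least-significant-digit radix sort of (key, value) tuples."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(mask + 1)]
        for row in rows:
            # Bucket by the current group of bits of the key.
            buckets[(row[0] >> shift) & mask].append(row)
        rows = [row for bucket in buckets for row in bucket]
    return rows


def merge_join(left, right):
    """Inner join of two lists of (key, value) tuples already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for a in range(i, i_end):
                for b in range(j, j_end):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i_end, j_end
    return out


def radix_join(left, right):
    """Sort both inputs by key with radix_sort, then merge-join them."""
    return merge_join(radix_sort(left), radix_sort(right))
```

Because the sort never compares keys against each other, the work can be split by bucket, which is one reason this family of algorithms lends itself to many cores and many nodes.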
https://www.youtube.com/watch?v=8VpzNibOme0
Thank you.
Q & A
