H2O – The Open Source Math Engine
Big Data Science
with H2O in R
4/23/13
H2O –
Open Source Math
& Machine Learning
for Big Data
Anqi Fu, August 2013
Universe is sparse. Life is messy.
Data is sparse & messy.
- Lao Tzu
Introduction to Big Data
• There are about as many bits of information in our digital
universe as there are stars in our actual universe.
• Decoding the human genome took 10 years.
It can now be done in a week.
• Big data means more than “lots of data”
H2O – The Open Source Math Engine
Better
Predictions
Same Interface
Installation
1. Install and run H2O
• Command line: java -Xmx2g -jar h2o.jar
• Pull up http://localhost:54321 in browser
2. Install the R package
• install.packages(c("RCurl", "rjson", "bitops"))
• install.packages("Path/To/Package/h2o_1.2.3.tar.gz", repos = NULL, type = "source")
Replace Path/To/Package with the actual location of the package file!
3. In R console, type library(h2o)
• demo(package = "h2o")
• demo(h2o.glm)
Always have H2O running first!
Basic R Script
1. Tell R where H2O is running:
localH2O = new("H2OClient", ip="127.0.0.1", port=54321)
2. Check connection:
h2o.checkClient(localH2O)
3. Pass H2OClient as parameter to import:
h2o.importFile(localH2O, path="Path/To/Data", …)
Overview of Objects
• H2OClient: ip=character, port=numeric
• H2OParsedData: h2o=H2OClient, key=character
• H2OGLMModel: key=character, data=H2OParsedData, model=list(coefficients, deviance, aic, etc.)
Example: myModel@model$coefficients
H2O
key="prostate.hex"
key="airlines.hex"
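The object layout above can be tried out with plain S4 classes to see how slot access works. What follows is a minimal sketch, not the package source: the class names DemoClient, DemoParsedData and DemoGLMModel, the model key and all numbers are made up for illustration.

```r
library(methods)

# Hypothetical stand-ins for H2OClient, H2OParsedData and H2OGLMModel.
setClass("DemoClient", representation(ip = "character", port = "numeric"))
setClass("DemoParsedData", representation(h2o = "DemoClient", key = "character"))
setClass("DemoGLMModel",
         representation(key = "character", data = "DemoParsedData", model = "list"))

client <- new("DemoClient", ip = "127.0.0.1", port = 54321)

# The data object stores only a key, so it behaves like a pointer to server-side data.
prostate <- new("DemoParsedData", h2o = client, key = "prostate.hex")

# Made-up numbers, only to show the access pattern.
fit <- new("DemoGLMModel", key = "demo_glm_model", data = prostate,
           model = list(coefficients = c(Intercept = -1.1, PSA = 0.05), aic = 480))

fit@model$coefficients  # slots are read with @, list elements with $
fit@data@key            # follow the chain back to the data key
```

Slots are typed, so passing a number where a character key is expected fails when the object is created rather than later in a remote call.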
Overview of Methods
Standard R                    H2O
read.csv, read.table, etc.    h2o.importFile, h2o.importURL
summary                       summary (limited to data only)
glm, glmnet                   h2o.glm(y, x, data, family, nfolds, alpha, lambda)
kmeans                        h2o.kmeans(data, centers, cols, iter.max)
randomForest, cforest         h2o.randomForest(y, x_ignore, data, ntree, depth, classwt)
Demo 1: Basic GLM in H2O through R
Demo 1: Prostate Cancer Data
• Prostate cancer data set from Ohio State University Comprehensive Cancer Center
• N = 380 patients, ages 43 to 79
• Goal: Predict presence of tumor from the patient's baseline exam (age, race, PSA, total Gleason score, etc.)
Prostate Cancer
Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Prostate Cancer: Logistic Regression Fit
Family: Binomial, Link: Logit
Data: y = CAPSULE (0 = no tumor, 1 = tumor); x = PSA (prostate-specific antigen)
Goal: Estimate the probability that CAPSULE = 1
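For readers without the data set at hand, the same kind of fit can be reproduced in base R. The sketch below uses simulated values, not the prostate study data, and the coefficients used to generate them are arbitrary assumptions. It shows that the fitted probability is the inverse logit of the linear predictor.

```r
set.seed(1)

# Simulated stand-in data; these are NOT the prostate study values.
n <- 200
psa <- rexp(n, rate = 1 / 15)
true_prob <- plogis(-1 + 0.05 * psa)
capsule <- rbinom(n, size = 1, prob = true_prob)

# Binomial family with logit link, as on the slide.
fit <- glm(capsule ~ psa, family = binomial(link = "logit"))

# Estimated probability that capsule = 1 at PSA = 20, computed two equivalent ways.
b <- coef(fit)
p_manual <- plogis(b[["(Intercept)"]] + b[["psa"]] * 20)
p_predict <- predict(fit, newdata = data.frame(psa = 20), type = "response")
```

Both p_manual and p_predict hold the same estimate; predict() simply applies the inverse link for you.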
GLM Parameters
• y = response variable
• x = predictor variables (vector)
• family = binomial (default link = logit)
• data = H2OParsedData object
• nfolds = number of cross-validation folds
• lambda = weight on the penalty term
• alpha = elastic net mixing parameter
• alpha = 0 is ridge penalty (L2 norm)
• alpha = 1 is lasso penalty (L1 norm)
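To make the roles of lambda and alpha concrete, here is a small sketch of the penalty term. It assumes the glmnet-style convention lambda * (alpha * sum(|beta|) + (1 - alpha) / 2 * sum(beta^2)); the slides do not spell out the formula, and the function name elastic_net_penalty is made up for this example.

```r
# Elastic net penalty under the glmnet-style convention (an assumption here).
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))  # lasso part
  l2 <- sum(beta^2)     # ridge part
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

beta <- c(0.5, -2, 0)
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # L2 term only
lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # L1 term only
mixed <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # blend of both
```

Setting lambda to zero removes the penalty entirely, whatever the value of alpha.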
Under the Hood: Hacking R for H2O
Under the Hood
[Diagram labels: REST API; Data (JSON); Import; Parse; H2O; Data Scientist, Analyst, etc.]
GLM Code Snippet
• Create an object to represent model (class name, then its slots)
setClass("H2OGLMModel", representation(key="character", data="H2OParsedData", model="list"))
• Declare new method for algorithm (parameters, some with initial values)
setGeneric("h2o.glm", function(x, y, data, family, nfolds = 10, alpha = 0.5, lambda = 1.0e-5) { standardGeneric("h2o.glm") })
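The declare-then-implement pattern can be exercised without an H2O server. The sketch below is an illustration only: the names DemoData and demo.fit are hypothetical, and the method just echoes its arguments instead of fitting anything. It shows dispatch on argument classes and how initial values apply when an argument is omitted.

```r
library(methods)

# Hypothetical data handle with a single key slot.
setClass("DemoData", representation(key = "character"))

# Generic with initial values, mirroring the shape of the h2o.glm declaration.
setGeneric("demo.fit", function(x, y, data, nfolds = 10, alpha = 0.5) {
  standardGeneric("demo.fit")
})

# The method repeats the defaults so a call that omits nfolds or alpha still sees values.
setMethod("demo.fit",
          signature(x = "character", y = "character", data = "DemoData"),
          function(x, y, data, nfolds = 10, alpha = 0.5) {
            list(key = data@key, y = y, x = paste(x, collapse = ","),
                 nfolds = nfolds, alpha = alpha)
          })

res <- demo.fit(x = c("AGE", "PSA"), y = "CAPSULE",
                data = new("DemoData", key = "prostate.hex"))
```

Calling demo.fit with a numeric x fails at dispatch, because no method is registered for that signature.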
GLM Code Snippet
setMethod("h2o.glm", signature(x="character", y="character", data="H2OParsedData", …), function(x, y, data, …) {
• Send parameters to GLM.json page → GLM job started
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_GLM, key = data@key, y = y, x = paste(x, sep="", collapse=","), …)
• Keep polling and wait until job completed
while(h2o.__poll(data@h2o, res$response$redirect_request_args$job) != -1) { Sys.sleep(1) }
• Query Inspect.json page with GLM model key to get results
res = h2o.__remoteSend(data@h2o, h2o.__PAGE_INSPECT, key=res$destination_key)
http://cran.r-project.org/doc/contrib/Genolini-S4tutorialV0-5en.pdf
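The submit, poll, inspect sequence can be traced with a mock in place of the REST calls. Everything in this sketch is hypothetical: make_mock_server and its submit, poll and inspect functions only imitate the flow and are not part of the h2o package. The one detail taken from the snippet is that a poll result of -1 means the job has finished.

```r
# Hypothetical mock of a remote job: reports progress, then -1 once the job is done.
make_mock_server <- function(polls_needed = 3) {
  polls <- 0
  list(
    submit = function(params) list(job = "job_1", destination_key = "model_1"),
    poll = function(job) {
      polls <<- polls + 1
      if (polls >= polls_needed) -1 else polls / polls_needed
    },
    inspect = function(key) list(key = key, coefficients = c(PSA = 0.05)),
    poll_count = function() polls
  )
}

server <- make_mock_server()

# 1. Submit: predictor names are joined into one comma-separated string.
x <- c("AGE", "PSA", "GLEASON")
job <- server$submit(list(y = "CAPSULE", x = paste(x, collapse = ",")))

# 2. Poll until the mock signals completion, sleeping between checks.
while (server$poll(job$job) != -1) {
  Sys.sleep(0.01)
}

# 3. Inspect: fetch the finished model by its destination key.
model <- server$inspect(job$destination_key)
```

A production loop would usually also cap the number of polls or the total wait time so that a stalled job cannot block the R session forever.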
Demo 2: Data Munging and Remote H2O
Demo 2: Airlines Data
• Airlines data set 1987-2013 from RITA (25%)
• Goal: Predict whether a flight's arrival will be delayed
• Examine slices of data directly
head(airlines.hex, n = 10); tail(airlines.hex)
summary(airlines.hex$DepTime)
• Take a subset of data to play with in R
airlines.small = as.data.frame(airlines.hex[1:1000,])
glm(IsArrDelayed ~ Dest + Origin, family = binomial, data = airlines.small)
http://www.transtats.bts.gov/Fields.asp?Table_ID=236
Connecting to H2O Remotely
• Your slip of paper contains IP/port of your assigned cluster
• Point R to remote H2O client
remoteH2O = new("H2OClient", ip = "192.168.1.161", port = 54321)
• All data operations occur on cluster
h2o.importFile(remoteH2O, path = "Path/On/Remote/Server/To/Data", …)
• Objects/methods operate just like before!
Roadmap
• Long-term Goal: Full H2O/R Integration
• Subset col by name/index: df[,c(1,2)]; df[,"name"]
• Add/Remove cols: df[,-c(1,2)]; df[,3] = df[,2] + 1
• Filter rows: df[df$cName < 5,]
• Combine data frames by row/col: rbind, cbind
• Apply functions: tapply, sapply, lapply
• Support for R libraries (plyr, ggplot2, etc.)
• More Algorithms: GBM, PCA, Neural Networks
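For reference, this is what the listed idioms do today on an ordinary in-memory data.frame, which is the behavior the integration aims to match. The sketch uses base R only and a made-up six-row data frame; it does not touch H2O.

```r
# Small made-up data frame to exercise the idioms from the roadmap.
df <- data.frame(a = 1:6, b = seq(2, 12, by = 2), c = 0)

first_two <- df[, c(1, 2)]            # subset columns by index
a_col <- df[, "a"]                    # subset a column by name (returns a vector)
without_first_two <- df[, -c(1, 2)]   # remove columns 1 and 2
df[, 3] <- df[, 2] + 1                # assign a column from another column

small <- df[df$a < 5, ]               # filter rows by a condition
stacked <- rbind(small, small)        # combine by row
widened <- cbind(small, flag = small$b > 4)  # combine by column

col_means <- sapply(df, mean)                     # apply a function per column
group_sums <- tapply(df$b, df$a %% 2 == 0, sum)   # apply a function per group
```

Here group_sums has two entries, named "FALSE" and "TRUE", holding the sum of b for odd and even values of a.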
Questions and
Suggestions?



Editor's Notes

  • #7: http://docs.0xdata.com/quickstart/quickstart_R.html Packages → Install package(s) → Select CRAN mirror (US CA1) → Search for RCurl, rjson and bitops
  • #9: Pull up R and demo this in the console, making sure everyone can follow along
  • #10: H2OParsedData: Each data set/calculation is associated with a unique hex key; the object acts like a "pointer". Model: coefficients, deviance, aic, df.residual, etc.
  • #17: As the penalty factor increases, lasso gives sparser results (more coefficients exactly zero), while ridge shrinks all coefficients (though not necessarily to zero)