2015 Data Science Summit @ dato
• 4:00am Got up and drove to LAX
• 7:30am Arrived at SFO
• 8:20am ~ 6:30pm Data Science Summit
• 9:00pm ~ 12:10am Flight delayed in SFO (-___-|||)
• 1:30am Arrived LAX (still crowded!)
A 22-Hour Trip
What are people talking about?
• Platform & system
• Machine learning
• Graph
• Visualization
• Scale
• Spark
• …
Session Topics
Dato Data science stack
https://dato.com/products/
• SArray SFrame and SGraph: Scalable External Memory Data Frame and Graph Structures for Machine Learning
• GraphLab Create™ Translator https://dato.com/learn/translator/ GraphLab Create (ver. 1.0) vs Pandas (ver. 0.15.0) vs R (ver. 3.1.1)
Dato SArray SFrame And SGraph
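To make "external memory" concrete, here is a minimal sketch in plain Python. It is not the SFrame API: `column_mean`, the file name, and the sample rows are all made up for illustration. The point is only that a column can be aggregated while streaming rows from disk, so the whole table never has to fit in RAM.

```python
import csv
import os
import tempfile


def column_mean(path, column):
    """Stream a CSV from disk and average one column without loading the file."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # one row in memory at a time
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")


# Hypothetical sample data written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "rating"])
        writer.writerows([["u1", 4], ["u2", 5], ["u3", 3]])
    mean_rating = column_mean(path, "rating")
```

A real out-of-core data frame adds columnar storage, compression, and lazy evaluation on top of this idea.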
BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
• IBM System ML, the machine learning platform developed at the company's facility in Almaden, is to be contributed to Apache Spark.
IBM System ML
http://www.theinquirer.net/inquirer/news/2413132/ibm-donates-machine-learning-tech-to-apache-spark-open-source-community
• Apache Flink is an open source platform for scalable batch and stream data processing.
• Focused on large-scale data analytics
• Unified real-time stream and batch processing
• Expressive and rich APIs in Java / Scala (+Python)
• Robust and fast execution backend
• Similar systems: Storm? Spark? …
Apache Flink
https://flink.apache.org/index.html
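A rough way to picture "unified stream and batch processing" is to write the computation once over an iterator and treat a batch as a stream that happens to end. The sketch below is plain Python, not the Flink API; `running_word_count` and `line_stream` are hypothetical names.

```python
from collections import Counter


def running_word_count(lines):
    """Yield a snapshot of the word counts after each incoming line."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        yield dict(counts)


# Batch: a bounded collection; only the last snapshot matters.
batch = ["spark flink", "flink storm"]
final_counts = list(running_word_count(batch))[-1]


# Stream: the same function consumes records as they arrive.
def line_stream():
    yield "spark flink"
    yield "flink storm"


first_snapshot = next(running_word_count(line_stream()))
```

The same operator code serves both cases; only the source (bounded list vs. open-ended generator) differs.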
Python Data Science stack
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote
Ibis on Impala: Python at Scale for Data Science
http://blog.cloudera.com/blog/2015/07/ibis-on-impala-python-at-scale-for-data-science/
http://www.ibis-project.org/
• PredictionIO is an open source machine learning framework for developers and data scientists. It supports event collection, deployment of algorithms, evaluation, and querying predictive results via REST APIs.
• Built on Apache Spark, HBase and Spray.
PredictionIO
https://github.com/PredictionIO/PredictionIO/
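As an illustration of event collection over REST, the sketch below only builds a request and does not send it. The field names and the `/events.json?accessKey=...` endpoint follow my reading of the PredictionIO Event Server docs; treat them, the localhost URL, the access key, and the helper name `build_event_request` as assumptions to check against the version you run.

```python
import json
import urllib.request


def build_event_request(base_url, access_key, user_id, item_id, rating):
    """Build (but do not send) a POST request describing a 'rate' event."""
    payload = {
        "event": "rate",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
    }
    return urllib.request.Request(
        f"{base_url}/events.json?accessKey={access_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and key, for illustration only.
request = build_event_request("http://localhost:7070", "MY_ACCESS_KEY", "u1", "i42", 5)
event = json.loads(request.data)
```

Passing `request` to `urllib.request.urlopen` would submit the event to a running event server.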
• Splash is a general framework for parallelizing stochastic learning algorithms on multi-node clusters.
• Splash is built on Scala and Apache Spark
Splash
http://zhangyuc.github.io/splash/
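To show what "parallelizing a stochastic learning algorithm" means, here is a toy sketch that shards the data, runs SGD independently on each shard, and averages the results. Plain parameter averaging is a stand-in chosen for brevity; it is not Splash's actual combination scheme. The workers are simulated sequentially, and all names and data are invented.

```python
import random


def sgd_on_shard(shard, w0=0.0, lr=0.05, epochs=20, seed=0):
    """Fit y ~ w * x on one shard with plain stochastic gradient descent."""
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        samples = shard[:]
        rng.shuffle(samples)
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # derivative of the squared error
            w -= lr * grad
    return w


def parallel_sgd_by_averaging(data, n_workers=4):
    """Simulate n_workers independent SGD runs and average their weights."""
    shards = [data[i::n_workers] for i in range(n_workers)]
    local_weights = [sgd_on_shard(s, seed=i) for i, s in enumerate(shards)]
    return sum(local_weights) / len(local_weights)


# Noise-free synthetic data generated from y = 2x.
data = [(x / 10, 2 * x / 10) for x in range(1, 21)]
w = parallel_sgd_by_averaging(data)
```

On a real cluster each shard would live on a different node, and the interesting part is how the local updates are combined.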
• Discriminative, low-dimensional, large-scale featurization – the easy way
• A new way to process categorical features, one that scales to large data.
Learning with Counts
Learning with Counts (Cont.)
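As I understand the idea, count-based featurization replaces each categorical value with a few numbers: how often it appeared with each label, plus a smoothed log-odds. The sketch below is my own minimal version under that assumption, not code from the session; the function names, the smoothing prior, and the toy click data are invented.

```python
import math
from collections import defaultdict


def fit_count_table(categories, labels):
    """Count negative and positive labels seen for each category value."""
    table = defaultdict(lambda: [0, 0])  # value -> [n_negative, n_positive]
    for value, label in zip(categories, labels):
        table[value][label] += 1
    return dict(table)


def count_features(value, table, prior=1.0):
    """Map one category value to (n_negative, n_positive, smoothed log-odds)."""
    n_neg, n_pos = table.get(value, (0, 0))
    log_odds = math.log((n_pos + prior) / (n_neg + prior))
    return n_neg, n_pos, log_odds


ads = ["a", "a", "b", "a", "c", "b"]
clicks = [1, 0, 0, 1, 1, 0]
table = fit_count_table(ads, clicks)
features = [count_features(value, table) for value in ads]
```

However many distinct values the column has, each row ends up with three numeric features, which is what keeps the representation low-dimensional. In practice the counts should be computed on data separate from the rows being featurized, to avoid label leakage.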
• Three small wishes of data science.
• Is more data always better?
• Are more variables always better?
• Can’t we keep at least some of our training performance?
Statistics in the age of data science, issues you can not ignore
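The last wish can be illustrated with a toy experiment, which is my own and not taken from the talk: a 1-nearest-neighbour model on pure-noise variables memorizes its training set perfectly, but none of that performance carries over to new data. All names and sizes are arbitrary.

```python
import random


def nearest_label(point, train_x, train_y):
    """Return the label of the training point closest to `point`."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_x[i])),
    )
    return train_y[best]


def accuracy(xs, ys, train_x, train_y):
    hits = sum(nearest_label(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)


def noise_data(n_rows, n_vars):
    """Random variables and random labels: there is nothing to learn."""
    xs = [tuple(rng.random() for _ in range(n_vars)) for _ in range(n_rows)]
    ys = [rng.randint(0, 1) for _ in range(n_rows)]
    return xs, ys


train_x, train_y = noise_data(100, 10)
test_x, test_y = noise_data(100, 10)
train_acc = accuracy(train_x, train_y, train_x, train_y)  # memorized
test_acc = accuracy(test_x, test_y, train_x, train_y)     # roughly a coin flip
```

Training accuracy alone says nothing here, which is why held-out evaluation matters as variables are added.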
Other Interesting Sessions (Didn’t attend)
• Applying data science to sales pipelines - for fun and profit http://es.slideshare.net/dato-inc/applying-data-science-to-sales-pipelines-for-fun-and-profit-51025224
• What we've learned from over 1MM machine learning models
• DeepLearning4J: Open Source Neural Net Platform
• Scalable on Hadoop, Spark and Akka + AWS et al
• http://deeplearning4j.org/
• Five key learnings from building a large scale anomaly detection system http://www.slideshare.net/dato-inc/five-key-learnings-from-building-a-large-scale-anomaly-detection-system
Reference
• http://conf.dato.com/agenda/
• http://www.slideshare.net/dato-inc?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview
Thanks!


Editor's Notes

  • #11:
    Bokeh http://bokeh.pydata.org/en/latest/
    Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.
    dask https://github.com/ContinuumIO/dask
    Dask enables parallel computing through task scheduling and blocked algorithms.
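To unpack "task scheduling and blocked algorithms", here is a toy sketch. It borrows the idea of describing work as a dictionary of tasks, but `run_graph` is a hypothetical, single-threaded evaluator written for this note, not Dask's scheduler. The data is cut into blocks, each block is summed by its own task, and a final task combines the partial sums.

```python
def run_graph(graph, key, cache=None):
    """Evaluate `key`, recursively computing any tasks it depends on."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [
            run_graph(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # a literal value, e.g. a block of data
    cache[key] = result
    return result


def add_all(*parts):
    return sum(parts)


data = list(range(10))
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]

graph = {f"block-{i}": block for i, block in enumerate(blocks)}
graph.update({f"sum-{i}": (sum, f"block-{i}") for i in range(len(blocks))})
graph["total"] = (add_all, *[f"sum-{i}" for i in range(len(blocks))])

total = run_graph(graph, "total")
```

Because the per-block tasks are independent, a real scheduler can run them in parallel or stream blocks from disk.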