REPRODUCIBLE RESEARCH AT
SCALE WITH APACHE SPARK
AND ZEPPELIN NOTEBOOK
CAROLYN DUBY
SOLUTIONS ENGINEER, NORTHEAST
HORTONWORKS
@ODSC
OPEN DATA SCIENCE CONFERENCE
Boston | May 3-5th
ABOUT CAROLYN DUBY
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby Github: carolynduby
• Hortonworks
• Innovation through data
• Enterprise ready, 100% open source, modern data platforms
• Engineering, Technical Support, Professional Services, Training
https://www.meetup.com/futureofdata-boston/
AGENDA
• What is Reproducible Research? Why do it?
• What does it take to do Reproducible Research at scale?
• Example with Apache Zeppelin and Spark
REPRODUCIBLE RESEARCH
• Complete details of data analysis methods yielding conclusions
• Replication of research on independently collected data
• Gold standard
BENEFITS
• Individual productivity
• Effective peer review
• Answer questions more quickly
• Correct errors
• Apply methods to other experiments
• Increased quality and respect for results
• Justify business decisions
CHALLENGES
• Large data sets
• Complex analysis
• Data lineage
• Streaming data
• Limited space in publications
HOW TO DO REPRODUCIBLE RESEARCH
• Define Platform
• Record all versions of analysis software and installation procedures
• Analyze Data
• Record all commands to acquire, clean, organize, analyze
• Store intermediate results
• Version control
• Share Methods and Results
• Publish
• Share full details
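The "record all versions" step above can be sketched as a small manifest captured at the start of an analysis. This is a minimal sketch using only the Python standard library; the function name `platform_manifest` and the chosen fields are illustrative assumptions, not something shown in the talk.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def platform_manifest():
    """Collect basic facts about the machine and interpreter running the analysis."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }


manifest = platform_manifest()
# Serialize with sorted keys so two manifests can be compared line by line.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Saving this output next to the results, and putting it under version control with the notebook, gives readers a starting point for rebuilding the platform.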
REPRODUCIBLE RESEARCH WITH APACHE OPEN SOURCE
• Apache Spark version 2.1
• Cleaning and analysis of large data sets
• http://spark.apache.org
• Apache Zeppelin Notebook 0.7.0
• Capture automated commands
• Visualize data for exploration and results
• https://zeppelin.apache.org
APACHE SPARK
• Distributed processing efficiently crunches large data sets
• Optimized
• Horizontally scalable with multi tenancy
• Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
SPARK LIBRARIES
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
• Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
• Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
• Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
• Complex event processing and data ingest
ZEPPELIN
• Notebook
• Combine Markdown, shell, Spark, and SQL commands in the same notebook
• Easily integrate with Spark in different languages
• Visualize data using graphs and pivot charts
• Share notebooks or paragraphs
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ARCHITECTURE
Diagram components: Client Browser, Zeppelin, Spark Driver, Spark Application Master (YARN container), and three Spark Executors (one YARN container each, two tasks each)
GETTING STARTED
• Use a distribution
• Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
• https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
• http://community.hortonworks.com
• On premise
• Use Apache Ambari to manage on premise physical hardware
• Cloud
• Automated provisioning with Cloudbreak (https://github.com/sequenceiq/cloudbreak)
• AWS, Azure, Google Cloud
ZEPPELIN BASICS
• Notes are composed of paragraphs
• Paragraph contains code or markdown
• Specify interpreter on the first line: %<interpreter name>, or leave blank for the default
• Enter commands
• Click play button to run code on cluster
• Results display in paragraph
• Code and results can be shown or hidden
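The interpreter convention in the list above can be sketched in a few lines. This is only an illustration of the `%<interpreter name>` rule: `split_paragraph` is a hypothetical helper, not Zeppelin's own parser, and the default of `spark` is an assumption, since the real default depends on how interpreters are bound to the note.

```python
def split_paragraph(text, default_interpreter="spark"):
    """Return (interpreter, body) for the text of one paragraph.

    Hypothetical helper that mimics the %<name> convention only.
    """
    stripped = text.lstrip()
    if stripped.startswith("%"):
        # The interpreter name is the first token after the percent sign.
        parts = stripped[1:].split(None, 1)
        if parts:
            body = parts[1] if len(parts) > 1 else ""
            return parts[0], body
    # No directive: the paragraph goes to the default interpreter unchanged.
    return default_interpreter, text


print(split_paragraph("%md\n# Crimes in Chicago"))
print(split_paragraph("%sh\nspark-submit --version"))
print(split_paragraph("val x = 1"))
```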
Screenshot callouts:
• Create/open note
• Note tools
• Paragraph tools
• User and note configuration
• Markdown interpreter (%md), editor hidden
• Shell interpreter (%sh), editor shown
MARKDOWN
Screenshot callouts:
• %md
• # headers
• hyperlink
• block quote
• show/hide editor
• run paragraph
• run all paragraphs
EXAMPLE
• Crimes in Chicago Kaggle Dataset
• Interesting opportunities for time series and prediction
https://www.kaggle.com/currie32/crimes-in-chicago
DATA PIPELINE
Diagram: pipeline stages Acquire (from Kaggle), Clean, Explore, and Analyze, around a Common Store holding the raw CSV zip and the clean ORC data
OPTIMIZING DATA CLEANING
• Keep a raw copy
• Web sites go away, remove data, change links and interfaces
• Store the clean data
• Saves time each time you analyze
• Use a standard format (Optimized Row Columnar (ORC), Parquet, etc.)
• Query data with Hive
• Shared location if security and privacy requirements allow
• Collaborate by sharing data with others
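One way to make the "keep a raw copy" advice checkable is to record a checksum of the raw download, so anyone reusing the shared copy can confirm it is the same file. The sketch below uses only the Python standard library; the file name and the sample rows are placeholders invented for the example, not the real dataset.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large raw archive never loads fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for the downloaded raw file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as workdir:
    raw_path = os.path.join(workdir, "raw_download.csv")
    with open(raw_path, "wb") as handle:
        handle.write(b"ID,Date,Primary Type\n1,01/01/2017,THEFT\n")
    raw_checksum = sha256_of_file(raw_path)

print(raw_checksum)
```

Recording the checksum in the notebook that acquires the data ties later cleaning and analysis steps to one specific raw file.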
ACQUIRE DATASET
DOCUMENTING PLATFORM PREPARATION
DOCUMENT VERSIONS
• %sh interpreter
• Bash shell
• Show intermediate results for debugging
CLEAN DATASET
Screenshot callout: switching to Spark Scala code
SPARK IS FAST BUT LAZY
• Transformations
• Specify which data to read
• Modify data
• Actions
• Show data
• Write data
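The split between transformations and actions can be illustrated without a cluster. The sketch below is an analogy only: it uses plain Python generators, not Spark, and the names `read_rows` and `double` are invented for the example. Building the chain does no work, much as declaring transformations does none; consuming it plays the role of an action.

```python
log = []


def read_rows(rows):
    """Stand-in for a transformation that specifies which data to read."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def double(rows):
    """Stand-in for a transformation that modifies data."""
    for row in rows:
        log.append(f"double {row}")
        yield row * 2


# Building the chain runs nothing: generators are lazy, like transformations.
plan = double(read_rows([1, 2, 3]))
steps_before_action = len(log)

# Consuming the chain is the "action" that finally triggers the work.
result = list(plan)

print(steps_before_action)  # 0
print(result)               # [2, 4, 6]
```

The practical consequence in a notebook is that a paragraph containing only transformations returns almost immediately, and both the cost and any errors surface in the paragraph that shows or writes the data.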
Screenshot callout: header and case data on the same CSV line
Screenshot callouts:
• Apply numeric types on clean data
• Add some columns to make aggregations easier
• Table for SQL
• Save clean data as ORC
EXPLORE DATASET
Screenshot callouts:
• Read clean data and create table
• Specify query
• Select visualization
• Configure visualization (X and Y)
Screenshot callout: hover to see values
OTHER INTERPRETERS
• Matplotlib with pyspark
• Angular – maps
ANGULAR WITH INPUT
https://community.hortonworks.com/articles/75834/using-angular-within-apache-zeppelin-to-create-cus.html
ANALYZE DATASET
CREATE DATA TO FIT POISSON
FIT POISSON MODEL
Screenshot callout: evaluate model
MODEL PIPELINES
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
Diagram: Transformers and an Estimator make up a Pipeline; fitting the Pipeline on training data yields a Model, which turns test data into Predictions
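The roles in the pipeline diagram can be sketched with a toy example. This is plain Python written for illustration, not Spark's ML API: every class here (`Scaler`, `MeanEstimator`, `MeanModel`, `Pipeline`, `PipelineModel`) is a simplified stand-in, and the "model" just predicts the mean of the training data.

```python
class Scaler:
    """Transformer: maps rows to rows without learning anything."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class MeanModel:
    """Fitted model: itself a transformer that emits predictions."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [self.mean for _ in rows]


class MeanEstimator:
    """Estimator: learns from training data and returns a model."""

    def fit(self, rows):
        return MeanModel(sum(rows) / len(rows))


class PipelineModel:
    """Fitted pipeline: applies every stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chain of transformers ending in one estimator (a simplification)."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, rows):
        for stage in self.transformers:
            rows = stage.transform(rows)
        return PipelineModel(self.transformers + [self.estimator.fit(rows)])


training_data = [1.0, 2.0, 3.0]
test_data = [10.0, 20.0]
pipeline_model = Pipeline([Scaler(2.0)], MeanEstimator()).fit(training_data)
predictions = pipeline_model.transform(test_data)
print(predictions)  # [4.0, 4.0]
```

The point of the pattern is that the same preparation steps run on training and test data, which is what makes a fitted pipeline reusable by someone reproducing the analysis.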
PYTHON – LOOKS LIKE SCALA
R
TIPS AND TRICKS
• Use val for variables used across paragraphs
• Vars can yield unpredictable results when run out of order
• Break up big notebooks
• Store intermediate results
• Avoid reloading and recalculating the same values
• Verify your notebook by running all paragraphs
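The risk with mutable variables shared across paragraphs can be shown with a small simulation. This sketch is plain Python rather than Scala, and the three "paragraph" functions are invented for the example; the shared mutable state plays the role of a var that several paragraphs update.

```python
def load(state):
    state["rows"] = [3, 1, 2]


def sort_in_place(state):
    # Mutates shared state, like reassigning a var in a later paragraph.
    state["rows"].sort()


def first_row(state):
    return state["rows"][0]


def run(paragraphs):
    """Run paragraphs against fresh shared state and return the last result."""
    state = {}
    result = None
    for paragraph in paragraphs:
        result = paragraph(state)
    return result


in_order = run([load, sort_in_place, first_row])
skipped_step = run([load, first_row])
print(in_order, skipped_step)  # 1 3

# Immutable style: each step derives a new named value from the previous one,
# so a later step cannot silently read a half-updated variable.
rows = (3, 1, 2)
sorted_rows = tuple(sorted(rows))
print(sorted_rows[0])  # 1
```

Running all paragraphs from the top, as the last tip suggests, is the check that catches the first kind of problem.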
SHARING NOTEBOOKS
• Share link to notebook or paragraph
• Readers access your Zeppelin server
• Use logins and permissions
• Export to JSON and save to shared file
• Readers get JSON from shared file (github, cloud, etc)
• Import to their Zeppelin server
• Sync your notebooks to Zeppelin Hub (https://www.zeppelinhub.com)
• Share Zeppelin Hub link with readers
• Free version for small teams
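For the export-to-JSON route, a reader can inspect a shared note before importing it. The sketch below assumes the exported file has a top-level `name` and a `paragraphs` list whose entries carry a `text` field; the embedded note is a trimmed, hand-written stand-in with an invented name, not output from the talk's notebooks.

```python
import json

# Trimmed stand-in for an exported note (assumed layout, hypothetical content).
exported_note = r"""
{
  "name": "Example note",
  "paragraphs": [
    {"text": "%md\n# Clean the raw data"},
    {"text": "%sh\nls /tmp"},
    {"text": "val total = 1 + 1"}
  ]
}
"""

note = json.loads(exported_note)
summary = []
for index, paragraph in enumerate(note["paragraphs"], start=1):
    text = paragraph.get("text", "")
    # The first line shows the interpreter directive, if the paragraph has one.
    first_line = text.splitlines()[0] if text else ""
    summary.append((index, first_line))

print(note["name"])
for index, first_line in summary:
    print(index, first_line)
```

Because the export is ordinary JSON, it can also be committed to a shared repository and compared between versions with standard tools.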
REUSING NOTEBOOKS
• Clone notebook
• Copy code from notebook
• Build libraries for use in notebooks
VERSIONING NOTEBOOKS
• Track changes to notebook code or text
• Go back to a previously known good version
• Compare versions to see differences
CONFIGURE ZEPPELIN VERSION CONTROL
Screenshot callout (steps 1 to 5): set storage to GitNotebookRepo in Zeppelin-env.xml
Restart Zeppelin server
SAVE NOTE VERSION
Screenshot callout: enter a version comment and click Commit
VIEWING VERSION HISTORY AND PREVIOUS VERSIONS
Screenshot callouts:
• Pull down the list of versions
• Select a version
• Zeppelin shows content for that version
• Head goes to the latest version
GIT REPO ON ZEPPELIN SERVER
Screenshot callout: Zeppelin creates a git repo in the notebook directory
QUESTIONS AND THANK YOU!
REFERENCES
REPRODUCIBLE RESEARCH
• Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285
• http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
ZEPPELIN AND SPARK
• Spark
• https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa
• https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
• Example Notebooks
• https://github.com/hortonworks-gallery/zeppelin-notebooks
ZEPPELIN INTERPRETERS
• Markdown syntax
• http://daringfireball.net/projects/markdown/syntax
EXAMPLE
• Chicago Crimes Data Set
• https://www.kaggle.com/currie32/crimes-in-chicago
• Example notebooks
• https://github.com/carolynduby/ODSC2017
