@MargrietGr
Margriet Groenendijk
Developer Advocate for IBM Cloud Data Services
SW Cloud meetup
Bristol
24 November 2016
Data Science in the Cloud
About me
• Developer Advocate at IBM Cloud Data Services, UK
  • Data science
  • Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
  • Worked with very large observational datasets and the output of global-scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
  • Explored large observational datasets of carbon uptake by forests
1781
http://visual.ly/exports-and-imports-scotland
1821
https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png
1960s
http://www.computerhistory.org/collections/catalog/102630767
1960s
http://www.climatecentral.org/news/first-climate-model-video-19007
Data Science is a Team Effort
• Data Engineers
• Data Scientists
• Business Analysts
• App Developers
Data
Toolbox
http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
Data Science Workflow
Discover Data
Use Data
Publish Data
Socialize Data
Data Science Workflow
• Define Question
• Find Data
• Explore Data
• Clean Data
• Visualize and Summarize Data
• Create Predictive Models
• Present Results
Collect Data
APIs
Open Data
Maps
Web Scraping
Time Series
Store Data
Object Store - binary files
Relational database
Document store - JSON
Bluemix
https://console.ng.bluemix.net/
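The difference between the relational and document-store options listed above can be sketched with the standard library alone. This is a minimal illustration and not Bluemix-specific: `sqlite3` stands in for a relational database, a JSON string stands in for a document-store record, and the table name, field names and values are hypothetical.

```python
import json
import sqlite3

# Hypothetical weather observation used only for illustration.
observation = {"city": "Bristol", "time": "2016-11-24T12:00:00", "temp_c": 8.5}

# Relational: a fixed schema with one column per field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (city TEXT, time TEXT, temp_c REAL)")
conn.execute(
    "INSERT INTO observations VALUES (:city, :time, :temp_c)", observation
)
row = conn.execute("SELECT city, temp_c FROM observations").fetchone()

# Document store: the record is kept as a self-describing JSON document,
# so nested or optional fields need no schema change.
document = json.dumps({**observation, "tags": ["meetup", "demo"]})
restored = json.loads(document)
```

Adding the `tags` list to the relational version would require a schema change or a second table, which is the usual reason for picking a document store for loosely structured data.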
Explore Data
Explore Data, Clean Data, Store Data
Spark on a Cluster
The Spark Stack
from Karau et al.: Learning Spark
RDDs: Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
  • Load an external dataset
  • Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
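The split between lazy transformations and actions can be shown without a Spark installation. The sketch below is a toy, pure-Python stand-in and not Spark's implementation: `ToyRDD` is a hypothetical class that only borrows the method names `parallelize`, `map`, `filter`, `collect` and `count` from the PySpark API. Transformations only record what should happen; the work runs partition by partition when an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a list of partitions plus pending transformations."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions   # list of lists, one per "machine"
        self.steps = tuple(steps)      # transformations recorded but not yet run

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Distribute a local collection round-robin over the partitions.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    # Transformations: return a new ToyRDD and do no work yet (lazy).
    def map(self, fn):
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.steps + (("filter", pred),))

    def _run(self, partition):
        # Apply every recorded step to one partition.
        for kind, fn in self.steps:
            if kind == "map":
                partition = [fn(x) for x in partition]
            else:
                partition = [x for x in partition if fn(x)]
        return partition

    # Actions: trigger the computation, partition by partition.
    def collect(self):
        return [x for part in self.partitions for x in self._run(part)]

    def count(self):
        return sum(len(self._run(part)) for part in self.partitions)


calls = []

def square(x):
    calls.append(x)        # record when the function actually runs
    return x * x

rdd = ToyRDD.parallelize(range(6), num_partitions=2)
squares = rdd.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)   # still 0: nothing has executed yet
result = sorted(squares.collect())
```

Here `square` is not called until `collect()` runs, which mirrors why a chain of Spark transformations returns immediately while the first action takes the time.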
Run Spark locally in a Python notebook
https://www.continuum.io/downloads
http://spark.apache.org/downloads.html
Create a new kernel to use in a Jupyter notebook
Jupyter Notebooks!
• Server-client application to edit and run notebook documents via a web browser
• Cells with:
  • Code
  • Figures and tables
  • Rich text elements
• Different kernels: Python, R, Scala, Spark
In the Cloud:
http://datascience.ibm.com/
Weather Data
Define Question
What will the weather be next weekend?
https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A
Find Data
https://console.ng.bluemix.net/
Explore Data
Python packages
• requests and json
  • API credentials and latitude/longitude of Bristol
  • JSON data returned
• pandas, numpy and datetime
  • Convert JSON to a pandas DataFrame (table with multiple indices)
  • Add time as index
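The parsing step listed above can be sketched as follows. The example uses only the standard library so that it runs without API credentials: the payload is a hypothetical stand-in for the JSON returned by the weather service (its real field names may differ), and a time-sorted dictionary stands in for the time-indexed pandas DataFrame.

```python
import json
from datetime import datetime

# Hypothetical response body; the real service's field names may differ.
payload = json.loads("""
{"forecasts": [
  {"time": "2016-11-26T12:00:00", "temp": 9, "precip": 0.2},
  {"time": "2016-11-26T00:00:00", "temp": 5, "precip": 0.0},
  {"time": "2016-11-27T12:00:00", "temp": 11, "precip": 1.4}
]}
""")

def index_by_time(forecasts):
    """Return the forecast rows keyed by a parsed datetime, in time order."""
    rows = {}
    for item in forecasts:
        when = datetime.strptime(item["time"], "%Y-%m-%dT%H:%M:%S")
        # Keep every field except the timestamp, which becomes the key.
        rows[when] = {k: v for k, v in item.items() if k != "time"}
    return dict(sorted(rows.items()))

table = index_by_time(payload["forecasts"])
times = list(table)
```

In the notebook itself, `requests` would fetch the payload and pandas would hold the table; parsing the timestamps first is what makes time-based selection and resampling possible later.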
Weather forecast for Bristol
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Visualize Data
Python packages
• pandas - rolling mean
• matplotlib
• Basemap
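A rolling mean smooths a noisy series by averaging each value with its recent neighbours. pandas provides this as `Series.rolling(window).mean()`; the sketch below spells out the same arithmetic in plain Python. The function name `rolling_mean` and the temperature values are hypothetical.

```python
from collections import deque

def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values (full windows only)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)   # holds at most the last `window` values
    means = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            means.append(sum(recent) / window)
    return means

# Hypothetical hourly temperatures in degrees Celsius.
temperatures = [8, 9, 13, 11, 7, 6]
smoothed = rolling_mean(temperatures, window=3)
```

With its default settings pandas keeps the output the same length as the input and marks the first `window - 1` positions as missing, whereas this sketch simply omits them.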
Demo
Weather map
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Python packages
• matplotlib
• Basemap
• itertools
• urllib
Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore
Insights for Twitter
Add sentiment - example
• Watson Tone Analyser
  • Emotion
  • Language style
  • Social propensities
Analyze how you are coming across to others
Workflow
Weather Company Data
crontab -e
0 23 * * * /path/to/file/do_something.sh
python do_something.py
Tweets
Weather
Sentiment
Watson Tone Analyser
Insights for Twitter
Cloudant NoSQL
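The cron entry above runs the shell script every day at 23:00, which in turn starts the Python script. What that script might assemble can be sketched as follows. This is an assumption about the workflow and not the original `do_something.py`: the three service calls are replaced by stubbed inputs, and the function name and all field names except the `_id` document key are hypothetical.

```python
import json
from datetime import datetime, timezone

def build_document(tweet, weather, tone):
    """Combine one tweet with the weather and tone scores into a single record."""
    return {
        "_id": tweet["id"],   # assumption: reuse the tweet id as the document key
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "tweet": {"text": tweet["text"], "location": tweet["location"]},
        "weather": weather,
        "sentiment": tone,
    }

# Stubbed inputs standing in for the three service responses.
tweet = {"id": "t-001", "text": "Lovely crisp morning", "location": "Bristol"}
weather = {"temp_c": 4.0, "conditions": "clear"}
tone = {"joy": 0.81, "sadness": 0.05}

doc = build_document(tweet, weather, tone)
body = json.dumps(doc)   # the JSON body a document store would receive
```

Keeping tweet, weather and sentiment in one document means a later analysis can read a single record per tweet instead of joining three sources.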
PixieDust
https://github.com/ibm-cds-labs/pixiedust
Simpler Workflow
PixieDust: an open source library that simplifies and improves Jupyter Python Notebooks
• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps
https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/
@DTAIEB55
Install Spark packages or plain jars in your notebook's Python kernel without the need to modify a configuration file
Uses the GraphFrame Python APIs
Install GraphFrames Spark Package
One simple API: display()
Call the Options dialog
Panning/Zooming options
Performance statistics
Easily export your data to CSV, JSON, HTML, etc. locally on your laptop or into a cloud-based service like Cloudant or Object Storage
Scala Bridge
Define a Python variable
Use the Python var in Scala
Define a Scala variable
Use the Scala var in Python
Easily extend PixieDust to create your own visualizations using HTML/CSS/JavaScript
Customized Visualization for GraphFrame Graphs
Encapsulate your analytics into compelling User Interfaces better suited for Line of Business Users
IBM Watson Data Platform
• Data Science Experience
• Watson Data Platform
• Machine Learning
• Sign up for beta: http://datascience.ibm.com/features#machinelearning
https://developer.ibm.com/clouddataservices/author/mgroenen/
Thanks!
Slides will be here:
http://www.slideshare.net/MargrietGroenendijk
