IBM Cloud Data Services
data science toolkit 101
set up Python, Spark, & Jupyter
Raj Singh, PhD
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
twitter: @rajrsingh
Agenda
• Installation
• Python
• Spark
• Pixiedust
• Examples
IBM Analytics
Data Science Experience (DSX)
What is Spark?
• In-memory Hadoop
• Hadoop was massively scalable but slow
• “Up to 100x faster” when data fits in memory (about 10x faster when it has to spill to disk)
• What is Hadoop?
• HDFS: fault-tolerant storage using horizontally scalable commodity hardware
• MapReduce: programming style for distributed processing
• Presents data as an object independent of the underlying storage
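The MapReduce programming style mentioned above can be sketched in plain Python, without Hadoop or Spark. This is only an illustration of the three phases (map, shuffle, reduce) on a single machine; the function names and the sample sentences are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases into a word count."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, standing in for lines of a large file.
sample = ["spark is fast", "hadoop is scalable", "spark is in memory"]
counts = word_count(sample)
print(counts["spark"], counts["is"])
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data between them; the structure of the program stays the same.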
Spark abstracted storage
• Scala
• PySpark = (Spark + Python)
• Drivers
• File storage
• Cloudant
• dashDB
• Cassandra
• …
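As a rough sketch of what "abstracted storage" looks like from PySpark, the helper below loads a DataFrame through Spark's generic data source API. It assumes the `sqlContext` object that the Spark 1.6 PySpark shell creates; the helper name `load_dataframe`, the file path, and the Cassandra keyspace and table names are hypothetical, and the Cassandra format string only works if that connector's JAR is on the driver classpath.

```python
def load_dataframe(sql_context, source_format, **options):
    """Load a DataFrame through Spark's generic data source API.

    sql_context   -- the SQLContext created by the PySpark shell (Spark 1.6)
    source_format -- a built-in source ("json", "parquet") or the name
                     registered by a connector JAR on the driver classpath
    options       -- source-specific settings, passed through unchanged
    """
    return sql_context.read.format(source_format).options(**options).load()


# Example calls (hypothetical paths and names, run inside a PySpark notebook):
#   df = load_dataframe(sqlContext, "json", path="/path/to/data.json")
#   df = load_dataframe(sqlContext, "org.apache.spark.sql.cassandra",
#                       keyspace="my_keyspace", table="my_table")
```

Whichever driver is used, the result is the same kind of DataFrame object, so the analysis code that follows does not need to know where the data came from.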
Python installation with miniconda
1. Download Miniconda from https://www.continuum.io/downloads (choose version 2.7)
2. Install Miniconda2 into this location: /Users/<username>/miniconda2
3. bash$ conda install pandas jupyter matplotlib
4. bash$ which python
/Users/<username>/miniconda2/bin/python
https://dzone.com/refcardz/apache-spark
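One way to confirm the installation is to ask Python itself which interpreter is running and whether the packages import. The sketch below is an assumption about how you might check this, not part of the original steps; the helper name `check_packages` is made up, and it assumes each package is importable under the same name used in the conda command.

```python
from __future__ import print_function

import importlib
import sys

# Packages installed in step 3.
REQUIRED = ["pandas", "jupyter", "matplotlib"]


def check_packages(names):
    """Return {package name: True/False} depending on whether it imports."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Should print the miniconda2 path if the right Python is first on PATH.
print("interpreter:", sys.executable)
for name, ok in sorted(check_packages(REQUIRED).items()):
    print(name, "ok" if ok else "MISSING")
```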
Spark installation
• http://spark.apache.org/downloads.html
• Spark release: 1.6.2
• package type: Pre-built for Hadoop 2.6
• mkdir ~/dev
• cd ~/dev
• tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
• ln -s spark-1.6.2-bin-hadoop2.6 spark
• mkdir notebooks
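The same steps can be scripted in Python, which may be handy if you repeat the setup on several machines. This is a sketch under assumptions: the function name `install_spark` is made up, and it assumes the first entry in the archive lives under the release's top-level directory, as with the standard Spark download.

```python
import os
import tarfile


def install_spark(tarball, dev_dir, link_name="spark"):
    """Unpack a Spark tarball into dev_dir and point a stable symlink at it."""
    if not os.path.isdir(dev_dir):
        os.makedirs(dev_dir)
    with tarfile.open(tarball, "r:gz") as archive:
        # Top-level directory in the archive, e.g. spark-1.6.2-bin-hadoop2.6
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(dev_dir)
    link_path = os.path.join(dev_dir, link_name)
    if os.path.islink(link_path):
        os.remove(link_path)
    # Relative link, equivalent to: ln -s spark-1.6.2-bin-hadoop2.6 spark
    os.symlink(top_level, link_path)
    # Working directory for notebooks, next to the Spark install.
    notebooks = os.path.join(dev_dir, "notebooks")
    if not os.path.isdir(notebooks):
        os.makedirs(notebooks)
    return link_path
```

A call such as `install_spark(os.path.expanduser("~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz"), os.path.expanduser("~/dev"))` would mirror the shell commands. The symlink means later configuration can refer to ~/dev/spark regardless of the release installed.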
PySpark configuration
• create directory ~/.ipython/kernels/pyspark1.6/
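• create file kernel.json in that directory with the JSON contents shown below (replace /Users/sparktest with your own home directory)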
• cd ~/dev/spark/conf
• cp spark-defaults.conf.template spark-defaults.conf
• add to end of spark-defaults.conf:
spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
{
"display_name": "pySpark (Spark 1.6.2) Python 2",
"language": "python",
"argv": [
"/Users/sparktest/miniconda2/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/Users/sparktest/dev/spark",
"PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
"PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
"SPARK_DRIVER_MEMORY": "10G",
"SPARK_LOCAL_IP": "127.0.0.1"
}
}
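Because kernel.json hard-codes absolute paths, it can be easier to generate it than to edit it by hand. The sketch below builds the same structure for the current user's home directory. The function name `build_kernel_spec` and its `cores` and `driver_memory` parameters are made up for this example, and looking up the py4j archive with a glob is an assumption about the Spark directory layout.

```python
import glob
import json
import os


def build_kernel_spec(python_bin, spark_home, cores=10, driver_memory="10G"):
    """Return the kernel.json contents as a dict for the given paths."""
    # Look for the bundled py4j archive; fall back to the 0.9 name used above.
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = glob.glob(pattern)
    py4j_zip = matches[0] if matches else os.path.join(
        spark_home, "python", "lib", "py4j-0.9-src.zip")
    return {
        "display_name": "pySpark (Spark 1.6.2) Python 2",
        "language": "python",
        "argv": [python_bin, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONPATH": os.pathsep.join(
                [os.path.join(spark_home, "python"), py4j_zip]),
            "PYTHONSTARTUP": os.path.join(
                spark_home, "python", "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master local[%d] pyspark-shell" % cores,
            "SPARK_DRIVER_MEMORY": driver_memory,
            "SPARK_LOCAL_IP": "127.0.0.1",
        },
    }


home = os.path.expanduser("~")
spec = build_kernel_spec(
    python_bin=os.path.join(home, "miniconda2", "bin", "python"),
    spark_home=os.path.join(home, "dev", "spark"),
)
print(json.dumps(spec, indent=2))
```

Saving the printed output as ~/.ipython/kernels/pyspark1.6/kernel.json should give the same kernel definition with your own paths filled in.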
PySpark test
• bash$ cd ~/dev
• bash$ jupyter notebook
• in the upper right of the Jupyter screen, click New and choose pySpark (Spark 1.6.2) Python 2 (or whatever display name you specified in your kernel.json file)
• in the notebook's first cell, enter sc.version and click the >| button to run it (or press CTRL + Enter)
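Checking sc.version confirms the kernel started, but it does not run a job. A slightly stronger test, sketched below, also pushes a small computation through Spark. The function name `spark_smoke_test` is made up for this example; `sc` is the SparkContext that the PySpark startup script creates.

```python
def spark_smoke_test(sc):
    """Run a tiny job on the SparkContext and return (version, total)."""
    # Distribute the integers 1..100 across the workers and add them up.
    total = sc.parallelize(range(1, 101)).sum()
    return sc.version, total
```

Running `spark_smoke_test(sc)` in a notebook cell should return the Spark version string together with a total of 5050.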
Pixiedust installation
• cd ~/dev
• git clone https://github.com/ibm-cds-labs/pixiedust.git
• pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
• pip install maven-artifact
• pip install mpld3
Examples
• Pixiedust
• https://github.com/ibm-cds-labs/pixiedust
• Demographic analyses
• http://ibm-cds-labs.github.io/open-data/samples/
• or https://github.com/ibm-cds-labs/open-data/tree/master/samples
Raj Singh
Developer Advocate: Geo | Open Data
rrsingh@us.ibm.com
http://ibm.biz/rajrsingh
Twitter: @rajrsingh
LinkedIn: rajrsingh
Thanks
