Spark + IPython
The remix
theblackbox.io
@theblackboxio
shakespearecode.io
@shksprcodeio
27 March 2015 at @Itnig with @pybcn
Index
● Motivation
● Walkthrough
● Demo
A little about me
● Guillermo Blasco
● Graduate in Mathematics and Software
Engineering
● Developing theblackbox.io
● Working as Data Scientist
Spark, What?
● Distributed computation engine
● Based on the Resilient Distributed Dataset (RDD) abstraction
● Runs on the JVM, with APIs for Java, Scala
and Python
● Open Source
Spark, Why?
● Mainly, scalability in terms of
○ Commodity costs
○ Computation time
○ Dataset size
● Hadoop was hard to maintain
● MapReduce is a computational pattern
● RDD is a distributed data model
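To make the last two bullets concrete, here is a minimal plain-Python sketch of the MapReduce pattern applied to partitioned data. It is a toy that runs on one machine and is not how Spark is implemented; the function names (`split_into_partitions`, `map_partition`, `merge_counts`, `word_count`) are made up for this illustration.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into partitions, the way a cluster spreads data over workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_partition(lines):
    """Map step: runs independently on one partition and emits partial word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combines two partial results. It must be associative."""
    return left + right


def word_count(lines, num_partitions=3):
    partitions = split_into_partitions(lines, num_partitions)
    # In a real cluster each partition would be processed on a different worker.
    partials = [map_partition(partition) for partition in partitions]
    return reduce(merge_counts, partials, Counter())
```

Because the reduce step is associative, the result does not depend on how many partitions the data is split into, which is what lets the same program scale from one machine to many.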
IPython, What?
● Interactive computing framework
● “Python with batteries”
● Open Source
● Expanding to other languages (Jupyter)
IPython, Why?
● Powerful interactive remote shells
○ Terminal
○ Qt
○ Notebook
● Easy data visualization
● Configurable for cluster and parallel computing
● Embeddable, flexible, extensible
Wait...
● IPython can be configured to run on a cluster
● Spark has interactive Scala and Python
shells
Aren't they pretty much the same?
Well, not at all
Gin and Vodka
are not the same
Spark + IPython, Why?
● Spark is the leading general-purpose
distributed computation system today, in
terms of performance in production.
● IPython is great to experiment and develop
scientific applications.
Mix them together to get the best of both.
So, what is the goal?
● Connecting your IPython environment to a
Spark cluster lets you develop interactively
against much larger datasets
And an extra benefit...
Since Spark is production ready, you just have
to export* your IPython project to a Python
script. Meaning:
● No code translation for the production
environment
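As an illustration of what such an exported script might look like, here is a minimal sketch. The job itself (a word count), the names `tokenize`, `word_count` and `main`, and the application name are assumptions made for this example and do not come from the talk. The PySpark import is kept inside `main` so the helper functions can be reused from a notebook that already holds a Spark context.

```python
import sys


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, path):
    """Build an RDD of (word, count) pairs; `sc` is an existing SparkContext."""
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))


def main(argv):
    # Imported here so the functions above stay importable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="word-count")  # hypothetical application name
    try:
        for word, count in word_count(sc, argv[1]).take(10):
            print(word, count)
    finally:
        sc.stop()


# Run the job only when an input path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Saved as, say, `word_count.py` (a hypothetical file name), the script could be launched with `spark-submit word_count.py input.txt`, while the notebook would call `word_count(sc, path)` directly with its own context.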
Before mixing it up, understand
Spark
● Slave-Master-Client
IPython
● (Cluster-)Master-Client
Spark architecture
● Master node coordinates distribution and
resilience of RDDs.
● Slave nodes compute the operations over
RDDs.
● Client nodes connect to master to request
computations.
IPython architecture
● Master node with a kernel (computation unit)
● Slave nodes handle computations tagged as
distributed (%px)
● Client nodes connect to master to request
computations.
The plan
● Configure Spark cluster
● Link IPython kernel to one Spark context
● Use IPython clients to develop scripts with
Spark
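The second step of the plan, linking the kernel to a Spark context, can be sketched roughly as follows. This is one common approach and not necessarily the one used in the demo. It assumes the `SPARK_HOME` environment variable points at a Spark distribution that ships its Python sources under `python/` and a Py4J source zip under `python/lib/`. The function names, the default application name and the master URL in the usage note are hypothetical.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the entries that must be on sys.path before `import pyspark` works."""
    python_dir = os.path.join(spark_home, "python")
    # The Py4J zip name contains a version number that changes between Spark releases.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def link_kernel_to_spark(master_url, app_name="ipython-kernel"):
    """Create one SparkContext for the current kernel, connected to `master_url`."""
    spark_home = os.environ["SPARK_HOME"]
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)

    # Import only after sys.path has been extended.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master_url).setAppName(app_name)
    return SparkContext(conf=conf)
```

In a notebook cell this might be used as `sc = link_kernel_to_spark("spark://master-host:7077")`, after which every client attached to that kernel shares the same `sc`. Placing the code in an IPython profile startup file would make the context available automatically.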
Hands On!
Let’s drink
Gin with Vodka
https://github.com/theblackboxio/spark-ipython
Conclusions
● Computational power of Spark
● Interactiveness of IPython
● Viable, not that hard to configure
● Also fun
Complexities
● Sysadmin work
● Python dependencies
Thanks!
Questions?
Thanks to:
Python BCN (@pybcn)
Itnig (@Itnig)
