Spark + IPython
The remix
theblackbox.io
@theblackboxio
shakespearecode.io
@shksprcodeio
27 March 2015 at @Itnig with @pybcn

Index
● Motivation
● Walkthrough
● Demo

A little about me
● Guillermo Blasco
● Graduate in Mathematics and Software
Engineering
● Developing theblackbox.io
● Working as Data Scientist

Spark, What?
● Distributed computation engine
● Based on Resilient Distributed Dataset
● Runs on the JVM, with APIs for Java, Scala
and Python
● Open Source

Spark, Why?
● Mainly, scalability in terms of
○ Commodity costs
○ Computation time
○ Dataset size
● Hadoop was hard to maintain
● MapReduce is a computational pattern
● RDD is a distributed data model
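To make "MapReduce is a computational pattern" concrete, here is a minimal sketch of the pattern in plain Python, with no Spark involved. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative and not from the talk. In PySpark the same idea is typically chained on an RDD with flatMap, map and reduceByKey, and the grouping step is done by the engine across the cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key with an associative operation (addition)."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["gin and vodka", "gin with vodka", "gin"])
```

Because the reduce operation is associative, each key can be reduced independently, which is what allows a distributed engine to split the work across nodes.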

IPython, What?
● Interactive computing framework
● “python with batteries”
● Open Source
● Expanding to other languages (Jupyter)

IPython, Why?
● Powerful interactive remote shells
○ Terminal
○ Qt
○ Notebook
● Easy data visualization
● Configurable in cluster and in parallel
● Embeddable, flexible, extensible

Wait...
● IPython is cluster configurable
● Spark has an interactive Scala and Python
shell
Aren't they pretty much the same?

Well, not at all
Gin and Vodka
are not the same

Spark + IPython, Why?
● Spark is the leading general-purpose
distributed computation system today, in
terms of productive performance.
● IPython is great to experiment and develop
scientific applications.
Mix them together to get the best of both.

So, what is the goal?
● Connecting your IPython environment to a
Spark cluster lets you develop against
much larger datasets

And an extra benefit...
Since Spark is production ready, you just have
to export* your IPython project to a Python
script. Meaning:
● No code translations to production
environment
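As a rough illustration of what exporting a notebook to a script involves, the sketch below pulls the code cells out of a notebook file using only the standard library. It assumes the version 4 notebook format (a top-level "cells" list) and comments out lines starting with % or !, since those only work inside IPython. The function name notebook_to_script is hypothetical. In practice the built-in nbconvert tool (for example `ipython nbconvert --to python`) is the usual route.

```python
import json


def notebook_to_script(notebook_json):
    """Concatenate the code cells of a v4 notebook into plain Python source."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # the format allows a list of lines
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # IPython magics and shell escapes are not valid in a plain script
            if line.lstrip().startswith(("%", "!")):
                lines.append("# " + line)
            else:
                lines.append(line)
        if lines:
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


# A tiny in-memory notebook, made up for this example
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Word count"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n", "total = 1 + 2"]},
        {"cell_type": "code", "source": "print(total)"},
    ],
})
script = notebook_to_script(example)
```

The resulting text can be written to a .py file and submitted to the cluster like any other PySpark script, provided the script creates its own Spark context.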

Before mixing it up, understand
Spark
● Slave-Master-Client
IPython
● (Cluster-)Master-Client

Spark architecture
● Master node coordinates distribution and
resilience of RDD.
● Slave nodes compute the operations over
RDDs.
● Client nodes connect to master to request
computations.

IPython architecture
● Master node with a kernel (computation unit)
● Slave nodes handle computations tagged as
distributed (%px)
● Client nodes connect to master to request
computations.

The plan
● Configure Spark cluster
● Link IPython kernel to one Spark context
● Use IPython clients to develop scripts with
Spark
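One possible way to link an IPython front end to a Spark context is to let the pyspark launcher start IPython as its driver, using the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables that Spark supports from version 1.2. The sketch below only builds the environment and command line. The helper name, the /opt/spark path and the master URL are assumptions for illustration, and the repository linked in the hands-on section may set things up differently.

```python
import os


def pyspark_notebook_launch(spark_home, master_url, base_env=None):
    """Return (env, argv) for starting pyspark with an IPython notebook driver."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    # Ask the pyspark launcher to use IPython instead of the plain interpreter
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    # Extra arguments handed to IPython: start the notebook server
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    argv = [os.path.join(spark_home, "bin", "pyspark"), "--master", master_url]
    return env, argv


# Hypothetical install path and master URL; replace with your own
env, argv = pyspark_notebook_launch(
    "/opt/spark", "spark://master-host:7077", base_env={}
)

# To actually launch (requires Spark installed at spark_home):
# import subprocess
# subprocess.call(argv, env=env)
```

Notebooks opened from a session started this way get a ready-made Spark context bound to the name sc, so cells can call it directly.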

Hands On!
Let’s drink
Gin with Vodka
https://github.com/theblackboxio/spark-ipython

Conclusions
● Computational power of Spark
● Interactiveness of IPython
● Viable, not that hard to configure
● Also fun

Complexities
● Sysadmin work
● Python dependencies

Thanks!
Questions?
Thanks to:
Python BCN (@pybcn)
Itnig (@Itnig)
