This talk provides an overview of Apache Spark, a framework for distributed computing. It describes Resilient Distributed Datasets (RDDs), Spark's core data structure: immutable collections of records partitioned across a cluster. RDDs support two kinds of operations, transformations and actions: transformations lazily define new RDDs, while actions trigger computation and return results to the driver or write output. The talk highlights Spark's language bindings for Java, Scala, and Python and its interactive shell. It also notes Spark's compatibility with Hadoop, including deployment on YARN and reading from HDFS.
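To make the transformation/action distinction concrete, here is a minimal Scala sketch of the pattern the talk describes; it is not from the talk itself, and the input path and application name are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        // Local master for illustration; on a cluster this would be YARN.
        val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Transformations are lazy: they describe new RDDs without
        // computing anything yet. The HDFS path is a placeholder.
        val lines  = sc.textFile("hdfs:///data/events.log")
        val words  = lines.flatMap(_.split("\\s+"))
        val errors = lines.filter(_.contains("ERROR"))

        // Actions trigger the distributed computation and return
        // results to the driver.
        println(s"error lines: ${errors.count()}")
        words.take(5).foreach(println)

        sc.stop()
      }
    }

In the interactive Spark shell mentioned in the talk, the same calls can be typed directly, since the shell provides a preconfigured SparkContext as sc.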