Apache Spark for applied machine learning
Friso van Vollenhoven
@fzk
frisovanvollenhoven@godatadriven.com
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
This talk is about tools.
Resilient Distributed Dataset
•Immutable set of records (e.g. tuples)
•Distributed across a cluster of workers
•Stored in RAM or on disk (partially)
•Built through transformations
•Automatically rebuilt on failure
•Possibly replicated
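The properties above can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's API: `ToyRDD`, `compute`, `lose_partition` and the in-memory partition lists are hypothetical names made up for this example. It assumes that each dataset remembers its parent and the function that derives it (its lineage), so a lost partition can be recomputed instead of restored from a copy.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the recipe (lineage) to rebuild them."""

    def __init__(self,
                 source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None):
        self._source = source   # only set for the root dataset
        self._parent = parent   # lineage: the dataset this one is derived from
        self._fn = fn           # lineage: per-partition function applied to the parent
        self._cache = {}        # partition index -> materialised records

    def map(self, f):
        # A transformation returns a new ToyRDD; the original is left untouched.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute(self, i):
        # Return partition i, recomputing it from lineage if it is not cached.
        if i not in self._cache:
            if self._parent is None:
                records = list(self._source[i])
            else:
                records = self._fn(self._parent.compute(i))
            self._cache[i] = tuple(records)
        return list(self._cache[i])

    def lose_partition(self, i):
        # Simulate a worker failure that drops one materialised partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


# Three partitions, as if spread over three workers.
base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)

first = squared.collect()
squared.lose_partition(1)      # "worker 1" fails
rebuilt = squared.collect()    # partition 1 is recomputed from base
```

In this sketch only the lost partition is recomputed; the other two are still served from the cache, which is the idea behind rebuilding from lineage rather than relying on replication alone.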
Operations
•Operate on RDDs
•Either create a new RDD
•Or materialise the RDD and return data
•Transformations: map, filter, groupBy, etc.
•Actions: count, collect, reduce, save, etc.
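The split between transformations and actions can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `LazyDataset` is a hypothetical class invented for this example, and it assumes that transformations only record what to do while actions actually run the recorded steps.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy model: transformations record work, actions run it."""

    def __init__(self, records, steps=()):
        self._records = records
        self._steps = steps    # tuple of ("map" | "filter", function)

    # Transformations: return a new dataset, do no work yet.
    def map(self, f):
        return LazyDataset(self._records, self._steps + (("map", f),))

    def filter(self, predicate):
        return LazyDataset(self._records, self._steps + (("filter", predicate),))

    def _run(self):
        # Chain the recorded steps into one lazy iterator over the records.
        out = iter(self._records)
        for kind, f in self._steps:
            out = map(f, out) if kind == "map" else filter(f, out)
        return out

    # Actions: materialise the data and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, f):
        return fold(f, self._run())


calls = []

def traced_double(x):
    calls.append(x)    # record every time the function really runs
    return 2 * x

numbers = LazyDataset(range(1, 6))
doubled = numbers.map(traced_double).filter(lambda x: x % 4 == 0)

calls_before_action = len(calls)              # nothing has run yet
result = doubled.collect()                    # action: runs map and filter
total = doubled.reduce(lambda a, b: a + b)    # action: runs the steps again
n = doubled.count()                           # action: runs the steps again
```

Because nothing is kept between actions in this sketch, each action reruns the whole chain. That is why keeping a reused dataset in RAM, as mentioned on the previous slide, pays off.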
The good parts
•Language bindings for Java, Scala and Python
•Works interactively from a shell:
  •Scala + IPython (notebook)
•Plays nice with Hadoop
  •Deploy on top of YARN cluster manager
  •Read data from HDFS
•Hadoop-like fault tolerance
The better part?
https://github.com/Bridgewater/scala-notebook
https://github.com/Sotera/spark-distributed-louvain-modularity
GoDataDriven
We’re hiring / Questions? / Thank you!
@fzk	

frisovanvollenhoven@godatadriven.com
Friso van Vollenhoven
