Spark tutorial, developing locally and deploying on EMR
Use cases (my biased opinion)
• Interactive and expressive data analysis (see the sketch after this list)
  • If you feel limited when trying to express yourself in “group by”, “join” and “where”
  • Only if it is not possible to work with the dataset locally
• Entering the danger zone:
  • Spark SQL as a query engine, like Impala/Hive
  • Speeding up ETLs if your data can fit in memory (speculation)
  • Machine learning
  • Graph analytics
  • Streaming (not mature yet)
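To make the “expressive” point concrete, here is a hedged sketch (not from the slides; the input path and field layout are made up) of per-group logic that is awkward in pure “group by”/“join”/“where” SQL:

import org.apache.spark.{SparkConf, SparkContext}

object SessionGaps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SessionGaps").setMaster("local[*]"))

    // Hypothetical input: tab-separated (userId, timestamp) events.
    val events = sc.textFile("events.tsv")
      .map(_.split('\t'))
      .map(f => (f(0), f(1).toLong))

    // Arbitrary Scala logic per group: average gap between a user's events --
    // easy here, painful with only "group by", "join" and "where".
    val avgGaps = events.groupByKey().mapValues { ts =>
      val sorted = ts.toSeq.sorted
      val gaps = sorted.zip(sorted.tail).map { case (a, b) => b - a }
      if (gaps.isEmpty) 0.0 else gaps.sum.toDouble / gaps.size
    }

    avgGaps.take(5).foreach(println)
    sc.stop()
  }
}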
Possible working styles
• Develop in an IDE
• Develop as you go in the Spark shell

IDE:
• Easier to manipulate objects, inheritance and package management
• Requires some hacking to get programs to run on both Windows and production environments

Spark shell:
• Easier to debug code with production-scale data
• Will only run on Windows if you fix the line endings in the spark-shell launcher scripts or use Cygwin
IntelliJ IDEA
• Basic set-up: https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton
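A minimal local-mode skeleton in the spirit of that set-up might look like this (the object name, input argument and word-count logic are hypothetical; Spark 1.2-era API):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountJob")
      .setMaster("local[*]") // only for IDE runs; drop this when submitting to a cluster

    val sc = new SparkContext(conf)
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}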
Hacks
• There is a 99% chance that on Windows you won’t be able to use the function `saveAsTextFile()`
• Download winutils.exe from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
• Place it in a bin folder somewhere on your PC (C:\somewhere\bin\winutils.exe) and set the following in your code before using the save function:

System.setProperty("hadoop.home.dir", "C:\\somewhere")
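Putting the hack together, a sketch of the ordering that works; the paths are placeholders, and the property has to be set before the first Hadoop filesystem call:

import org.apache.spark.{SparkConf, SparkContext}

// Set before creating the SparkContext -- Hadoop reads it on first use.
System.setProperty("hadoop.home.dir", "C:\\somewhere") // expects C:\somewhere\bin\winutils.exe

val sc = new SparkContext(
  new SparkConf().setAppName("WindowsSave").setMaster("local[*]"))

sc.parallelize(1 to 100)
  .map(i => "line " + i)
  .saveAsTextFile("C:\\tmp\\spark-output") // fails without winutils.exe on Windows
sc.stop()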
When you are done with your code…
• It is time to package everything into a fat JAR with sbt assembly
• Add “provided” to the Spark library dependencies, since the Spark libs are already on the classpath when you run the job on EMR with Spark set up
• Find more info in the Audience Extension project’s Spark branch build.sbt file.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
Running on EMR
• build.sbt can be configured (S3 package) to upload the fat JAR to S3 when assembly is done; if you don’t have that, just upload it manually
• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark); see the launch sketch after this list
• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step)
• If you don’t install Ganglia, SSH tunnels to the Spark UI won’t work.
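For reference, a hedged sketch of launching such a cluster with both bootstrap actions from the AWS CLI; the Ganglia bootstrap path, AMI version, instance type/count and key name are assumptions from the EMR 3.x era, not from the slides:

aws emr create-cluster \
  --name "spark-tutorial" \
  --ami-version 3.3.1 \
  --instance-type r3.8xlarge \
  --instance-count 2 \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions \
    Path=s3://elasticmapreduce/bootstrap-actions/install-ganglia \
    Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v","1.2.0.a","-x","-g"]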
Start with local mode first
Use only one instance in the cluster and submit your jar with this:

/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master local[16] \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  SimilarityJob.jar \
  --remote \
  --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \
  --output s3://dev-adform-data-engineers/tmp/spark/2days \
  --similarity-threshold 300
Run on multiple machines with yarn master

/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 116736M \
  --executor-cores 16 \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4 \
  SimilarityJob.jar \
  --remote \
  … … …

Use --deploy-mode cluster instead of client to run the driver on the cluster rather than on the submitting machine.

The executor parameters are optional; the bootstrap script will automatically try to maximize the Spark configuration options. Note that the scripts are not aware of the tasks you are running; they only read the EMR cluster specifications.
Spark UI
• You need to set up an SSH tunnel to access it from your PC (see the sketch after this list)
• An alternative is to use the command-line browser lynx
• When you submit an app with the local master, the UI is at <driver-ip>:4040
• When you submit with the YARN master, go to the Hadoop UI on port 9026; it will show the Spark task running. Click on ApplicationMaster in the Tracking UI column, or get the UI URL from the command-line output when you submit the task
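A sketch of the tunnel itself; the key file and master hostname are placeholders:

# Forward the local-master UI (port 4040) to your PC
ssh -i ~/my-key.pem -N -L 4040:localhost:4040 hadoop@<emr-master-public-dns>

# Or open a SOCKS proxy and point your browser at it, so the
# Hadoop UI (port 9026) and the links it serves also resolve
ssh -i ~/my-key.pem -N -D 8157 hadoop@<emr-master-public-dns>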
Spark UI
For Spark 1.2.0 the Executors tab is wrong and the Storage tab is always empty; the only useful tabs are Jobs, Stages and Environment.
Some useful settings
• spark.hadoop.validateOutputSpecs — useful when developing; set it to false and you can overwrite output files
• spark.default.parallelism (number of output files / number of cores), automatically configured when you run the bootstrap actions with the -x option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
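These can also be set in code rather than on the command line; a small sketch reusing values that appear elsewhere in these slides:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SimilarityJob")
  .set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting output while developing
  .set("spark.default.parallelism", "112")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")

val sc = new SparkContext(conf)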
Spark shell

/home/hadoop/spark/bin/spark-shell \
  --master <yarn|local[*]> \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 4G \
  --executor-cores 16 \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4
Spark shell
• In the Spark shell you don’t need to instantiate a Spark context; it is already instantiated as sc, but you can create another one if you like
• Type Scala expressions and see what is happening
• Note the lazy evaluation: to force expression evaluation for debugging, use action functions like [expression].take(n) or [expression].count to see if your statements are OK
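For example, a short hypothetical session (the input path is made up):

// Nothing is read yet: these are lazy transformations
val impressions = sc.textFile("s3://dev-adform-data-engineers/tmp/spark/sample/*")
val cookies = impressions.map(_.split('\t')).map(fields => fields(0))

// Actions force evaluation, so errors in the pipeline show up here
cookies.take(5).foreach(println)
cookies.distinct().count()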
Summary
• Spark is better suited to development on Linux
• Don’t trust the Amazon bootstrap scripts; check whether your application is utilizing resources with Ganglia
• Try to write your Scala code so that parts of it can be run in spark-shell; otherwise it is hard to debug problems that occur only at production dataset scale.