Spark tutorial: developing locally and deploying on EMR
Use cases (my biased opinion)
• Interactive and Expressive Data Analysis
• If you feel limited when trying to express yourself in “group by”, “join” and
“where”
• Only if it is not possible to work with datasets locally
• Entering Danger Zone:
• Spark as a SQL engine, like Impala/Hive
• Speeding up ETLs if your data can fit in memory (speculation)
• Machine learning
• Graph analytics
• Streaming (not mature yet)
Possible working styles
• Develop in IDE
• Develop as you go in Spark shell
IDE:
• Easier to work with objects, inheritance and package management
• Requires some hacking to get programs to run on both Windows and production environments

Spark-shell:
• Easier to debug code with production-scale data
• Will only run on Windows if the spark-shell launcher scripts have correct line endings, or if you use Cygwin
IntelliJ IDEA
• Basic set-up: https://gitz.adform.com/dspr/audience-extension/tree/38b4b0588902457677f985caf6eb356e037a668c/spark-skeleton
Hacks
• 99% chance that on Windows you won’t be able to use the `saveAsTextFile()` function
• Download winutils.exe from http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
• Place it in a bin folder somewhere on your PC (C:\somewhere\bin\winutils.exe) and set the following in your code before using the save function (see the sketch below):
System.setProperty("hadoop.home.dir", "C:\\somewhere")
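For reference, a minimal end-to-end sketch (the app name and paths are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object SaveOnWindows {
  def main(args: Array[String]): Unit = {
    // Point Hadoop at the folder that contains bin\winutils.exe (not at the exe itself)
    System.setProperty("hadoop.home.dir", "C:\\somewhere")

    val sc = new SparkContext(
      new SparkConf().setAppName("save-on-windows").setMaster("local[*]"))

    sc.parallelize(1 to 100)
      .map(n => s"line $n")
      .saveAsTextFile("C:\\tmp\\spark-output") // fails on Windows without winutils.exe
    sc.stop()
  }
}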
When you are done with your code…
• It is time to package everything into a fat jar with sbt-assembly
• Add “provided” to the Spark library dependencies, since the Spark libs are already on the classpath when you run the job on EMR with Spark set up
• Find more info in the Audience Extension project’s Spark branch build.sbt file (a minimal sketch follows below).
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0" % "provided"
Running on EMR
• build.sbt can be configured (S3 package) to upload the fat jar to S3 after assembly; if you don’t have that, just upload it manually
• Run the bootstrap action s3://support.elasticmapreduce/spark/install-spark with the arguments -v 1.2.0.a -x -g (some documentation at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark)
• Also install Ganglia for monitoring cluster load (run this before the Spark bootstrap step); see the example below
• If you don’t install Ganglia, SSH tunnels to the Spark UI won’t work.
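Putting it together, creating the cluster might look roughly like this with the AWS CLI of that era (cluster name, key and instance parameters are illustrative; the Ganglia bootstrap action path is an assumption based on the standard EMR bootstrap actions):

aws emr create-cluster \
  --name "spark-cluster" \
  --ami-version 3.3.1 \
  --instance-type c3.8xlarge \
  --instance-count 8 \
  --ec2-attributes KeyName=my-key \
  --bootstrap-actions \
    Path=s3://elasticmapreduce/bootstrap-actions/install-ganglia \
    Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-v,1.2.0.a,-x,-g]

Note that the Ganglia action is listed first, matching the advice to run it before the Spark bootstrap step.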
Start with local mode first
Use only one instance in the cluster and submit your jar like this:

/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master local[16] \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  SimilarityJob.jar \
    --remote \
    --input s3://adform-dsp-warehouse/data/facts/impressions/dt=20150109/* \
    --output s3://dev-adform-data-engineers/tmp/spark/2days \
    --similarity-threshold 300
Run on multiple machines with yarn master
/home/hadoop/spark/bin/spark-submit \
  --class com.adform.dspr.SimilarityJob \
  --master yarn \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 116736M \
  --executor-cores 16 \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4 \
  SimilarityJob.jar \
    --remote \
    … … …

(use --deploy-mode cluster to run the driver on the cluster instead)

Executor parameters are optional; the bootstrap script will automatically try to maximize the Spark configuration options. Note that the scripts are not aware of the tasks you are running, they only read the EMR cluster specifications.
Spark UI
• You need to set up an SSH tunnel to access it from your PC (see the example below)
• An alternative is the command-line browser lynx
• When you submit an app with the local master, the UI will be at ip:4040
• When you submit with the YARN master, go to the Hadoop UI on port 9026; it will show the Spark task running. Click on ApplicationMaster in the Tracking UI column, or take the UI URL from the command-line output when you submit the task
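For example, a plain local port forward (the key file and master hostname are hypothetical):

# Spark UI when running with a local master
ssh -i ~/my-key.pem -N -L 4040:localhost:4040 hadoop@<master-public-dns>

# Hadoop UI on port 9026 when running on YARN
ssh -i ~/my-key.pem -N -L 9026:localhost:9026 hadoop@<master-public-dns>

Then browse to http://localhost:4040 or http://localhost:9026. A dynamic SOCKS proxy (ssh -N -D 8157 …) plus a browser proxy extension handles the redirects between the YARN and Spark UIs more gracefully.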
Spark UI
For Spark 1.2.0 the Executors tab is wrong and the Storage tab is always empty; the only useful tabs are Jobs, Stages and Environment.
Some useful settings
• spark.hadoop.validateOutputSpecs: useful when developing; set it to false so that you can overwrite output files
• spark.default.parallelism (number of output files / number of cores): configured automatically when you run the bootstrap action with the -x option
• spark.shuffle.consolidateFiles (default false)
• spark.rdd.compress (default false)
• spark.akka.timeout, spark.akka.frameSize, spark.speculation, …
• http://spark.apache.org/docs/1.2.0/configuration.html
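These can be passed with --conf on spark-submit, or set in code; a small sketch (values illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SimilarityJob")
  .set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting output while developing
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
  .set("spark.default.parallelism", "112")
val sc = new SparkContext(conf)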
Spark shell
/home/hadoop/spark/bin/spark-shell \
  --master <yarn|local[*]> \
  --deploy-mode client \
  --num-executors 7 \
  --executor-memory 4G \
  --executor-cores 16 \
  --driver-memory 4G \
  --conf spark.default.parallelism=112 \
  --conf spark.task.maxFailures=4
Spark shell
• In the Spark shell you don’t need to instantiate a Spark context; one is already instantiated as sc, but you can create another if you like
• Type Scala expressions and see what happens
• Note the lazy evaluation: to force evaluation for debugging, use action functions like [expression].take(n) or [expression].count to check that your statements are OK, as in the session below
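A quick illustrative session (the S3 path is hypothetical):

scala> val impressions = sc.textFile("s3://some-bucket/impressions/dt=20150109/*")
scala> val parsed = impressions.map(_.split('\t')) // lazy: nothing has run yet
scala> parsed.take(5)  // action: computes only a small sample
scala> parsed.count    // action: runs over the whole dataset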
Summary
• Spark is better suited for development on Linux
• Don’t trust the Amazon bootstrap scripts; check with Ganglia whether your application is actually utilizing the cluster’s resources
• Try to write your Scala code so that parts of it can be run in spark-shell; otherwise it is hard to debug problems that occur only at production dataset scale.