Cluj – Timisoara – Bucharest => Big Data meetup group
Cluj-Romania March 2015
Apache Spark
The next level for data processing
Radu MOLDOVAN
About me

20 years of programming mostly with open source projects

last 3 years worked in Big Data

team lead@ building the 5th generation of xPatterns Platform
mail: radu.adrian.moldovan@gmail.com, skype: r.moldovan
Content

- modules

- programming model (RDDs)

- architecture

- Spark & Mesos (cluster Resource Manager)

- demo (Spark Core + Spark SQL - Scala code)
- Spark experience [0.8.0 → 1.2.0]
Source: spark.apache.org/
SPARK - introduction
- started in 2009 by Matei Zaharia at the Berkeley AMPLab (Ion Stoica)
- cluster computing platform (scalability & flexibility)
- in-memory data processing for large datasets
- best alternative to Hadoop MapReduce, compatible with the Hadoop ecosystem (HDFS & file formats)

SPARK - modules

Core: Scala, Java, Python

Streaming: real-time stream processing

SQL: structured data and queries

MLlib: built-in machine learning libraries

GraphX: graph processing
SPARK – programming model
RDDs (Resilient Distributed Datasets)
- distributed, immutable collection of items with fault tolerance (automatically rebuilt on failure)
- creating RDDs: from files (HDFS/Tachyon), by parallelizing data, or from other RDDs
- persistence levels: MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP
Operations (far more than map-reduce): more than 80
- Transformations (produce new RDDs):
map, filter, join, distinct, union, intersection, groupByKey, reduceByKey
- Actions (trigger computation; results go to the driver or to storage):
count, collect, foreach, reduce(func), first, saveAs{Text,Sequence,Object}File
!!! nested RDDs are not supported
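The split between transformations and actions can be illustrated with a toy, single-process model in plain Python. This is only a sketch: `ToyRDD` and its methods are hypothetical names that mimic the shape of the RDD operations listed above, not Spark's API, and `persist()` here stands in for a MEMORY_ONLY-style cache.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterator], name: str):
        self._compute = compute              # recipe (lineage) that produces the items
        self._persist = False
        self._cache: Optional[List] = None
        self.name = name

    @staticmethod
    def parallelize(data: Iterable) -> "ToyRDD":
        items = tuple(data)                  # immutable snapshot of the input
        return ToyRDD(lambda: iter(items), "parallelize")

    # --- transformations: build a new ToyRDD, do no work yet ---
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (f(x) for x in self._iterate()), self.name + ".map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.name + ".filter")

    def persist(self) -> "ToyRDD":
        self._persist = True                 # keep results in memory after first use
        return self

    def _iterate(self) -> Iterator:
        if self._cache is not None:
            return iter(self._cache)         # served from the cache
        if self._persist:
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()               # recompute from the lineage

    # --- actions: force the computation and return a result ---
    def collect(self) -> List:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []                                   # records every invocation of square()


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0).persist()

calls_before_action = len(calls)             # nothing has run yet
result = even_squares.collect()              # first action triggers the work
calls_after_collect = len(calls)
total = even_squares.count()                 # second action reads the cache
calls_after_count = len(calls)
```

In this sketch `square` is not called until `collect()` runs, and the second action does not call it again because the persisted result is reused. Without `persist()`, every action would recompute the whole chain.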
SPARK - architecture
1 Job (driver program) –> [1..n] Stages
1 Stage –> [1..m] Tasks
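A rough way to picture the job, stage and task hierarchy is the plain-Python sketch below. It assumes a simplified rule that is not spelled out on the slide: a new stage starts at every operation that needs a shuffle, and each stage runs one task per partition. The `WIDE_OPS` set, the function name and the partition count are illustrative assumptions, not Spark internals.

```python
from typing import List

# Assumption for this sketch: these operations always require a shuffle.
# (In real Spark, a join of co-partitioned RDDs may not.)
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "distinct", "intersection"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group a linear chain of operations into stages, cutting before each wide op."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])                # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


# Hypothetical job: a chain of transformations ending in the action "count".
job = ["map", "filter", "reduceByKey", "map", "join", "count"]
partitions = 4                               # assumed partitions per stage

stages = split_into_stages(job)
total_tasks = len(stages) * partitions       # one task per partition per stage

for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {stage} -> {partitions} tasks")
print(f"1 job -> {len(stages)} stages -> {total_tasks} tasks")
```

With these assumptions the example job splits into 3 stages and 12 tasks. Narrow operations such as map and filter stay in the same stage as the operation before them.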
SPARK – user interface
SPARK – Core - demo part 1
SPARK – Core – demo part 2
SPARK – Core – demo part 3
SPARK – SQL – demo part 1
SPARK – SQL – demo part 2
SPARK – SQL – demo part 3
SPARK – with Mesos (part 1)
SPARK – with Mesos (part 2)
SPARK experience [0.8.0 → 1.2.0]
0.8.0 – first POC … lots of OOM (out-of-memory errors)
0.8.1 – first production deployment, still lots of OOM

20 billion healthcare records, 200 TB of compressed HDFS data, ¾ PB of uncompressed data

Hadoop MR: 100 Amazon instances - m1.xlarge (4 cores x 15 GB RAM x 2-4 TB HDD)

Daily processing reduced from 14 hours to 1.5 hours!
0.9.0 – fixed many of the problems, but still requires patches! Spilling on the reducer side fixed (less OOM)
1.1.0 & 1.2.0 – extensive usage of SchemaRDDs, Rows & the Parquet file format
1.3.0 – just released this year
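The figures on this slide can be put in proportion with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted above, and assumes decimal units (1 PB = 1000 TB, 1 TB = 10^12 bytes); the variable names are illustrative.

```python
# Figures quoted on the slide
records = 20_000_000_000          # 20 billion healthcare records
compressed_tb = 200               # compressed HDFS data
uncompressed_tb = 750             # 3/4 PB, assuming 1 PB = 1000 TB
instances = 100                   # m1.xlarge instances
cores_per_instance = 4
ram_gb_per_instance = 15
hours_before = 14.0               # daily processing time before
hours_after = 1.5                 # daily processing time after

# Derived values
compression_ratio = uncompressed_tb / compressed_tb            # 3.75
bytes_per_record = uncompressed_tb * 10**12 // records         # 37500 bytes
total_cores = instances * cores_per_instance                   # 400
total_ram_gb = instances * ram_gb_per_instance                 # 1500
speedup = hours_before / hours_after                           # about 9.3x
time_saved_pct = (hours_before - hours_after) / hours_before * 100

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"average uncompressed record: {bytes_per_record} bytes")
print(f"cluster: {total_cores} cores, {total_ram_gb} GB RAM")
print(f"speedup: {speedup:.1f}x ({time_saved_pct:.0f}% less time)")
```

Under these assumptions the 1,500 GB of cluster RAM is well below the 200 TB compressed dataset, so the data could not have been held in memory all at once.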
…what's next for Bigdata meetup?
Contributors
Jaws, xPatterns http spark sql server!
RESTful service for running Spark SQL/Shark queries on top of Spark
- Spark 0.9.1 with Shark
- Spark 1.0.1, 1.0.2, 1.1.0 and 1.0.2 with SparkSQL as backend framework.
Backend in spray.io (REST on Akka)
http://github.com/Atigeo/http-spark-sql-server
Spark Job Server
works around the inability to run multiple Spark contexts in the same JVM
multiple Spark contexts, each in its own JVM
job submission in Java + Scala
https://github.com/Atigeo/spark-job-rest
… BIG thank you!!!
Cluj meetup bigdata-final-version

