Cluj meetup bigdata-final-version

Cluj – Timisoara – Bucharest => Big Data meetup group
Cluj-Romania March 2015
Apache Spark
The next level for data processing
Radu MOLDOVAN

About me

20 years of programming mostly with open source projects

last 3 years worked in Big Data

team lead@ building the 5th
generation of
xPatterns Platform
mail: radu.adrian.moldovan@gmail.com, skype: r.moldovan

Content

- modules

- programming model (RDDs)

- architecture

- Spark & Mesos (cluster Resource Manager)

- demo (Spark Core + Spark SQL - Scala code)
Spark experience [0.8.0 → 1.2.0]
Source: spark.apache.org/

SPARK - introduction
- started in 2009 by Matei Zaharia at Berkeley
AMPLab (Ion Stoica)
- cluster computing platform (scalability &
flexibility)
- in-memory data processing for large datasets
- best alternative for Hadoop Map-Reduce,
compatible with Hadoop eco-system(HDFS &
fileFormat)


Core: Scala, Java, Python

Streaming: real-time streams processing

SQL: structured data and queries

MLlib: built-in machine learning libraries

GraphX: graph processing
SPARK - modules

SPARK – programming model
RDDs (Resilient Distributed DataSet)
- distributed and immutable collection of items + fault tolerance(auto rebuild on failure)
- create RDDs (from files(HDFS/Tachyon), parallelize data, from other RDDs)
- persistence (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)
Operations (much more than map-reduce) – more than 80
- Transformation(new RDDs):
map, filter, join, distinct, union, intersection, groupByKey, reduceByKey
- Actions(data on driver):
count, collect, foreach, reduce(func), first, saveAs{Text, Sequence,Object}File
!!! nested RDDs are not supported

SPARK - architecture
1 Job(driver program) –> [1..n] Stages
1 Stage – [1..m] Tasks

SPARK – Core – demo part 2

SPARK – Core – demo part 3

0.8.0 - first POC … lots of OOM
0.8.1 – first production deployment, still lots of OOM

20 billion healthcare records, 200 TB of compressed hdfs data, ¾ PB of uncompressed data

Hadoop MR: 100Amazon instances - m1.xlarge (4cores x 15GB RAM x 2-4TB HDD)

Daily processing reduced from 14 hours to 1.5 hours!
0.9.0 - fixed many of the problems, but still requires patches! spilling on the
reducer side fixed (less OOM)
1.1.0 & 1.2.0 extensive usage of SchemaRdds, Rows & Parquet file
format
1.3.0 – just released this year
SPARK experience [0.8.0 → 1.2.0]

…what's next for Bigdata meetup?
Contributors
Jaws, xPatterns http spark sql server!
Restful service for running Spark SQL/Shark queries on top of Spark
- Spark 0.9.1 with Shark
- Spark 1.0.1, 1.0.2, 1.1.0 and 1.0.2 with SparkSQL as backend
framework.
Backend in spray io (REST on Akka)
http://github.com/Atigeo/http-spark-sql-server
Spark Job Server
solves inability to run multiple Spark contexts from the same JVM
multiple Spark contexts with distinct JVM
job submission in Java + Scala
https://github.com/Atigeo/spark-job-rest
… BIGthank you!!!

Cluj meetup bigdata-final-version

Cluj meetup bigdata-final-version

More Related Content

What's hot (20)

Similar to Cluj meetup bigdata-final-version (20)

Recently uploaded (20)

Cluj meetup bigdata-final-version