Walaa Assy
Software Developer, Giza Systems
SPARK
LIGHTNING-FAST UNIFIED ANALYTICS
ENGINE
HOW DO WE HANDLE
EVER GROWING DATA
THAT HAS BECOME BIG
DATA?
Basics of Spark
Core API
 Cluster Managers
Spark Maintenance
Libraries
 - SQL
 - Streaming
 - MLlib
 GraphX
Troubleshooting /
Future of Spark
AGENDA
Apache Spark - Architecture, Overview & Libraries
 Readability
 Expressiveness
 Fast
 Testability
 Interactive
 Fault Tolerant
 Unify Big Data
Spark officially set a new record in large-scale sorting. Spark can
compute on disk, but it gains most of its speed by caching data in
memory.
WHY SPARK? TINIER CODE LEADS TO..
 MapReduce has a very narrow scope, mainly batch processing
 Each new class of problem needed a new API to solve it
EXPLOSION OF MAP REDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction of Spark
 Spark operations fall into two main categories:
 Transformations [lazily evaluated, only storing the intent]
 Actions
 val textFile = sc.textFile("file:///spark/README.md")
 textFile.first // action
RDD [RESILIENT DISTRIBUTED DATASET]
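The transformation/action split above can be sketched with plain Scala lazy views, which behave analogously; a real RDD would come from `sc.textFile(...)` and need a running SparkContext, so this is only a sketch of the laziness:

```scala
// Transformations only record intent; nothing runs until an action.
var evaluated = 0
val lines = Seq("spark is fast", "spark is unified")

// "transformation": builds a lazy pipeline; the closure has not run yet
val words = lines.view.flatMap { l => evaluated += 1; l.split(" ") }
assert(evaluated == 0)   // intent stored, not executed

// "action": forces evaluation of the whole pipeline
val count = words.size
assert(evaluated == 2)   // both lines were processed
assert(count == 6)
```

The same thing happens with `textFile.first` on a real RDD: the file is only read when the action runs.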
HELLO BIG DATA
 sudo yum install wget
 sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
 tar xvf scala-2.13.0-M4.tgz
 sudo mv scala-2.13.0-M4 /usr/lib
 sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
 export PATH=$PATH:/usr/lib/scala/bin
SCALA INSTALLATION STEPS
 sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
 tar xvf spark-2.3.1-bin-hadoop2.7.tgz
 ln -s spark-2.3.1-bin-hadoop2.7 spark
 export SPARK_HOME=$HOME/spark
 export PATH=$PATH:$SPARK_HOME/bin
SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel…
 A collection similar to a list or an array from a user level
 Processed in parallel to speed up computation, with fault tolerance built in
 RDDs are immutable
 Transformations are lazy and stored in a DAG
 Actions trigger DAGs
 A DAG is a graph of tasks
 Each action triggers a fresh execution of the graph
RDD
INPUT DATASETS TYPES
Apache spark - Architecture , Overview & libraries
 map
 flatMap
 filter
 distinct
 sample
 union
 intersection
 subtract
 cartesian
Transformations return RDDs
TRANSFORMATIONS IN SPARK
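These transformations map one-to-one onto Scala collection methods, which is a convenient way to see their semantics; the equivalent hypothetical RDD calls (on an RDD built with `sc.parallelize(nums)`) are shown in comments:

```scala
val nums = Seq(1, 2, 2, 3)

val doubled = nums.map(_ * 2)                    // rdd.map(_ * 2)
val chars   = Seq("12", "34").flatMap(_.toSeq)   // rdd.flatMap(_.toSeq)
val evens   = nums.filter(_ % 2 == 0)            // rdd.filter(_ % 2 == 0)
val uniq    = nums.distinct                      // rdd.distinct()
val both    = nums ++ Seq(4)                     // rdd.union(other)
val common  = nums.intersect(Seq(2, 3, 5))       // rdd.intersection(other)
val rest    = nums.filterNot(Set(2))             // rdd.subtract(other)
val pairs   = for (a <- Seq(1, 2); b <- Seq("x")) yield (a, b) // rdd.cartesian(other)

assert(doubled == Seq(2, 4, 4, 6))
assert(uniq == Seq(1, 2, 3))
assert(pairs == Seq((1, "x"), (2, "x")))
```

Every result here is again an ordinary collection, just as every RDD transformation returns another RDD.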
 collect()
 count()
 take(num)
 takeOrdered(num)(ordering)
 reduce(function)
 aggregate(zeroValue)(seqOp, combOp)
 foreach(function)
 Actions return different types according to each action
saveAsObjectFile(path)
saveAsTextFile(path) // saves as text file
External connector
foreach(T => Unit) // one object at a time
 - foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
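The actions above also have collection analogues, sketched here in plain Scala (the commented RDD calls are what you would run on a real RDD):

```scala
val nums = Seq(5, 1, 4, 2)

val n      = nums.size                 // rdd.count()
val first3 = nums.take(3)              // rdd.take(3)
val lowest = nums.sorted.take(2)       // rdd.takeOrdered(2)
val sum    = nums.reduce(_ + _)        // rdd.reduce(_ + _)

// aggregate: seqOp folds within a partition, combOp merges partition
// results; simulated here with a single foldLeft building (sum, count)
val (s, c) = nums.foldLeft((0, 0)) { case ((acc, cnt), x) => (acc + x, cnt + 1) }

assert(n == 4 && sum == 12)
assert(lowest == Seq(1, 2))
assert((s, c) == (12, 4))
```

Unlike transformations, each of these returns a plain value to the driver rather than another RDD.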
 SQL-like joins
 - join
 - fullOuterJoin
 - leftOuterJoin
 - rightOuterJoin
 Pair saving
 - saveAs(NewAPI)HadoopFile(path, keyClass, valueClass, outputFormatClass)
 - saveAs(NewAPI)HadoopDataSet(conf)
 - saveAsSequenceFile: shorthand for saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
PAIR METHODS - CONTD.
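The join semantics on (key, value) pairs can be sketched with plain Scala sequences; on real pair RDDs these would be `rdd.join(other)` and `rdd.leftOuterJoin(other)`:

```scala
val left  = Seq("a" -> 1, "b" -> 2)
val right = Seq("a" -> 10, "c" -> 30)

// inner join: keep only keys present on both sides
def innerJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

// left outer join: keep every left key, Option on the right side
def leftOuterJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rm = r.toMap   // assumes unique keys on the right, for simplicity
  l.map { case (k, v) => (k, (v, rm.get(k))) }
}

assert(innerJoin(left, right) == Seq("a" -> (1, 10)))
assert(leftOuterJoin(left, right) == Seq("a" -> (1, Some(10)), "b" -> (2, None)))
```

`fullOuterJoin` and `rightOuterJoin` follow the same pattern with the Option on the other (or both) sides.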
 Works like a distributed kernel
 Spark ships with a basic built-in (standalone) cluster manager
 Hadoop's cluster manager: YARN
 Apache Mesos
PRIMARY CLUSTER MANAGER
SPARK-SUBMIT DEMO
SPARK SQL
 Spark SQL is Apache Spark's module for working with structured or semi-structured data.
 It is meant to be usable by non-big-data users
 As Spark continues to grow, we want to enable wider
audiences beyond “Big Data” engineers to leverage the power
of distributed processing.
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
 Seamlessly mix SQL queries with Spark programs
Spark SQL lets you query structured data inside Spark programs,
using either SQL or a familiar DataFrame API
 Connect to any data source the same way.
 It executes SQL queries.
 We can read data from existing Hive installation using
SparkSQL.
 When we run SQL within another programming language we
will get the result as Dataset/DataFrame.
SPARK SQL FEATURES
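Mixing SQL with ordinary Spark code looks like the sketch below. The plain-collection line shows the semantics and is what actually runs here; the commented calls are the Spark SQL equivalents and assume a hypothetical SparkSession named `spark`:

```scala
// With a SparkSession, the same query would be:
//   val df = spark.createDataFrame(people)
//   df.createOrReplaceTempView("people")
//   spark.sql("SELECT name FROM people WHERE age > 21").collect()
// or, with the DataFrame API:
//   df.filter($"age" > 21).select($"name")
case class Person(name: String, age: Int)
val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Eve", 45))

val names = people.filter(_.age > 21).map(_.name)
assert(names == Seq("Ann", "Eve"))
```

Either form (SQL string or DataFrame calls) produces the same Dataset/DataFrame result.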
DataFrames and SQL provide a common way to access a variety
of data sources, including Hive, Avro, Parquet, ORC, JSON, and
JDBC. You can even join data across these sources.
 Run SQL or HiveQL queries on existing warehouses.[Hive
Integration]
 Connect through JDBC or ODBC.[Standard Connectivity]
 It is included with Spark
DATAFRAMES
 Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python.
We can create DataFrame using:
 Structured data files
 Tables in Hive
 External databases
 Using existing RDD
SPARK DATAFRAME IS
DataFrames = schema RDDs
EXAMPLES
SPARK SQL COMPETITION
 Hive
 Parquet
 JSON
 Avro
 Amazon Redshift
 CSV
 Others
Spark SQL is recommended as a starting point for any Spark application, as it adds:
 Predicate push-down
 Column pruning
 The ability to mix SQL & RDDs
SPARK SQL DATA SOURCES
SPARK STREAMING
 Big & fast data
 Gigabytes per second
 Real time fraud detection
 Marketing
 Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
SPARK STREAMING
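Spark Streaming splits a live stream into micro-batches and applies the same RDD operations to each batch. A sketch with a Seq of batches standing in for a DStream (a real one would come from a StreamingContext, e.g. `ssc.socketTextStream(...)`, which needs a running cluster):

```scala
val batches = Seq(
  Seq("error: disk", "ok"),          // micro-batch at t0
  Seq("ok", "error: net", "ok")      // micro-batch at t1
)

// analogous to dstream.filter(_.startsWith("error")).count() per batch
val errorsPerBatch = batches.map(_.count(_.startsWith("error")))
assert(errorsPerBatch == Seq(1, 1))
```

The key point: the per-batch logic is ordinary Spark code, which is what makes streaming applications easy to write and test.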
SPARK STREAMING COMPETITORS
Streaming data sources:
• Kafka
• Flume
• Twitter
• Hadoop HDFS
• Others (live logs, system telemetry data, IoT device data, etc.)
SPARK MLLIB
 MLlib is a standard component of Spark providing machine
learning primitives on top of Spark.
SPARK MLLIB
 MATLAB
 R
EASY TO USE BUT NOT SCALABLE
 MAHOUT
 GRAPHLAB
Scalable, but at the cost of ease of use
 org.apache.spark.mllib
RDD-based algorithms
 org.apache.spark.ml
 Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
 Loading the data
 Extracting features
 Training the data
 Testing the data
 The new pipeline API allows tuning, testing, and early failure detection
MACHINE LEARNING FLOW
 Algorithms
Classification, e.g. naïve Bayes
Regression
Linear
Logistic
Collaborative filtering by ALS (alternating least squares)
Clustering by k-means
Dimensionality reduction by SVD (singular value decomposition)
 Feature extraction and transformations
TF-IDF: term frequency - inverse document frequency
ALGORITHMS IN MLLIB
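TF-IDF is simple enough to compute by hand, which shows what MLlib is doing conceptually: tf(t, d) * log(N / df(t)). A minimal sketch on a toy corpus (MLlib's `HashingTF` and `IDF` do this over RDDs of term-frequency vectors, with a slightly smoothed idf formula):

```scala
val docs = Seq(
  Seq("spark", "fast"),
  Seq("spark", "sql")
)
val n = docs.size.toDouble

// term frequency: share of the document made up of this term
def tf(term: String, doc: Seq[String]): Double =
  doc.count(_ == term).toDouble / doc.size

// inverse document frequency: rare terms score higher
def idf(term: String): Double =
  math.log(n / docs.count(_.contains(term)))

// "spark" appears in every document, so its idf (and tf-idf) is 0
assert(idf("spark") == 0.0)
// "sql" is rarer, so it gets a positive tf-idf in document 1
assert(tf("sql", docs(1)) * idf("sql") > 0.0)
```

This is why TF-IDF surfaces distinctive words: terms common to the whole corpus are weighted down to zero.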
 Spam filtering
 Fraud detection
 Recommendation analysis
 Speech recognition
PRACTICAL USE
 Word2Vec algorithm
 This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
 We cache the RDDs because MLlib makes multiple passes over the same data, so the memory cache can reduce processing time a lot
 Breeze is the numerical processing library used inside Spark
 It can perform mathematical operations on vectors
MLLIB DEMO
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel
computation.
 Page ranking
 Producing evaluations
 It can be used in genetic analysis
 ALGORITHMS
 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count
GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITORS
End-to-end PageRank performance (20 iterations,
3.7B edges)
 Vertices each have a unique ID
 Each vertex can have properties of a user-defined type and store metadata
ARCHITECTURE
 Arrows are relations that can store metadata, known as edges; vertex IDs are of type Long
 A graph is built of two RDDs: one containing the collection of edges and one containing the collection of vertices
 Another component is the edge triplet, an object which exposes the relation between each pair of vertices and the edge between them, containing all the information for each connection
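That two-collection structure can be sketched directly; GraphX builds the same thing from two RDDs (`VertexRDD` and `EdgeRDD`) and derives the triplets by joining them:

```scala
// vertices: Long IDs mapped to a user-defined property type
// (here just a name; any case class would do)
val vertices = Map(1L -> "Ann", 2L -> "Bob", 3L -> "Eve")

// edges: source ID, destination ID, plus metadata on the relation
case class Edge(src: Long, dst: Long, rel: String)
val edges = Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows"))

// a triplet exposes both endpoint properties and the edge metadata
val triplets = edges.map(e => (vertices(e.src), e.rel, vertices(e.dst)))
assert(triplets.contains(("Ann", "follows", "Bob")))
```

Algorithms like PageRank and triangle counting are expressed as repeated joins and aggregations over exactly this triplet view.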
WHO IS USING SPARK?
 http://spark.apache.org
 Tutorials: http://ampcamp.berkeley.edu
 Spark Summit: http://spark-summit.org
 Github: https://github.com/apache/spark
 https://data-flair.training/blogs/spark-sql-tutorial/
REFERENCES