Walaa Assy
Software Developer, Giza Systems
SPARK
LIGHTNING-FAST UNIFIED ANALYTICS
ENGINE
HOW DO WE HANDLE
EVER GROWING DATA
THAT HAS BECOME BIG
DATA?
Basics of Spark
Core API
 Cluster Managers
Spark Maintenance
Libraries
 - SQL
 - Streaming
 - MLlib
 GraphX
Troubleshooting /
Future of Spark
AGENDA
Apache Spark - Architecture, Overview & Libraries
 Readability
 Expressiveness
 Fast
 Testability
 Interactive
 Fault Tolerant
 Unify Big Data
Spark officially set a new record in large-scale sorting. Spark can
compute on disk, but it gains most of its speed by caching data in
memory.
WHY SPARK? TINIER CODE LEADS TO..
 MapReduce has a very narrow scope, mainly batch processing
 Each new class of problem needed a new API to solve it
EXPLOSION OF MAP REDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction of Spark
 Spark operations fall into two main categories:
 Transformations [lazily evaluated, only storing the intent]
 Actions
 val textFile = sc.textFile("file:///spark/README.md")
 textFile.first // action
RDD [RESILIENT DISTRIBUTED DATASET]
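The transformation/action split above can be sketched with plain Scala lazy views, which behave analogously; a real RDD would come from `sc.textFile(...)` and need a running SparkContext, so this is only a sketch of the laziness:

```scala
// Transformations only record intent; nothing runs until an action.
var evaluated = 0
val lines = Seq("spark is fast", "spark is unified")

// "transformation": builds a lazy pipeline; the closure has not run yet
val words = lines.view.flatMap { l => evaluated += 1; l.split(" ") }
assert(evaluated == 0)   // intent stored, not executed

// "action": forces evaluation of the whole pipeline
val count = words.size
assert(evaluated == 2)   // both lines were processed
assert(count == 6)
```

The same thing happens with `textFile.first` on a real RDD: the file is only read when the action runs.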
HELLO BIG DATA
 sudo yum install wget
 sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz
 tar xvf scala-2.13.0-M4.tgz
 sudo mv scala-2.13.0-M4 /usr/lib
 sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala
 export PATH=$PATH:/usr/lib/scala/bin
SCALA INSTALLATION STEPS
 sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
 tar xvf spark-2.3.1-bin-hadoop2.7.tgz
 ln -s spark-2.3.1-bin-hadoop2.7 spark
 export SPARK_HOME=$HOME/spark
 export PATH=$PATH:$SPARK_HOME/bin
SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel…
 A collection similar to a list or an array from a user level
 Processed in parallel to speed up computation, with fault tolerance built in
 RDDs are immutable
 Transformations are lazy and stored in a DAG
 Actions trigger DAGs
 A DAG is a graph of tasks
 Each action triggers a fresh execution of the graph
RDD
INPUT DATASETS TYPES
Apache spark - Architecture , Overview & libraries
 map
 flatMap
 filter
 distinct
 sample
 union
 intersection
 subtract
 cartesian
Transformations return RDDs
TRANSFORMATIONS IN SPARK
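These transformations map one-to-one onto Scala collection methods, which is a convenient way to see their semantics; the equivalent hypothetical RDD calls (on an RDD built with `sc.parallelize(nums)`) are shown in comments:

```scala
val nums = Seq(1, 2, 2, 3)

val doubled = nums.map(_ * 2)                    // rdd.map(_ * 2)
val chars   = Seq("12", "34").flatMap(_.toSeq)   // rdd.flatMap(_.toSeq)
val evens   = nums.filter(_ % 2 == 0)            // rdd.filter(_ % 2 == 0)
val uniq    = nums.distinct                      // rdd.distinct()
val both    = nums ++ Seq(4)                     // rdd.union(other)
val common  = nums.intersect(Seq(2, 3, 5))       // rdd.intersection(other)
val rest    = nums.filterNot(Set(2))             // rdd.subtract(other)
val pairs   = for (a <- Seq(1, 2); b <- Seq("x")) yield (a, b) // rdd.cartesian(other)

assert(doubled == Seq(2, 4, 4, 6))
assert(uniq == Seq(1, 2, 3))
assert(pairs == Seq((1, "x"), (2, "x")))
```

Every result here is again an ordinary collection, just as every RDD transformation returns another RDD.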
 collect()
 count()
 take(num)
 takeOrdered(num)(ordering)
 reduce(function)
 aggregate(zeroValue)(seqOp, combOp)
 foreach(function)
 Actions return different types according to each action
saveAsObjectFile(path)
saveAsTextFile(path) // saves as text file
External connector
foreach(T => Unit) // one object at a time
 - foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
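The actions above also have collection analogues, sketched here in plain Scala (the commented RDD calls are what you would run on a real RDD):

```scala
val nums = Seq(5, 1, 4, 2)

val n      = nums.size                 // rdd.count()
val first3 = nums.take(3)              // rdd.take(3)
val lowest = nums.sorted.take(2)       // rdd.takeOrdered(2)
val sum    = nums.reduce(_ + _)        // rdd.reduce(_ + _)

// aggregate: seqOp folds within a partition, combOp merges partition
// results; simulated here with a single foldLeft building (sum, count)
val (s, c) = nums.foldLeft((0, 0)) { case ((acc, cnt), x) => (acc + x, cnt + 1) }

assert(n == 4 && sum == 12)
assert(lowest == Seq(1, 2))
assert((s, c) == (12, 4))
```

Unlike transformations, each of these returns a plain value to the driver rather than another RDD.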
 SQL-like joins
 - join
 - fullOuterJoin
 - leftOuterJoin
 - rightOuterJoin
 Pair saving
 - saveAs(NewAPI)HadoopFile(path, keyClass, valueClass, outputFormatClass)
 - saveAs(NewAPI)HadoopDataSet(conf)
 - saveAsSequenceFile: shorthand for saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat)
PAIR METHODS - CONTD.
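The join semantics on (key, value) pairs can be sketched with plain Scala sequences; on real pair RDDs these would be `rdd.join(other)` and `rdd.leftOuterJoin(other)`:

```scala
val left  = Seq("a" -> 1, "b" -> 2)
val right = Seq("a" -> 10, "c" -> 30)

// inner join: keep only keys present on both sides
def innerJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

// left outer join: keep every left key, Option on the right side
def leftOuterJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rm = r.toMap   // assumes unique keys on the right, for simplicity
  l.map { case (k, v) => (k, (v, rm.get(k))) }
}

assert(innerJoin(left, right) == Seq("a" -> (1, 10)))
assert(leftOuterJoin(left, right) == Seq("a" -> (1, Some(10)), "b" -> (2, None)))
```

`fullOuterJoin` and `rightOuterJoin` follow the same pattern with the Option on the other (or both) sides.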
 Works like a distributed kernel
 Spark ships with a basic built-in (standalone) cluster manager
 Hadoop's cluster manager: YARN
 Apache Mesos
PRIMARY CLUSTER MANAGER
SPARK-SUBMIT DEMO
SPARK SQL
 Spark SQL is Apache Spark's module for working with structured or semi-structured data.
 It is meant to be usable by non-big-data users
 As Spark continues to grow, we want to enable wider
audiences beyond “Big Data” engineers to leverage the power
of distributed processing.
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
 Seamlessly mix SQL queries with Spark programs
Spark SQL lets you query structured data inside Spark programs,
using either SQL or a familiar DataFrame API
 Connect to any data source the same way.
 It executes SQL queries.
 We can read data from existing Hive installation using
SparkSQL.
 When we run SQL within another programming language we
will get the result as Dataset/DataFrame.
SPARK SQL FEATURES
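Mixing SQL with ordinary Spark code looks like the sketch below. The plain-collection line shows the semantics and is what actually runs here; the commented calls are the Spark SQL equivalents and assume a hypothetical SparkSession named `spark`:

```scala
// With a SparkSession, the same query would be:
//   val df = spark.createDataFrame(people)
//   df.createOrReplaceTempView("people")
//   spark.sql("SELECT name FROM people WHERE age > 21").collect()
// or, with the DataFrame API:
//   df.filter($"age" > 21).select($"name")
case class Person(name: String, age: Int)
val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Eve", 45))

val names = people.filter(_.age > 21).map(_.name)
assert(names == Seq("Ann", "Eve"))
```

Either form (SQL string or DataFrame calls) produces the same Dataset/DataFrame result.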
DataFrames and SQL provide a common way to access a variety
of data sources, including Hive, Avro, Parquet, ORC, JSON, and
JDBC. You can even join data across these sources.
 Run SQL or HiveQL queries on existing warehouses.[Hive
Integration]
 Connect through JDBC or ODBC.[Standard Connectivity]
 It is included with Spark
DATAFRAMES
 Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python.
We can create DataFrame using:
 Structured data files
 Tables in Hive
 External databases
 Using existing RDD
SPARK DATAFRAME IS
DataFrames = schema RDDs
EXAMPLES
SPARK SQL COMPETITION
 Hive
 Parquet
 JSON
 Avro
 Amazon Redshift
 CSV
 Others
Spark SQL is recommended as a starting point for any Spark application, as it adds:
 Predicate push-down
 Column pruning
 The ability to mix SQL & RDDs
SPARK SQL DATA SOURCES
SPARK STREAMING
 Big & fast data
 Gigabytes per second
 Real time fraud detection
 Marketing
 Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
SPARK STREAMING
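Spark Streaming splits a live stream into micro-batches and applies the same RDD operations to each batch. A sketch with a Seq of batches standing in for a DStream (a real one would come from a StreamingContext, e.g. `ssc.socketTextStream(...)`, which needs a running cluster):

```scala
val batches = Seq(
  Seq("error: disk", "ok"),          // micro-batch at t0
  Seq("ok", "error: net", "ok")      // micro-batch at t1
)

// analogous to dstream.filter(_.startsWith("error")).count() per batch
val errorsPerBatch = batches.map(_.count(_.startsWith("error")))
assert(errorsPerBatch == Seq(1, 1))
```

The key point: the per-batch logic is ordinary Spark code, which is what makes streaming applications easy to write and test.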
SPARK STREAMING COMPETITORS
Streaming data sources:
• Kafka
• Flume
• Twitter
• Hadoop HDFS
• Others (live logs, system telemetry data, IoT device data, etc.)
SPARK MLLIB
 MLlib is a standard component of Spark providing machine
learning primitives on top of Spark.
SPARK MLLIB
 MATLAB
 R
EASY TO USE BUT NOT SCALABLE
 MAHOUT
 GRAPHLAB
Scalable, but at the cost of ease of use
 org.apache.spark.mllib
RDD-based algorithms
 org.apache.spark.ml
 Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
 Loading the data
 Extracting features
 Training the data
 Testing the data
 The new pipeline API allows tuning, testing, and early failure detection
MACHINE LEARNING FLOW
 Algorithms
Classification, e.g. naïve Bayes
Regression
Linear
Logistic
Collaborative filtering by ALS (alternating least squares)
Clustering by k-means
Dimensionality reduction by SVD (singular value decomposition)
 Feature extraction and transformations
TF-IDF: term frequency - inverse document frequency
ALGORITHMS IN MLLIB
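TF-IDF is simple enough to compute by hand, which shows what MLlib is doing conceptually: tf(t, d) * log(N / df(t)). A minimal sketch on a toy corpus (MLlib's `HashingTF` and `IDF` do this over RDDs of term-frequency vectors, with a slightly smoothed idf formula):

```scala
val docs = Seq(
  Seq("spark", "fast"),
  Seq("spark", "sql")
)
val n = docs.size.toDouble

// term frequency: share of the document made up of this term
def tf(term: String, doc: Seq[String]): Double =
  doc.count(_ == term).toDouble / doc.size

// inverse document frequency: rare terms score higher
def idf(term: String): Double =
  math.log(n / docs.count(_.contains(term)))

// "spark" appears in every document, so its idf (and tf-idf) is 0
assert(idf("spark") == 0.0)
// "sql" is rarer, so it gets a positive tf-idf in document 1
assert(tf("sql", docs(1)) * idf("sql") > 0.0)
```

This is why TF-IDF surfaces distinctive words: terms common to the whole corpus are weighted down to zero.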
 Spam filtering
 Fraud detection
 Recommendation analysis
 Speech recognition
PRACTICAL USE
 Word2Vec algorithm
 This algorithm takes an input text and outputs a set of vectors representing a dictionary of words [to see word similarity]
 We cache the RDDs because MLlib makes multiple passes over the same data, so the memory cache can reduce processing time a lot
 Breeze is the numerical processing library used inside Spark
 It can perform mathematical operations on vectors
MLLIB DEMO
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel
computation.
 Page ranking
 Producing evaluations
 It can be used in genetic analysis
 ALGORITHMS
 PageRank
 Connected components
 Label propagation
 SVD++
 Strongly connected components
 Triangle count
GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITORS
End-to-end PageRank performance (20 iterations,
3.7B edges)
 Vertices each have a unique ID
 Each vertex can have properties of a user-defined type and store metadata
ARCHITECTURE
 Arrows are relations that can store metadata, known as edges; vertex IDs are of type Long
 A graph is built of two RDDs: one containing the collection of edges and one containing the collection of vertices
 Another component is the edge triplet, an object which exposes the relation between each pair of vertices and the edge between them, containing all the information for each connection
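That two-collection structure can be sketched directly; GraphX builds the same thing from two RDDs (`VertexRDD` and `EdgeRDD`) and derives the triplets by joining them:

```scala
// vertices: Long IDs mapped to a user-defined property type
// (here just a name; any case class would do)
val vertices = Map(1L -> "Ann", 2L -> "Bob", 3L -> "Eve")

// edges: source ID, destination ID, plus metadata on the relation
case class Edge(src: Long, dst: Long, rel: String)
val edges = Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows"))

// a triplet exposes both endpoint properties and the edge metadata
val triplets = edges.map(e => (vertices(e.src), e.rel, vertices(e.dst)))
assert(triplets.contains(("Ann", "follows", "Bob")))
```

Algorithms like PageRank and triangle counting are expressed as repeated joins and aggregations over exactly this triplet view.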
WHO IS USING SPARK?
 http://spark.apache.org
 Tutorials: http://ampcamp.berkeley.edu
 Spark Summit: http://spark-summit.org
 Github: https://github.com/apache/spark
 https://data-flair.training/blogs/spark-sql-tutorial/
REFERENCES