Getting Started
with
Apache Spark
Presented By
Manish Mishra
Pradyuman Pratap Singh
Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
• Punctuality
Join the session 5 minutes before the start time. We start on time and conclude on time!
• Feedback
Submit constructive feedback for every session; it is very helpful for the presenter.
• Silent Mode
Keep your mobile devices in silent mode, and feel free to step out of the session if you need to take an urgent call.
• Avoid Disturbance
Avoid unnecessary chit-chat during the session.
1. Introduction to Big Data and Apache Spark
  • What is Big Data?
  • What is Apache Spark?
  • Features of Apache Spark
2. Overview of Spark Architecture
3. Spark Components
4. Spark Basics & Programming Model
  • Spark Context
  • Spark Session
  • RDD
  • DataFrame
  • RDD v/s DataFrame
5. Advantages of Apache Spark
6. Disadvantages of Apache Spark
7. Demo
Getting Started with Apache Spark (Scala)
What is Big Data?
Big Data refers to data sets so large, fast-moving, and complex that traditional data-processing systems cannot handle them. It spans a wide variety of data types from many sources.
It is characterized by the 5 Vs:
• Volume: Massive amounts of data.
• Velocity: The speed at which data is generated and processed.
• Variety: Different types of data (structured, semi-structured, unstructured).
• Veracity: The quality and accuracy of the data.
• Value: The usefulness of what the data provides.
What is Apache Spark?
• Apache Spark is an open-source analytics engine for large-scale distributed data processing and machine learning applications. It handles both batch and real-time analytics and data processing workloads.
• It grew out of the Hadoop MapReduce model and extends that model to efficiently support more types of computation, including interactive queries and stream processing. Spark does not need Hadoop MapReduce in order to run.
• The main feature of Spark is in-memory computing: intermediate data can stay in memory between steps instead of being written to disk, which increases the processing speed of an application.
Features of Apache Spark
• In-Memory Computation
• Speed
• Different Cluster Managers
• Distributed Processing
• Fault Tolerance
• Lazy Evaluation
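Two of the features above, lazy evaluation and in-memory computation, can be observed directly. The sketch below is illustrative only: it assumes Spark 3.x (the spark-sql artifact) on the classpath and local mode, and the object and method names are made up for this example. It counts how often a map function runs before and after the first action.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalDemo {
  // Returns how many times the map function had run before and after the first action.
  def mapCallsAroundAction(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")

    // Transformation: only recorded in the lineage, nothing executes yet.
    // cache() asks Spark to keep the computed partitions in memory once they exist.
    val doubled = sc.parallelize(1 to 5).map { n => calls.add(1); n * 2 }.cache()
    val before = calls.sum // 0, because no action has run

    doubled.count()        // action: triggers the job
    val after = calls.sum  // 5 in a failure-free local run: once per element

    // A second action can normally be served from the cached partitions,
    // so the map function is usually not re-run here.
    doubled.collect()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode on all cores, no cluster manager.
    val spark = SparkSession.builder().appName("lazy-eval-demo").master("local[*]").getOrCreate()
    println(mapCallsAroundAction(spark))
    spark.stop()
  }
}
```

The accumulator is used here only as a probe; updates made inside transformations can be applied more than once if tasks are retried, so it should not be relied on for exact counts in production jobs.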
02
Apache Spark Architecture
03
Spark Components
Spark Engine:
• Spark Core
Libraries:
• Spark SQL
• Spark Streaming (real-time processing)
• MLlib (machine learning)
• GraphX (graph processing)
Supported Languages:
• Scala
• Java
• Python
• R
04
Spark Basics
1. Spark Context: SparkContext is the entry point to core Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside executors on the worker nodes.
2. Spark Session: SparkSession, introduced in Spark 2.0, is the unified entry point for Spark applications. It gives access to all of Spark's underlying functionality, including RDDs, DataFrames, and Datasets, through a single interface for structured data processing.
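As a minimal sketch of the two entry points, the snippet below builds a SparkSession in the driver and reads the SparkContext from it. It assumes Spark 3.x on the classpath and local mode; the application name and the helper `localSession` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SessionDemo {
  // Build (or reuse) a session that runs Spark locally on all available cores.
  def localSession(appName: String): SparkSession =
    SparkSession.builder()
      .appName(appName)
      .master("local[*]") // assumption: local mode instead of a real cluster manager
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession("session-demo")
    val sc = spark.sparkContext // the SparkContext is owned by the session
    println(s"app = ${sc.appName}, master = ${sc.master}")
    spark.stop()
  }
}
```

On a cluster, the master is usually supplied by `spark-submit` rather than hard-coded, so the `.master(...)` call would be dropped.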
RDD
• A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD Operations:
o Transformations: lazily define a new RDD from an existing one (for example, map or filter).
o Actions: trigger the computation and return a result to the driver (for example, collect or count).
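The two kinds of RDD operations can be sketched as follows. This is an illustrative example rather than the presenters' demo code; it assumes Spark 3.x in local mode, and `RddDemo` and `evenSquares` are hypothetical names.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddDemo {
  // Create an RDD by parallelizing a driver-side collection into 3 partitions,
  // then chain two transformations. Nothing is computed at this point.
  def evenSquares(sc: SparkContext): RDD[Int] = {
    val numbers = sc.parallelize(1 to 10, numSlices = 3)
    numbers.filter(_ % 2 == 0).map(n => n * n)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
    val rdd = evenSquares(spark.sparkContext)

    // Actions trigger the computation and return results to the driver.
    println(rdd.collect().mkString(", ")) // 4, 16, 36, 64, 100
    println(rdd.reduce(_ + _))            // 220

    // The second way to create an RDD is to reference external storage, e.g.
    // sc.textFile(path) with a path to a local, HDFS, or other supported file.
    spark.stop()
  }
}
```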
Dataframe
• In Spark, a DataFrame is a distributed collection of data organized into rows and named columns. Each column has a name and an associated type. DataFrames resemble traditional database tables: structured and concise.
• Conceptually, a DataFrame is equivalent to a table in a relational database, but with richer optimizations under the hood.
• Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
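One of the sources listed above, an existing RDD, is enough to show the named, typed columns of a DataFrame. The sketch below assumes Spark 3.x in local mode; the `Employee` record and its sample values are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameDemo {
  // Hypothetical record type used only for this illustration.
  final case class Employee(name: String, dept: String, salary: Double)

  // Build a DataFrame from an existing RDD of case-class instances.
  def employees(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables .toDF() on RDDs and local collections
    val rdd = spark.sparkContext.parallelize(Seq(
      Employee("Asha", "eng", 120.0),
      Employee("Ben", "eng", 100.0),
      Employee("Chen", "ops", 90.0)))
    rdd.toDF() // column names and types are derived from the case class
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
    val df = employees(spark)
    df.printSchema() // shows each column's name and type
    df.filter(df("salary") > 95.0).select("name", "dept").show()
    spark.stop()
  }
}
```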
RDD v/s Dataframe
Feature: Data format
  RDD: Structured and unstructured data.
  DataFrame: Structured and semi-structured data.
Feature: APIs
  RDD: Provides a low-level API that requires more code to perform transformations and actions on data.
  DataFrame: Provides a high-level API that makes it easier to perform transformations and actions on data.
Feature: Schema enforcement
  RDD: Has no explicit schema, and is often used for unstructured data.
  DataFrame: Enforces a schema at runtime. Has an explicit schema that describes the data and its types.
Feature: Optimization
  RDD: No built-in optimization engine is available.
  DataFrame: Uses the Catalyst optimizer.
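To make the comparison concrete, here is the same per-key average written against both APIs. It is a sketch under stated assumptions (Spark 3.x, local mode); the sample data, column names, and helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object RddVsDataFrame {
  // Made-up sample data: (region, amount).
  val sales: Seq[(String, Double)] = Seq(("north", 10.0), ("south", 30.0), ("north", 20.0))

  // RDD API: we spell out how the average is computed, step by step.
  def rddAverages(spark: SparkSession): Map[String, Double] =
    spark.sparkContext.parallelize(sales)
      .mapValues(v => (v, 1))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }
      .collect()
      .toMap

  // DataFrame API: we state what we want and leave the planning to Catalyst.
  def dfAverages(spark: SparkSession): DataFrame = {
    import spark.implicits._
    sales.toDF("region", "amount")
      .groupBy("region")
      .agg(avg("amount").as("avg_amount"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df").master("local[*]").getOrCreate()
    println(rddAverages(spark))
    val df = dfAverages(spark)
    df.show()
    df.explain() // prints the physical plan produced by the optimizer
    spark.stop()
  }
}
```

The RDD version fixes the execution strategy in user code, while the DataFrame version gives the optimizer a declarative query to plan; `explain()` is a convenient way to inspect what it chose.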
05
Advantages of Apache Spark
• In-Memory Computation
• Speed
• Ease of Use
• Advanced Analytics
• Fault Tolerance
• Multi-Language Support
06
Disadvantages of Apache Spark
• Small Files Issue
• No File Management System of its own (Spark relies on external storage)
• No fully automatic optimization process (jobs often need manual tuning)
• Fewer Algorithms (in MLlib)
07