View Apache Spark and Scala
course details at www.edureka.co/apache-spark-scala-training
Apache Spark | Spark SQL
Slide 2
Objectives
At the end of this module, you will be able to explain:
• What Spark is
• The Spark architecture
• What an RDD is
• How to create an RDD and run a sample example (demo)
• Spark SQL
Slide 3
What is Spark?
Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it
easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.
• Developed at UC Berkeley
• Written in Scala, a functional programming language that runs on the JVM
• Generalizes the MapReduce framework
Slide 4
Why Spark?
• Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Ease of Use: Supports different languages for developing applications using Spark.
• Generality: Combine SQL, streaming, and complex analytics into one platform.
• Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Slide 5
MapReduce Limitations
• MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms (machine learning etc.)
• To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence
• Each of those jobs is high-latency, and none can start until the previous job has finished completely
• The output data of each step has to be stored in the distributed file system (such as HDFS) before the next step can begin
• Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)
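The hand-off between chained jobs can be sketched in plain Python. This is an illustration only, not Hadoop code: the two functions, the file name `job1_output.json`, and the sample lines are hypothetical, and a temporary file stands in for the storage that sits between jobs.

```python
import json
import os
import tempfile
from collections import Counter


def count_words_job(lines, out_path):
    """First job: count words and write the full result to storage."""
    counts = Counter(word for line in lines for word in line.split())
    with open(out_path, "w") as f:
        json.dump(counts, f)


def frequent_words_job(in_path, min_count):
    """Second job: can only start once the first job's output file exists."""
    with open(in_path) as f:
        counts = json.load(f)
    return sorted(word for word, n in counts.items() if n >= min_count)


lines = ["spark and hadoop", "spark sql", "hadoop map reduce"]

with tempfile.TemporaryDirectory() as tmp:
    intermediate = os.path.join(tmp, "job1_output.json")
    count_words_job(lines, intermediate)                        # pass 1 writes everything out
    frequent = frequent_words_job(intermediate, min_count=2)    # pass 2 reads it back in

print(frequent)
```

Every extra pass adds another write and another read, which is the cost that in-memory processing tries to avoid.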
Slide 6
Spark Features
• Spark takes MapReduce to the next level with less expensive shuffles in the data processing and capabilities like in-memory data storage
• Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing
• It is designed to be an execution engine that works both in memory and on disk
• Lazy evaluation of big data queries helps with the optimization of the overall data processing workflow
• Provides concise and consistent APIs in Scala, Java and Python
• Offers an interactive shell for Scala and Python. This is not available in Java yet
• Spark supports high-level APIs to develop applications (Scala, Java, Python, Clojure, R)
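Lazy evaluation can be illustrated with a toy class in plain Python. This is a sketch under stated assumptions, not Spark's API: `LazyDataset`, its methods, and the `square` helper are hypothetical names. Transformations only record a step; nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable holding the input data
        self._steps = steps     # recorded transformations, not yet executed

    def map(self, fn):
        # Record the step and return a new dataset; the input is untouched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: run all recorded steps over the source data."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)   # remember each call so laziness is observable
    return x * x


plan = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)   # still 0: only a plan has been built
result = plan.collect()            # the action triggers the actual work
print(calls_before_action, result)
```

Because the whole plan is known before anything runs, an engine has the chance to optimize it as a unit, which is the benefit the slide attributes to lazy evaluation.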
Slide 7
Spark Architecture
[Diagram: Spark Core at the base, with Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, and SparkR built on top of it]
Slide 8
Spark Architecture
[Diagram: Spark Core with Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, and SparkR on top, running over two lower layers]
• Cluster management (native Spark cluster, YARN, Mesos)
• Distributed storage (HDFS, Cassandra, S3, HBase)
Slide 9
Spark Advantages
• EASE OF DEVELOPMENT: easier APIs; Python, Scala, Java
• IN-MEMORY PERFORMANCE: RDDs; DAGs unify processing
• COMBINE WORKFLOWS: Shark, ML, Streaming, GraphX
Slide 10
Hadoop Advantages
• UNLIMITED SCALE: multiple data sources, multiple applications, multiple users
• WIDE RANGE OF APPLICATIONS: files, databases, semi-structured data
• ENTERPRISE PLATFORM: reliability, multi-tenancy, security
Slide 11
Spark + Hadoop
• Hadoop: unlimited scale, wide range of applications, enterprise platform
• Spark: ease of development, combined workflows, in-memory performance
• Together: operational applications augmented by in-memory performance
Slide 12
Resilient Distributed Datasets
RDD (Resilient Distributed Dataset)
• Resilient: if data in memory is lost, it can be recreated
• Distributed: stored in memory across the cluster
• Dataset: initial data can come from a file or be created programmatically
RDDs are the fundamental unit of data in Spark
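The "resilient" property can be modelled with a toy class in plain Python. This is a sketch, not how Spark is implemented: `ToyRDD` and its methods are hypothetical names, and a small list of lists stands in for data spread across a cluster. Each dataset remembers how to rebuild any partition from its parent, so a lost in-memory partition is recomputed on demand.

```python
class ToyRDD:
    """Toy model of recovery by recomputation (not the Spark API)."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition   # how to rebuild partition i
        self._cache = {}                    # partitions currently held in memory
        self.num_partitions = num_partitions

    def map(self, fn):
        parent = self
        # The child records its parent and the function, not the data itself.
        return ToyRDD(lambda i: [fn(x) for x in parent.partition(i)],
                      self.num_partitions)

    def partition(self, i):
        if i not in self._cache:            # missing or lost: recompute it
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def is_cached(self, i):
        return i in self._cache

    def lose_partition(self, i):
        """Simulate losing one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = [[1, 2], [3, 4], [5, 6]]           # three partitions of input data
base = ToyRDD(lambda i: list(source[i]), num_partitions=3)
doubled = base.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose_partition(1)
lost = not doubled.is_cached(1)             # the partition is really gone
after = doubled.collect()                   # rebuilt from the parent on demand
print(before == after)
```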
Slide 13
Resilient Distributed Datasets
• The core concept of the Spark framework
• RDDs can store any type of data
   » Primitive types: integers, characters, booleans, etc.
   » Files: text files, SequenceFiles, etc.
• RDDs are fault tolerant
• RDDs are immutable
Slide 14
Resilient Distributed Datasets
RDDs support two types of operations:
• Transformation: a transformation does not return a single value; it returns a new RDD.
  Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
• Action: an action evaluates the RDD and returns a value.
  Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
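The difference between the two kinds of operation can be shown with plain Python collections. These are rough analogues only, not Spark calls: the `reduce_by_key` helper and the sample lines are hypothetical, and ordinary lists stand in for RDDs.

```python
from functools import reduce
from itertools import chain, islice

lines = ["spark sql", "spark core", "hadoop"]

# Transformation-style steps: each produces a new collection and leaves
# its input unchanged.
words = list(chain.from_iterable(line.split() for line in lines))   # like flatMap
pairs = [(w, 1) for w in words]                                      # like map
long_words = [w for w in words if len(w) > 4]                        # like filter


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, similar in spirit to reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


counts = reduce_by_key(pairs, lambda a, b: a + b)

# Action-style steps: each returns a plain value rather than a new dataset.
total = reduce(lambda a, b: a + b, (n for _, n in pairs))   # like reduce
n_words = len(words)                                         # like count
first_two = list(islice(words, 2))                           # like take(2)

print(counts, total, n_words, first_two)
```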
Slide 15
Spark SQL
[Diagram: Spark SQL layered on Spark Core]
• Spark SQL allows relational queries through Spark
• The backbone for all these operations is SchemaRDD
• SchemaRDDs are made of row objects along with metadata that describes their schema
• SchemaRDDs are equivalent to RDBMS tables
• They can be constructed from existing RDDs, JSON data sets, Parquet files, or HiveQL queries against data stored in Apache Hive
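The idea of "rows plus schema metadata" can be sketched in plain Python. This is an illustration, not the SchemaRDD API: the `schema` layout, the `validate` and `select` helpers, and the sample rows are all assumptions made for the example.

```python
# Schema metadata: column names and types, kept separately from the rows.
schema = [("id", int), ("name", str), ("score", float)]
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]


def validate(rows, schema):
    """Check that every row matches the column count and types of the schema."""
    return all(
        len(row) == len(schema)
        and all(isinstance(value, col_type) for value, (_, col_type) in zip(row, schema))
        for row in rows
    )


def select(rows, schema, columns, where=lambda record: True):
    """Tiny relational query: project `columns` from rows that satisfy `where`."""
    names = [name for name, _ in schema]
    indexes = [names.index(column) for column in columns]
    result = []
    for row in rows:
        record = dict(zip(names, row))   # the schema gives each value a name
        if where(record):
            result.append(tuple(row[i] for i in indexes))
    return result


# Roughly: SELECT name FROM rows WHERE score > 4.0
high_scorers = select(rows, schema, ["name"], where=lambda r: r["score"] > 4.0)
print(validate(rows, schema), high_scorers)
```

Because the column names and types travel with the data, a query can refer to `score` by name, which is what makes the table-like view possible.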
Slide 16
Spark SQL
Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with
integrated APIs in Scala and Java
• The Shark project is now completely closed. Earlier it was Shark, but now we will use Spark SQL
[Diagram: Shark branching into Spark SQL and Hive on Spark]
• Shark: development ending; transitioning to Spark SQL
• Spark SQL: a new SQL engine designed from the ground up for Spark
• Hive on Spark: helps existing Hive users migrate to Spark
Slide 17
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead.
Instead, Spark SQL employs column-oriented storage using arrays of primitive types.

Column Storage
1      2      3
john   mike   sally
4.1    3.5    6.4

Row Storage
1   john    4.1
2   mike    3.5
3   sally   6.4
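The two layouts can be compared in plain Python using the slide's three records. This is a sketch of the general idea, not Spark SQL's internal format: Python's standard `array` module stands in for arrays of primitive types, and the variable names are hypothetical.

```python
from array import array

# Row storage: one object per record, each holding all of its fields.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one compact array per field.
ids = array("q", (row[0] for row in rows))      # signed 64-bit integers
scores = array("d", (row[2] for row in rows))   # 64-bit floats
names = [row[1] for row in rows]                # strings kept in a plain list

# A query over one column only needs to scan that column's array.
average_score = sum(scores) / len(scores)

# A full record can still be reassembled by position.
second_record = (ids[1], names[1], scores[1])

print(average_score, second_record)
```

The numeric columns are stored as raw values in contiguous memory rather than as one object per value, which is the overhead the slide refers to.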
Slide 18
Demo on Spark RDDs
Slide 19
Course Features
• Live online class
• Class recording in LMS
• 24/7 post-class support
• Module-wise quiz
• Project work
• Verifiable certificate
Slide 20
Questions
Slide 21
Course Topics
• Module 1: Introduction to Scala
• Module 2: Scala Essentials
• Module 3: Traits and OOPs in Scala
• Module 4: Functional Programming in Scala
• Module 5: Introduction to Big Data and Spark
• Module 6: Spark Baby Steps
• Module 7: Playing with RDDs
• Module 8: Spark with SQL: When Spark Meets Hive

Big Data Processing With Spark
