A SEMINAR PRESENTATION ON
“Introduction To Apache Spark”
SUBMITTED TO: Mrs. Suman Singh (HOD of CSE Dept.)
SUBMITTED BY: Nikita Vijay, B. Tech – VIII Sem (CSE)
Agenda
● Need for a new generation of distributed systems
● Hardware/software evolution over the last decade
● Apache Spark
○ Components of Apache Spark
● Why Spark?
● Who is using Spark?
Why do we need a new generation?
● A lot has changed since 2000
● Both hardware and software have gone through changes
● Big data has become a necessity
● Let’s look at what changed over the decade
State of hardware in 2000
● Disk was cheap, so disk was the primary source of data
● Network was costly, so data locality mattered
● RAM was very costly
● Single-core machines were dominant
State of hardware now
● RAM is king
● RAM is the primary source of data, and disk is used as a fallback
● Networks are faster
● Multi-core machines are commonplace
Software in 2000
● Object orientation was king
● Software was optimized for a single core
● No open frameworks for creating
○ Distributed storage
○ Distributed processing
● SQL was the only dominant way to analyze data
Software now
● Functional programming is on the rise
● Software needs to exploit multiple cores on a single node
● There are good frameworks for building distributed systems
○ HDFS for storage
● NoSQL is now a real alternative
Big data processing needs in 2000
● Very few companies had a big data problem
● Batch processing systems ruled the world
● Volume was a bigger concern than velocity
● Mostly used for
○ Search
○ Log analysis
Big data processing needs now
● Companies across the board use big data
● Velocity is as much a concern as volume
● Real-time needs are as important as batch processing
Apache Spark
● A fast and general engine for large-scale data processing
● Created at UC Berkeley's AMPLab
● Written in Scala
● Licensed under the Apache License
Components of Apache Spark
● Spark SQL
● Spark Streaming
● MLlib
● GraphX
Benefits of a Unified Platform
● No copying of data between systems
● Combine processing types in one program
● Code reuse
● One system to learn
● One system to maintain
History of Apache Spark
● Mesos, a distributed systems framework, started as a class project at UC Berkeley in 2009
● Spark was built to test how Mesos works
● Focused on
○ Iterative programs (ML)
○ Unifying real-time and batch processing
● Open sourced in 2010
Runs everywhere
● You can run Spark on top of a range of distributed systems
● It can run on
○ YARN
○ Apache Mesos
○ Its own standalone cluster
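
As a rough illustration of the cluster options above, the choice of cluster manager is usually expressed through the master URL given when an application is launched. The sketch below builds such URLs in plain Python. The helper name `master_url` and the example host name are hypothetical, the `local` option is an addition not mentioned on the slide, and the URL forms and default ports should be checked against the Spark documentation for the version in use.

```python
def master_url(manager, host=None, port=None):
    """Return a Spark master URL string for the given cluster manager.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    if manager == "yarn":
        # YARN finds the cluster through the Hadoop configuration, so no host is given.
        return "yarn"
    if manager == "mesos":
        # Assumed default Mesos master port: 5050.
        return f"mesos://{host}:{port or 5050}"
    if manager == "standalone":
        # Assumed default port of Spark's own standalone master: 7077.
        return f"spark://{host}:{port or 7077}"
    if manager == "local":
        # Single machine, one worker thread per core (not on the slide).
        return "local[*]"
    raise ValueError(f"unknown cluster manager: {manager}")


# Example with a made-up host name.
print(master_url("standalone", host="master.example.com"))
```

The application code itself stays the same across these options; only the master URL changes.
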
Small and Simple
● Apache Spark is highly modular
● The original version contained only 1600 lines of Scala code
● The Apache Spark API is extremely simple compared to the Java API of MapReduce (M/R)
● The API is concise and consistent
Source: http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-
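
To give a feel for the conciseness claimed on the "Small and Simple" slide, here is the classic word count written as a short functional pipeline. This is a plain-Python sketch using only the standard library, not Spark code. The comments name the RDD operations (`flatMap`, `map`, `reduceByKey`) that play the corresponding roles in Spark's API; the helper `reduce_by_key` and the sample lines are made up for illustration.

```python
from itertools import chain


def reduce_by_key(pairs, combine):
    """Merge values that share a key, like a single-machine reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    # flatMap: split each line into words and flatten into one stream
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair every word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts of each distinct word
    return reduce_by_key(pairs, lambda a, b: a + b)


sample = ["spark is fast", "spark is simple"]
counts = word_count(sample)
print(counts)
```

In Spark the same three steps are chained on a distributed dataset, which is where the comparison with the much longer Java MapReduce version comes from.
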
In-memory aka Speed
● In Spark, you can cache HDFS data in the main memory of worker nodes
● Spark analysis can be executed directly on in-memory data
● Shuffling can also be done in memory
● Fault tolerant
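
The caching and fault-tolerance points can be pictured with a toy model. The `ToyDataset` class below is hypothetical and runs on one machine with the standard library only; it is not Spark's RDD implementation. It assumes the approach generally described for Spark, where a dataset remembers how it was derived (its lineage), so an in-memory copy that is lost can be recomputed. Unlike Spark, where caching is requested explicitly, this toy version caches every dataset on first use to keep the code short.

```python
class ToyDataset:
    """Hypothetical single-machine stand-in for a cached, recomputable dataset."""

    def __init__(self, compute):
        self._compute = compute      # lineage: a function that rebuilds the data
        self._cached = None          # in-memory copy, filled in on first use
        self.recomputations = 0      # how many times the lineage was replayed

    def map(self, fn):
        # A derived dataset records only its parent and the transformation.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        if self._cached is None:
            self._cached = self._compute()
            self.recomputations += 1
        return self._cached

    def evict(self):
        # Simulates losing the in-memory copy, e.g. when a worker fails.
        self._cached = None


def slow_source():
    # Stands in for an expensive read from disk or HDFS.
    return [1, 2, 3, 4]


base = ToyDataset(slow_source)
squares = base.map(lambda x: x * x)

first = squares.collect()    # computed from the source and kept in memory
second = squares.collect()   # served from memory, no recomputation
squares.evict()
third = squares.collect()    # rebuilt by replaying the lineage
```

The second call is the fast path the slide refers to; the third shows that losing the cached copy costs time but not data.
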
Integration with Hadoop
● No separate storage layer
● Integrates well with HDFS
● Can run on Hadoop 1.0 and Hadoop 2.0 (YARN)
● Excellent integration with ecosystem projects such as Apache Hive and HBase
Multi-language API
● Written in Scala, but the API is not limited to it
● Offers APIs in
○ Scala
○ Java
○ Python
● You can also run SQL queries using Spark SQL
Who is using Spark?
