Apache Spark vs Apache Flink
Two contemporary general-purpose data processing platforms.
AKASH SIHAG
Ph. No. : +91-7737111579
asihag70@gmail.com akash.sihag@infoobjects.com
Introduction
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
Apache Flink
Apache Flink is an open source platform for distributed stream and batch data processing.
The Inception
Apache Spark
● 2009 : Started at UC Berkeley's AMPLab by Matei Zaharia.
● 2010 : Open-sourced under the BSD license.
● 2013 : Donated to the Apache Software Foundation and relicensed under Apache 2.0.
● 2014 : Became a top-level Apache project.
Apache Flink
● 2010 : Started as a collaboration of Technical University Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
● 2014 : Entered the Apache Incubator.
● 2014 (Dec) : Became a top-level Apache project.
Overview Spark
Components:
● Spark SQL : For SQL and structured data processing.
● Spark Streaming : For processing live streaming data.
● MLlib : Machine learning algorithms.
● GraphX : Graph-based processing.
● Spark Core : The base processing engine in Spark. It works on the concept of RDDs, and all the other APIs reside on top of it.
Deploy:
● Standalone : Included with Spark.
● Apache Mesos
● Apache Hadoop YARN : Hadoop 2 resource manager.
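To make the RDD idea behind Spark Core more concrete, here is a minimal pure-Python sketch of a dataset that only records its transformations and computes nothing until an action is called. It is a toy under stated assumptions, not Spark's API: the class name `LazyDataset`, the helper `traced_square`, and the single-machine, single-partition execution are all hypothetical simplifications.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, plan=()):
        self._source = source   # any iterable of input records
        self._plan = plan       # recorded steps, i.e. the lineage

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    # Action: replays the recorded plan over the source and returns a result.
    def collect(self):
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []  # records every input that traced_square actually sees


def traced_square(x):
    calls.append(x)
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: only the plan has been built
result = squares.collect()        # the action triggers the computation
```

Because the plan is kept alongside the source, a lost result can in principle be recomputed by replaying the same steps, which is the intuition behind the "resilient" part of the name.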
Overview Flink
Components:
● DataStream API : For unbounded streams.
● DataSet API : For batch processing.
● Table API : For SQL-like operations.
● CEP : Complex event processing API.
● ML Library : For machine learning algorithms.
● Gelly : Graph processing API.
Deploy:
● Standalone : Included with Flink.
● Local : Single JVM.
● Apache Hadoop YARN : Hadoop 2 resource manager.
Deep Dive
Computing Paradigm
Apache Spark
● Works on the abstraction of RDDs, i.e. Resilient Distributed Datasets.
● Supports in-memory computation.
● Lazy evaluation (transformation-action): transformations are only recorded, and nothing runs until an action is called.
● A DAG is generated for every Spark job.
● Streams are processed as a series of small batches (micro-batches).
Apache Flink
● Works on the abstraction of cyclic data flows.
● Supports in-memory computation.
● Lazy evaluation (iterative transformation).
● Job graphs are generated.
● Batches are processed as streams: a batch is treated as a bounded stream.
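The difference between the two streaming models can be sketched in plain Python. This is an illustration only and uses neither engine's API. The function names are hypothetical, and the micro-batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Spark-style view: cut the stream into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def running_sum_micro_batch(stream, batch_size):
    """Emit one updated total per batch."""
    total = 0
    for batch in micro_batches(stream, batch_size):
        total += sum(batch)
        yield total


def running_sum_per_record(stream):
    """Flink-style view: update and emit as each record arrives."""
    total = 0
    for record in stream:
        total += record
        yield total


events = [3, 1, 4, 1, 5]  # a bounded input, i.e. a "batch" read as a stream
batch_results = list(running_sum_micro_batch(events, batch_size=2))
record_results = list(running_sum_per_record(events))
# batch_results  -> [4, 9, 14]
# record_results -> [3, 4, 8, 9, 14]
```

Both versions reach the same final total. The micro-batch version publishes fewer, later updates, while the per-record version handles the bounded list exactly as it would handle an unbounded stream.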
Similarities
● Both are data processing platforms.
● Both offer similar collection-style APIs.
● Both leverage frameworks such as Akka and YARN.
● Because the APIs are similar, porting code from one to the other takes little effort.
● Both provide stream and batch processing.
● Both are fault-tolerant.
● Both offer APIs in Java and Scala.
Apache Spark
● Near-real-time stream processing.
● Combining batch and streaming transformations is possible.
● Limited window-based operations.
● Catalyst optimizer for SQL operations.
● Stateful operations up to v1.5 are not very efficient.
Note: Stateful operations are drastically improved in Spark 1.6.
● Support for structured data sources is mature.
Ex: A HiveContext can be created directly via Spark SQL.
● More committers and third-party APIs.
● Spark uses Java heap memory allocation for cached data.
Note: From Spark 1.5, Spark started implementing off-heap memory allocation (Tungsten).
● ML algorithms are implemented via the DAG.
VS
Apache Flink
● Real-time stream processing.
● Combining batch and stream operations is not possible, so operating on historic data together with live streams is not well supported.
● Various flavours of window-based operations, based on triggers, record counts and events.
● Optimizer for streams as well as batches.
● Efficient stateful stream operations.
● Structured data support is not yet mature and is still limited to the Hadoop InputFormat API.
● Relatively new ecosystem.
● Flink has implemented custom memory management since its inception.
● ML algorithms are implemented in native style.
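To illustrate what windows based on record counts and on events mean, here is a small pure-Python sketch of two tumbling-window flavours. It is not Flink's API: the function names and the sample `readings` are hypothetical, the whole input is grouped at once rather than continuously, and triggers and watermarks are left out.

```python
from collections import defaultdict


def count_windows(records, size):
    """Tumbling count window: close a window after every `size` records."""
    window = []
    for record in records:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:
        # Flushing the last, partially filled window is a choice made
        # for this sketch only.
        yield window


def event_time_windows(records, width):
    """Tumbling event-time window: place each (timestamp, value) pair in the
    window its own timestamp falls into, regardless of arrival order."""
    windows = defaultdict(list)
    for timestamp, value in records:
        start = (timestamp // width) * width
        windows[start].append(value)
    return dict(sorted(windows.items()))


# (event timestamp in seconds, value); the record stamped t=4 arrives late
readings = [(1, 10), (3, 20), (7, 30), (4, 40), (12, 50)]

by_count = list(count_windows([v for _, v in readings], size=2))
by_event_time = event_time_windows(readings, width=5)
# by_count      -> [[10, 20], [30, 40], [50]]
# by_event_time -> {0: [10, 20, 40], 5: [30], 10: [50]}
```

The late record lands next to its arrival neighbour in the count windows, but in the event-time windows it joins the window that its timestamp belongs to.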
Conclusion:
Past
● Spark came first as a unified platform and led the Big Data world.
● Flink took some time to come into existence.
Present
● Thanks to its head start, Spark is now more mature and has a big community and broad API support.
● Flink improved on the unified platform idea and is also capable of solving Spark's limitations to some extent. It claims to be faster in stream as well as batch processing.
Future
● As Spark has a very fast development cycle, it is expected to keep improving over time.
● Flink has proved itself better than Spark as far as abstraction is concerned, but it is still a newcomer.
THANK YOU
