SPARK
Alexey Diomin, diominay@gmail.com
Intro
Basic
 RDD
 DAG
RDD
 Resilient Distributed Dataset
 SchemaRDD
DAG
Mythology
 Spark is not MapReduce
 Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk
 InMemory processing
 Spark Streaming is real-time streaming
 Lightning-fast cluster computing
MapReduce
Not MapReduce
Spark
 Run programs up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk
*Hadoop without Tez
http://spark.apache.org/
InMemory
 The MapReduce and Spark shuffles use a “pull” model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.
 http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
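The pull model quoted above can be sketched as a toy word count in plain Java. This is not Spark or Hadoop code: the class `PullShuffleSketch`, its method names, and the file-naming scheme are assumptions made for illustration, and a local temporary directory stands in for both the map-side disk and the remote fetch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PullShuffleSketch {

    // Hypothetical naming scheme: one file per (map task, reduce partition) pair.
    static Path outputFile(Path dir, int mapId, int reduceId) {
        return dir.resolve("map" + mapId + "-reduce" + reduceId + ".txt");
    }

    // Map side: partition each record by key hash and append it to a local file.
    static void runMapTask(int mapId, List<String> words, int reducers, Path dir) {
        try {
            for (String word : words) {
                int reduceId = Math.floorMod(word.hashCode(), reducers);
                Files.write(outputFile(dir, mapId, reduceId),
                        (word + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reduce side: pull this partition from every map output, then aggregate.
    static Map<String, Integer> runReduceTask(int reduceId, int maps, Path dir) {
        Map<String, Integer> counts = new TreeMap<>();
        try {
            for (int mapId = 0; mapId < maps; mapId++) {
                Path file = outputFile(dir, mapId, reduceId);
                if (!Files.exists(file)) {
                    continue; // this map task wrote nothing for this partition
                }
                for (String word : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Driver: every map task finishes writing before any reduce task fetches.
    // The temporary directory is left behind; a real job would clean it up.
    static Map<String, Integer> wordCount(List<List<String>> splits, int reducers) {
        try {
            Path dir = Files.createTempDirectory("pull-shuffle");
            for (int mapId = 0; mapId < splits.size(); mapId++) {
                runMapTask(mapId, splits.get(mapId), reducers, dir);
            }
            Map<String, Integer> total = new TreeMap<>();
            for (int reduceId = 0; reduceId < reducers; reduceId++) {
                total.putAll(runReduceTask(reduceId, splits.size(), dir));
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Even when the whole input would fit in memory, the intermediate data in this sketch passes through the file system between the two sides, which is the behavior the quoted passage describes.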
Spark Streaming
 RDD
 DAG
Spark Streaming
Receiver.store(...)
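Spark Streaming groups the records handed over by a receiver into batches, one per batch interval, and processes each batch as a unit. A toy model of that behavior in plain Java follows. It is a sketch under assumptions, not Spark's receiver API: the class `MicroBatchSketch` and its methods are hypothetical, and arrival times are plain non-negative millisecond values supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {
    private final long batchIntervalMs;
    // Batch index -> records that arrived during that interval.
    private final Map<Long, List<String>> batches = new TreeMap<>();

    MicroBatchSketch(long batchIntervalMs) {
        if (batchIntervalMs <= 0) {
            throw new IllegalArgumentException("batch interval must be positive");
        }
        this.batchIntervalMs = batchIntervalMs;
    }

    // Toy counterpart of a receiver storing a record: the record is only
    // buffered into the batch of its arrival interval, not processed.
    void store(long arrivalMs, String record) {
        batches.computeIfAbsent(arrivalMs / batchIntervalMs, k -> new ArrayList<>())
               .add(record);
    }

    // Records collected for one batch, in arrival order.
    List<String> batch(long index) {
        return batches.getOrDefault(index, Collections.emptyList());
    }

    // A record can be processed no earlier than the end of its batch interval.
    long earliestProcessingMs(long arrivalMs) {
        return (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
    }
}
```

With a 1000 ms interval, a record stored at 10 ms cannot be processed before 1000 ms in this model, so the added latency is bounded below by the time left in the current interval rather than by the cost of handling one record.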
Google Cloud Dataflow
 One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts.
 http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
Lightning-fast cluster computing
Spark
 Logging
 Pipeline
 Indexes
 Job progress
 Effective Memory
 Network
Example
Staged (batch) execution
Pipelined execution
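The difference between the two execution styles named on these slides can be illustrated with a toy two-step job in plain Java. This is a sketch under assumptions, not code from Spark or any other engine: the class `StagedVsPipelined` is hypothetical, "map" doubles a number, "sink" consumes it, and the returned log records the order in which the steps touch each record.

```java
import java.util.ArrayList;
import java.util.List;

final class StagedVsPipelined {

    // Staged: the first step runs to completion and materializes its whole
    // output before the second step sees a single record.
    static List<String> staged(List<Integer> input) {
        List<String> log = new ArrayList<>();
        List<Integer> intermediate = new ArrayList<>();
        for (int x : input) {
            log.add("map " + x);
            intermediate.add(x * 2);
        }
        for (int y : intermediate) {
            log.add("sink " + y);
        }
        return log;
    }

    // Pipelined: each record is handed to the second step as soon as the
    // first step has produced it, so no intermediate collection is built.
    static List<String> pipelined(List<Integer> input) {
        List<String> log = new ArrayList<>();
        for (int x : input) {
            log.add("map " + x);
            int y = x * 2;
            log.add("sink " + y);
        }
        return log;
    }
}
```

For the input 1, 2 the staged log reads map 1, map 2, sink 2, sink 4, while the pipelined log reads map 1, sink 2, map 2, sink 4. In the staged variant the first result becomes available only after the whole input has been mapped, and the intermediate list has to be held somewhere in the meantime.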
Indexes
 Netflix
 https://github.com/amplab/spark-indexedrdd
Job Progress
 Accumulators
 Broadcast
Memory
 val value = task.run(taskId, attemptNumber)
 val valueBytes = resultSer.serialize(value)
 val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
 val serializedDirectResult = ser.serialize(directResult)
 Default JavaSerializer
// java.io.ByteArrayOutputStream: returns a fresh copy of the internal buffer
public synchronized byte toByteArray()[] {
    return Arrays.copyOf(buf, count);
}
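The effect of the lines above on memory can be reproduced with nothing but the JDK. The sketch below is not Spark code: `ResultCopies` and its nested `Wrapper` are hypothetical stand-ins (the wrapper plays the role of a result object that carries already serialized bytes), and plain Java serialization is used because the slide names the default JavaSerializer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class ResultCopies {

    // Hypothetical stand-in for a result wrapper: it only carries the
    // already serialized value, so serializing it copies those bytes again.
    static final class Wrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;

        Wrapper(byte[] valueBytes) {
            this.valueBytes = valueBytes;
        }
    }

    // Plain Java serialization into a growable in-memory buffer.
    static byte[] serialize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // toByteArray() allocates a new array and copies the buffer into it,
        // so the stream's internal buffer and the result coexist briefly.
        return buffer.toByteArray();
    }
}
```

Calling `ResultCopies.serialize(new ResultCopies.Wrapper(ResultCopies.serialize(value)))` leaves the original value, its serialized bytes, and the serialized wrapper reachable at the same time, with an extra internal stream buffer alive during each call. For a large task result that is several times the size of the result itself.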
Network
 Problems with firewalls, NAT, multiple IP addresses, etc.
SQL
 Shark (dead)
 Spark SQL
 Hive on Spark
SparkR
 Unstable API
 Minimal documentation
 RStudio Server
Links
 Spark
 http://spark.apache.org/
 Flink
 http://flink.apache.org/
 Tez
 http://tez.apache.org/


Why Spark Is Not Nearly So Good
