FLINK - A CONVENIENT ABSTRACTION LAYER
FOR YARN?
VYACHESLAV ZHOLUDEV
INTRODUCTION
• YARN opened Hadoop for many more developers
• API to integrate into a Hadoop cluster
• Flexibility
• Applications: MR, TEZ, Flink, Spark,…
• Flink has made great use of this opportunity
• Flexible program execution graph
• Operators other than Map and Reduce
• Clean and convenient API
• Efficient with I/O
EXPECTATIONS FROM YARN
• New programming models in addition to MapReduce
• More alternatives to cover cases where the MapReduce paradigm is
not a good fit
• Flexibility with expressing operations on data
• Elasticity of a cluster
• Ability to write your own applications to distribute computations across
the cluster
DISTRIBUTING COMPUTATIONAL TASKS
• Writing your own YARN application
• Complicated
• Tedious
• Error-prone
• Somebody must have done
something simpler
• Apache Twill
• Still not simple enough
• Execute CLI tools remotely
(if everything else fails)
• Flink?
FLINK AT RESEARCHGATE
Lots of benefits:
• Made MapReduce jobs more readable
• More compact
• Less boilerplate code
• Easier to understand and maintain
• Got rid of ugly Hive queries and optimised runtime
• Better and cleaner orchestration of workflow
subtasks (previously we had to glue multiple MR jobs together)
• Iterative machine learning algorithms
• Distributing computational tasks across a cluster
REAL USE CASE:
MONGODB TO AVRO BRIDGE
REAL USE CASE
A MongoDB to Avro Bridge (aka Avrongo)
Used to dump live DB data to HDFS for further batch processing and analytics
• In essence:
• Reads MongoDB documents
• Converts them to Avro records (based on a provided Avro schema)
• Persists them on HDFS
• Avrongo evolution:
• Single-threaded program
• Multi-threaded program talking to different shards in parallel
• Distributed across the cluster
• Reasons for distributing:
• We were CPU-bound
• HDFS load distribution
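The read, convert, persist flow can be illustrated with a minimal sketch of the conversion step. This is not Avrongo's converter: the schema, its field names, and the `to_record` helper are hypothetical, the document is a plain dict standing in for a MongoDB document, and no type validation or Avro serialisation is performed.

```python
import json

# Hypothetical Avro schema for illustration; the record and field names are assumptions.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Publication",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "title", "type": "string"},
    {"name": "year", "type": ["null", "int"], "default": null}
  ]
}
""")


def to_record(document, schema):
    """Project a MongoDB-style document (a dict) onto the fields of the schema."""
    record = {}
    for field in schema["fields"]:
        name = field["name"]
        # MongoDB keeps the primary key under "_id"; this sketch maps it to "id".
        source_key = "_id" if name == "id" else name
        if source_key in document:
            value = document[source_key]
            record[name] = str(value) if name == "id" else value
        elif "default" in field:
            # Fall back to the default declared in the schema.
            record[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return record
```

Fields that are present in the document but absent from the schema are simply dropped, which is one plausible reading of "based on a provided Avro schema".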
HOW DOES AVRONGO WORK?
Basic Version
• One thread
• Using one MongoDB cursor to iterate the whole collection
• Suitable for smaller collections
MONGODB SHARDS AND CHUNKS
• Controlling load on the MongoDB cluster
• Deterministic way of splitting a collection for input
Utilizing MongoDB chunks
AVRONGO - SHARDED VERSION
• Collecting chunk information (sets of documents living on a particular
shard)
• Processing chunks of each shard in a separate group of threads
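The two steps above can be sketched as follows. This is an illustration only, not the Avrongo code: each chunk is assumed to be a dict with `shard`, `min` and `max` keys, and `process_chunk` is a hypothetical placeholder for reading the chunk's key range, converting the documents and writing them to HDFS.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_chunks_by_shard(chunks):
    """Group chunk descriptors by the shard they live on."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["shard"]].append(chunk)
    return dict(groups)


def process_chunk(chunk):
    # Hypothetical placeholder: a real version would iterate the key range
    # [min, max) on the chunk's shard, convert documents and write them out.
    return (chunk["shard"], chunk["min"], chunk["max"])


def run_sharded(chunks, threads_per_shard=2):
    """Process the chunks of each shard in its own group of threads."""
    pools, futures = [], []
    try:
        for shard_chunks in group_chunks_by_shard(chunks).values():
            # One pool per shard caps the number of concurrent readers per shard.
            pool = ThreadPoolExecutor(max_workers=threads_per_shard)
            pools.append(pool)
            futures.extend(pool.submit(process_chunk, c) for c in shard_chunks)
        return [f.result() for f in futures]
    finally:
        for pool in pools:
            pool.shutdown()
```

Keeping a separate pool per shard is what bounds the load on any single shard, independent of how many chunks it holds.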
AVRONGO - FLINK VERSION
• Custom InputFormat that distributes MongoDB chunks uniformly
• FlatMap operator
• Number of task nodes = (number of shards) x (parallelism per shard)
• Custom Generic AvroOutputFormat
• Slower shards receive a bit more attention
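The task-count formula and a uniform distribution of chunks can be sketched in plain Python. This is one possible reading of the slide, not the custom InputFormat itself: the function names are hypothetical, each shard is assumed to get its own contiguous group of task indices, and chunks are handed out round-robin within that group.

```python
def task_count(num_shards, parallelism_per_shard):
    """Number of task nodes = (number of shards) x (parallelism per shard)."""
    return num_shards * parallelism_per_shard


def assign_chunks(chunks_by_shard, parallelism_per_shard):
    """Return {task_index: [chunk, ...]} with each shard's chunks spread
    round-robin over that shard's own group of tasks."""
    assignment = {}
    for shard_index, shard in enumerate(sorted(chunks_by_shard)):
        base = shard_index * parallelism_per_shard
        # Reserve a group of task indices for this shard.
        for slot in range(parallelism_per_shard):
            assignment[base + slot] = []
        for i, chunk in enumerate(chunks_by_shard[shard]):
            assignment[base + i % parallelism_per_shard].append(chunk)
    return assignment
```

For example, two shards with a parallelism of 2 per shard give four tasks, and a shard with three chunks places two of them on its first task and one on its second.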
FLINK APPROACH
Outcome
• No longer bound by CPU
• Imports to HDFS are faster
• Some collections: from 6h to 2.5h or from 3.5h to 2h
Benefits
• Very few lines of code
• Same command line interface (no effort needed to migrate to the Flink-based version)
• Reusing the same converter as in the standalone versions
• All orchestration and parallelisation work is done automatically by Flink
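The import times quoted in the outcome (6h to 2.5h, 3.5h to 2h) can be turned into speedup factors with a short calculation. The helper names are hypothetical; only the four durations come from the slide.

```python
def speedup(before_hours, after_hours):
    """How many times faster the import became."""
    return before_hours / after_hours


def time_saved(before_hours, after_hours):
    """Fraction of the original runtime that was saved."""
    return 1 - after_hours / before_hours


# Durations taken from the slide: (before, after) in hours.
for before, after in [(6, 2.5), (3.5, 2)]:
    print(f"{before}h -> {after}h: {speedup(before, after):.2f}x faster, "
          f"{time_saved(before, after):.0%} less time")
```

The two collections improve by factors of 2.4 and 1.75 respectively, so the gain varies noticeably from collection to collection.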
ANOTHER USE CASE:
DISTRIBUTED FILE COPYING
HADOOP DISTCP
• Generates a MapReduce job that copies large amounts of data
• List of files as an input to a Map Task
• Two types of Input Formats:
• UniformSizeInputFormat
• DynamicInputFormat
• gives more load to faster mappers
• complicated code
• utilizes FS to feed the mappers
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
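The idea of giving more load to faster mappers can be illustrated with a small simulation. This is a sketch of the general work-pulling principle, not of how DistCp implements DynamicInputFormat: the function names, the file sizes and the worker speeds are all assumptions made for illustration.

```python
import heapq


def simulate_uniform(file_sizes, worker_speeds):
    """Files are split up front, round-robin, regardless of worker speed."""
    n = len(worker_speeds)
    busy = [0.0] * n
    copied = [0] * n
    for i, size in enumerate(file_sizes):
        w = i % n
        busy[w] += size / worker_speeds[w]
        copied[w] += 1
    return copied, max(busy)


def simulate_dynamic(file_sizes, worker_speeds):
    """Each worker pulls the next file from a shared list as soon as it is free."""
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    copied = [0] * len(worker_speeds)
    finish = 0.0
    for size in file_sizes:
        t, w = heapq.heappop(free_at)  # the earliest free worker takes the file
        t += size / worker_speeds[w]
        copied[w] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, w))
    return copied, finish
```

With twelve equally sized files and one worker twice as fast as the other, the uniform split gives each worker six files and finishes when the slow worker is done, while the dynamic variant lets the fast worker take eight files and finishes earlier.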
FLINK DISTCP
• Implements the same logic as the DynamicInputFormat of Hadoop’s distcp
• Far fewer lines of code
• Same runtime as Hadoop distcp
• Available in the Flink Java examples
• Not fault-tolerant (yet)
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
CONCLUSIONS
• Flink - a thin layer for implementing your YARN application for parallelising
independent tasks on the cluster
• Thanks to custom input formats that are easy to implement
• No boilerplate code
Would be nice to have:
• Elasticity
• Better progress tracking
• Fault tolerance
Custom input format + a Flink operator with business logic = Happiness
QUESTIONS?
https://www.researchgate.net/careers

