Apache Flink
Hadoop Compatibility
Fabian Hueske @fhueske
Hadoop MapReduce Jobs
Input → Map → Reduce → Output
(implemented by InputFormat → Mapper → Reducer → OutputFormat)
• Jobs have a static structure.
• Input, Output, Map, Reduce run your custom (or library) code.
• If the application logic is too complex for one Map/Reduce pass, you need more than one job (see the driver sketch below).
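For reference, a minimal driver against the classic mapred.* API shows this fixed wiring. This is a hedged sketch: Tokenizer and Counter are the WordCount mapper and reducer used later in this deck, the paths are placeholders, and the org.apache.hadoop.mapred.* and org.apache.hadoop.io.* imports are omitted as on the slides.

JobConf conf = new JobConf();
conf.setJobName("wordcount");
conf.setInputFormat(TextInputFormat.class);                // Input
conf.setMapperClass(Tokenizer.class);                      // Map: your code
conf.setReducerClass(Counter.class);                       // Reduce: your code
conf.setOutputFormat(TextOutputFormat.class);              // Output
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(conf, new Path("/input"));    // placeholder path
FileOutputFormat.setOutputPath(conf, new Path("/output")); // placeholder path
JobClient.runJob(conf);  // exactly one Map phase and one Reduce phase per job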
Flink Programs
[Diagram: a DAG data flow of Source, Map, Reduce, Filter, Join, and CoGroup operators feeding a Sink]
• Flink programs are DAG data flows.
• Data Sources, Data Sinks, Map, and Reduce operators are included.
• Everything that MapReduce gives and much more (a superset).
• Much better performance,
– especially if more than one MR job would otherwise be needed (a sketch follows).
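A hedged sketch of such a DAG-shaped program on the DataSet API (Java; the file paths and the tab-separated input schema are assumptions, and Flink/Hadoop imports are omitted as on the slides):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Source 1: count records per key (Filter -> Map -> Reduce)
DataSet<Tuple2<String, Integer>> counts = env
    .readTextFile("hdfs:///logs")
    .filter(new FilterFunction<String>() {
        public boolean filter(String line) { return !line.isEmpty(); }
    })
    .map(new MapFunction<String, Tuple2<String, Integer>>() {
        public Tuple2<String, Integer> map(String line) {
            return new Tuple2<String, Integer>(line.split("\t")[0], 1);
        }
    })
    .groupBy(0).sum(1);

// Source 2: a second input, joined against the counts
DataSet<Tuple2<String, String>> users = env
    .readTextFile("hdfs:///users")
    .map(new MapFunction<String, Tuple2<String, String>>() {
        public Tuple2<String, String> map(String line) {
            String[] f = line.split("\t");
            return new Tuple2<String, String>(f[0], f[1]);
        }
    });

counts.join(users).where(0).equalTo(0)  // Join on the key field
      .writeAsText("hdfs:///out");      // Sink
env.execute("DAG example");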
Run your Hadoop code with Flink?
• Hadoop data types (Writable) are natively supported.
• Hadoop Filesystems are natively supported.
• Flink features Input- & OutputFormats, Map, and Reduce functions, just like Hadoop MapReduce.
• Concepts are the same, but interfaces are not :-(
But Flink provides wrappers for Hadoop code :-)
• mapred.* API: In/OutputFormat, Mappers, & Reducers
• mapreduce.* API: In/OutputFormat
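For example, reading input through the mapreduce.* wrapper might look like this. A hedged sketch: the wrapper's package has moved between Flink versions, inputPath is a placeholder, and env is the ExecutionEnvironment as in the WordCount below.

Job job = Job.getInstance();                             // org.apache.hadoop.mapreduce.Job
FileInputFormat.addInputPath(job, new Path(inputPath));  // mapreduce.lib.input.FileInputFormat
DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(
    new HadoopInputFormat<LongWritable, Text>(           // the mapreduce.* wrapper
        new TextInputFormat(), LongWritable.class, Text.class, job));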
Alright, sounds good…
… but will my WordCount still work?!?
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// set up the Hadoop InputFormat
HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

// read data with the Hadoop InputFormat
DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

DataSet<Tuple2<Text, LongWritable>> words = text
    // apply the Hadoop Mapper
    .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
    // apply the Hadoop Reducer
    .groupBy(0)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(new Counter()));

// set up the Hadoop OutputFormat
HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
    new HadoopOutputFormat<Text, LongWritable>(
        new TextOutputFormat<Text, LongWritable>(), new JobConf());
hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

// write data with the Hadoop OutputFormat
words.output(hadoopOutputFormat);

// execute the program
env.execute("Hadoop Compat WordCount");
(Highlighted in the listing: Hadoop data types, Hadoop Input- & OutputFormats, and your Hadoop functions.)
Yes, it will…
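The Tokenizer and Counter referenced above are ordinary mapred.* WordCount implementations. Their bodies are not on the slides, so the following is an assumed sketch (each class in its own file; org.apache.hadoop.mapred.* and org.apache.hadoop.io.* imports omitted):

public class Tokenizer extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out,
                    Reporter reporter) throws IOException {
        // emit (word, 1) for each token of the line
        for (String word : line.toString().toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.collect(new Text(word), new LongWritable(1L));
            }
        }
    }
}

public class Counter extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text word, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out,
                       Reporter reporter) throws IOException {
        // sum the partial counts for each word
        long sum = 0;
        while (counts.hasNext()) { sum += counts.next().get(); }
        out.collect(word, new LongWritable(sum));
    }
}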
Use MapReduce like you always wanted
• Freely assemble your functions into a program.
• Very efficient, pipelined execution.
– Program is executed on Flink (no Hadoop involved).
– No writing to/reading from HDFS within a program.
• Caveat: No support for custom Hadoop partitioners & sorters, yet :-(
[Diagram: Input, Map, and Reduce operators freely assembled into one pipelined program with multiple Outputs]
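For instance, a second wrapped Map/Reduce pass can be chained directly onto the first, where plain MapReduce would need a second job and an HDFS round trip in between. A hedged sketch, continuing from the WordCount above (words is the output of the first Reduce; Normalizer is a hypothetical second Hadoop Mapper<Text, LongWritable, Text, LongWritable>):

DataSet<Tuple2<Text, LongWritable>> refined = words
    .flatMap(new HadoopMapFunction<Text, LongWritable, Text, LongWritable>(
        new Normalizer()))  // hypothetical second Hadoop Mapper
    .groupBy(0)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(
        new Counter()));    // reuse the Hadoop Reducer; pipelined, no HDFS in between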
WHAT TO EXPECT NEXT?
Do not change a single line of code!
• Inject MapReduce jobs as a whole into Flink programs
– with support for custom partitioners, sorters, groupers.
• Run Hadoop MapReduce jobs on Flink
– without changing a single line of code.
[Diagram: the same DAG data flow, with a complete Hadoop MapReduce job injected as a single operator]
Looking for some fun?
Try Hadoop on Flink!