FLINK-KUDU-CONNECTOR
AN OPEN-SOURCE CONTRIBUTION TO
DEVELOP KAPPA ARCHITECTURES
WHO WE ARE
Team
RUBÉN CASADO
Big Data Manager, Accenture Digital
• Big Data Chapter Lead in Accenture Digital Delivery Spain
• Apache Flink Madrid Meetup organizer
• Director of the Master in Big Data Architecture at Kschool
• PhD in Computer Science / Software Engineering
• Passionate about open source
ruben_casado
ruben.casado.tejedor@accenture.com
NACHO GARCÍA
Senior Big Data Engineer, Accenture Digital
• Senior Big Data Engineer at Accenture Digital
• Lecturer in the Master in Big Data at Kschool
• MSc in Computer Science
0xNacho
n.garcia.fernandez@accenture.com
Copyright © 2017 Accenture. All rights reserved.
█ CONTEXT
Agenda
ACCENTURE HAS BEEN RANKED AS #1 BIG DATA PROVIDER IN SPAIN BY
PENTEO, A VERY PRESTIGIOUS IT BENCHMARKING/ADVISORY FIRM,
THANKS TO ITS DEEP KNOWLEDGE OF BUSINESS REQUIREMENTS IN
DIFFERENT DOMAINS, ITS LARGE NETWORK OF ACADEMIC AND
TECHNOLOGY PARTNER ALLIANCES, AND ITS DELIVERY CAPACITY
PROJECT STARTED BY JUNIOR DEVELOPERS
• Learn Apache Flink
• Learn Apache Kudu
• Encourage the use and contribution to
the open source community
1) Open Jira ticket
2) Ticket is assigned to you
3) Open a Pull-Request
4) Overhaul the code
5) Done!
█ KAPPA ARCHITECTURE: A REAL NEED
Agenda
MOTIVATION
Batch
Processing
NoSQL
Stream
processing
LAMBDA ARCHITECTURE
Two processing engines
[Diagram: Lambda architecture. ALL DATA feeds the BATCH LAYER, which produces batch views; RECENT DATA feeds the SPEED LAYER, which produces real-time views; the SERVING LAYER answers QUERIES from both kinds of view.]
KAPPA ARCHITECTURE
A single engine
[Diagram: Kappa architecture. A REPLAYABLE input feeds the STREAMING LAYER, whose results reach the SERVING LAYER, which answers QUERIES.]
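The single-engine idea can be sketched in plain Java. This is only an in-memory illustration, not Flink or Kafka code: the class names (KappaSketch, ReplayableLog) and the event values are hypothetical. It assumes that the input is an append-only log that can be re-read from any offset, so reprocessing means running the same job again from the start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a replayable log plus one streaming job. */
final class KappaSketch {

    /** Append-only log; offsets are list indices, so any suffix can be replayed. */
    static final class ReplayableLog {
        private final List<String> events = new ArrayList<>();

        void append(String event) {
            events.add(event);
        }

        /** Returns every event from the given offset onwards, in order. */
        List<String> readFrom(int offset) {
            return new ArrayList<>(events.subList(offset, events.size()));
        }
    }

    /** The single processing engine: folds events into a serving-layer view. */
    static Map<String, Integer> buildView(List<String> events) {
        Map<String, Integer> view = new TreeMap<>();
        for (String event : events) {
            view.merge(event, 1, Integer::sum); // count occurrences per key
        }
        return view;
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        for (String event : new String[] {"taxi-1", "taxi-2", "taxi-1"}) {
            log.append(event);
        }
        // Reprocessing is just running the same job again from offset 0.
        System.out.println(buildView(log.readFrom(0)));
    }
}
```

There is no second code path for batch recomputation: a changed job is simply started again against the same log.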
█ APACHE KUDU AND APACHE FLINK: INTRODUCTION AND FEATURES
Agenda
APACHE KUDU
What is Apache Kudu?
Online (fast random access)
Analytics (fast scans)
https://www.youtube.com/watch?v=32zV7-I1JaM
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
APACHE KUDU
What is Apache Kudu?
Designed for fast analytics on fast data
An open-source column-oriented data store
• Provides a combination of fast inserts/updates and efficient columnar
scans
• Tables have a structured data model similar to an RDBMS
• Fast processing of OLAP workloads
• Integration with Hadoop ecosystem (Impala, Spark, Flink)
• Columnar data storage: strongly-typed columns
• Read efficiency: reads by columns, and not by rows
• Very efficient if we need to read just a portion (some columns) of a row. It reads a minimal number of blocks on disk
• Data compression: because of the strongly-typed columns, compression is more efficient
• Table: the place where data is stored. Split into segments called tablets
• Tablet: contiguous segment of a table (partition)
• Replicated on multiple tablet servers. Leader-follower model
• Tablet server: stores and serves tablets to clients
• Master: keeps track of everything. Leader-follower model
• Catalog table: central location for metadata.
APACHE KUDU: CONCEPTS
Key concepts of Apache Kudu
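To make the "reads by columns, and not by rows" point concrete, here is a minimal plain-Java sketch. It does not use the Kudu client or reflect Kudu's on-disk format; the table layout and the column names (vehicleTag, speed, driver) are hypothetical. It only shows that when each column lives in its own strongly typed array, an aggregate over one column never touches the others.

```java
import java.util.Arrays;

/** Toy column-oriented table; the column names are hypothetical. */
final class ColumnarScanSketch {

    /** One strongly typed array per column, all of the same length. */
    static final class TaxiTable {
        final long[] vehicleTag;
        final int[] speed;
        final String[] driver;

        TaxiTable(long[] vehicleTag, int[] speed, String[] driver) {
            this.vehicleTag = vehicleTag;
            this.speed = speed;
            this.driver = driver;
        }
    }

    /** Scans only the speed column; the other columns are never read. */
    static double averageSpeed(TaxiTable table) {
        return Arrays.stream(table.speed).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        TaxiTable table = new TaxiTable(
                new long[] {32129L, 32130L, 32131L},
                new int[] {40, 55, 70},
                new String[] {"ana", "luis", "eva"});
        System.out.println(averageSpeed(table)); // prints 55.0
    }
}
```

A row-oriented layout would have to walk every full record to compute the same average, which is the cost the columnar design avoids.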
APACHE KUDU: ARCHITECTURE
Kudu network architecture
[Diagram: Master Servers and Tablet Servers. Three master servers: Server A (LEADER), Server B (Follower) and Server C (Follower). Several tablet servers (labelled D, E and F in the figure), each hosting replicas of Tablet 1, Tablet 2, ... Tablet n. Every tablet has one LEADER replica on one tablet server and Follower replicas on the others.]
Apache Flink
Flink programs are all about operators
How Flink works
source map() keyBy() sink
STREAMING DATAFLOW
SOURCE TRANSFORMATION SINK
OPERATORS
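The source → map() → keyBy() → sink chain can be mimicked with java.util.stream. This is an analogy on a bounded in-memory source, not the Flink DataStream API; the record layout ("vehicle,speed") and all names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Plain-Java analogy of a source -> map() -> keyBy() -> sink dataflow. */
final class OperatorChainSketch {

    /** A parsed measurement; the record layout is hypothetical. */
    static final class Measurement {
        final String vehicle;
        final int speed;

        Measurement(String vehicle, int speed) {
            this.vehicle = vehicle;
            this.speed = speed;
        }
    }

    /** map(): turns a raw "vehicle,speed" line into a typed record. */
    static Measurement parse(String line) {
        String[] parts = line.split(",");
        return new Measurement(parts[0], Integer.parseInt(parts[1]));
    }

    /** keyBy() plus a per-key maximum, collected into a map acting as the sink. */
    static Map<String, Integer> maxSpeedPerVehicle(Stream<String> source) {
        return source
                .map(OperatorChainSketch::parse)
                .collect(Collectors.toMap(
                        m -> m.vehicle,   // key selector, like keyBy()
                        m -> m.speed,
                        Integer::max,     // combine values that share a key
                        TreeMap::new));
    }

    public static void main(String[] args) {
        Stream<String> source = Stream.of("taxi-1,40", "taxi-2,55", "taxi-1,70");
        System.out.println(maxSpeedPerVehicle(source));
    }
}
```

In Flink each of these steps is an operator that can run with its own parallelism, and the source may be unbounded; the stream pipeline above only conveys the shape of the dataflow.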
█ FLINK-KUDU-CONNECTOR: A DEEP EXPLANATION
Agenda
FLINK CONNECTORS
out there
• flink-connector-elasticsearch
• flink-connector-kafka
• flink-connector-redis
• flink-connector-influxdb
• flink-connector-kinesis
• flink-connector-hbase
• flink-jdbc
• …
FLINK-KUDU-CONNECTOR
map() …
STREAMING DATAFLOW
Kudu Table Kudu Table
DataSet and DataStream APIs (batch & streaming)
FLINK IO API
Flink IO
org.apache.flink.api.common.io.InputFormat
• KuduInputFormat
org.apache.flink.api.common.io.OutputFormat
• KuduOutputFormat
org.apache.flink.streaming.api.functions.source.SourceFunction
• Not Implemented
• Kudu does not provide CDC
• Open issue: here
org.apache.flink.streaming.api.functions.sink.SinkFunction
• KuduSinkFunction
KUDU CONNECTOR CLASSES
CLASS                 USAGE     BASE                                   API
KuduInputFormat       PUBLIC    InputFormat                            DataSet
KuduOutputFormat      PUBLIC    OutputFormat                           DataSet
KuduSink              PUBLIC    SinkFunction                           DataStream
KuduInputSplit        INTERNAL  InputSplit                             -
KuduBatchTableSource  PUBLIC    BatchTableSource                       Table
KuduTableSink         PUBLIC    AppendStreamTableSink, BatchTableSink  Table
KUDUINPUTFORMAT
sample program
KuduInputFormat: Example
// Configure which Kudu master and table to read from
KuduInputFormat.Conf inputConfig = KuduInputFormat.Conf.builder()
    .masterAddress("localhost")
    .tableName("myTable")
    //.addPredicate(new Predicate.PredicateBuilder("vehicle_tag").isEqualTo().val(32129))
    .build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Read the table as a DataSet of (Long, Integer, String) tuples
env.createInput(new KuduInputFormat<>(inputConfig), new TupleTypeInfo<Tuple3<Long, Integer, String>>(
        BasicTypeInfo.LONG_TYPE_INFO,
        BasicTypeInfo.INT_TYPE_INFO,
        BasicTypeInfo.STRING_TYPE_INFO))
    .flatMap(…)…
KUDUINPUTFORMAT
Reading data from Kudu in Flink
A program with parallelism = 5, TMs = 5, slots = 1 per TM
[Diagram: the KUDU MASTER and a KUDU TABLE split into TABLET 1 to TABLET 4. Parallel instances 1 to 4 (TM 1 to TM 4, slot 1 each) each read one tablet and feed a map operator; parallel instance 5 (TM 5, slot 1) has no tablet to read. Callouts: IDLE, DATA SKEW.]
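The IDLE callout follows from simple arithmetic: with more parallel instances than tablets, some instance gets nothing to read. The sketch below assumes one input split per tablet and a round-robin assignment; both are simplifications for illustration, and Flink's real split assigner may hand out splits differently. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of handing tablets (one input split each) to parallel readers. */
final class SplitAssignmentSketch {

    /**
     * Assigns tablets 1..tablets to instances round-robin.
     * Round-robin is an assumption made for this sketch only.
     */
    static List<List<Integer>> assign(int tablets, int parallelism) {
        List<List<Integer>> perInstance = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perInstance.add(new ArrayList<>());
        }
        for (int t = 0; t < tablets; t++) {
            perInstance.get(t % parallelism).add(t + 1);
        }
        return perInstance;
    }

    /** Counts instances that received no tablet at all. */
    static long idleInstances(List<List<Integer>> assignment) {
        return assignment.stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        // The slide's setup: 4 tablets, parallelism 5.
        List<List<Integer>> assignment = assign(4, 5);
        System.out.println(assignment);                // [[1], [2], [3], [4], []]
        System.out.println(idleInstances(assignment)); // 1
    }
}
```

Matching the parallelism to the number of tablets removes the idle instance, but it does not help with skew when the tablets themselves hold very different amounts of data.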
KUDUOUTPUTFORMAT
sample program
Flink connectors
// Configure the target Kudu table and the write mode
KuduOutputFormat.Conf outputConfig = KuduOutputFormat.Conf.builder()
    .masterAddress("localhost")
    .tableName("test")
    .writeMode(KuduOutputFormat.Conf.WriteMode.UPSERT)
    .build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet dataset = env.fromElements(1,2,3,4,5,5,6);
// Transform the data, then write the result to Kudu
dataset.map(…)
    .combineGroup(…)
    .reduceGroup(…)
    .output(new KuduOutputFormat(outputConfig));
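Why the write mode matters can be shown with a toy keyed table. This sketch does not use the connector or the Kudu client. The WriteMode enum below is a local stand-in (only UPSERT appears in the sample; INSERT is assumed here for contrast), and treating each element as the primary key is also an assumption. In Kudu, inserting a row whose primary key already exists is rejected, while an upsert overwrites it.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of insert vs. upsert on a table with a primary key. */
final class WriteModeSketch {

    /** Local stand-in for a write mode; not the connector's own enum. */
    enum WriteMode { INSERT, UPSERT }

    /** Minimal keyed table: primary key -> value. */
    static final class Table {
        private final Map<Integer, String> rows = new TreeMap<>();

        /** Returns true if the row was written, false if it was rejected. */
        boolean write(WriteMode mode, int key, String value) {
            if (mode == WriteMode.INSERT && rows.containsKey(key)) {
                return false; // duplicate primary key: the insert is rejected
            }
            rows.put(key, value); // new row, or overwrite in UPSERT mode
            return true;
        }
    }

    /** Writes all keys in the given mode and counts the rejected rows. */
    static int countRejected(WriteMode mode, int[] keys) {
        Table table = new Table();
        int rejected = 0;
        for (int i = 0; i < keys.length; i++) {
            if (!table.write(mode, keys[i], "v" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 3, 4, 5, 5, 6}; // same elements as the sample, 5 appears twice
        System.out.println(countRejected(WriteMode.INSERT, keys)); // 1
        System.out.println(countRejected(WriteMode.UPSERT, keys)); // 0
    }
}
```

Upsert also makes a re-run of the same job idempotent with respect to keys, which fits the replay-based reprocessing of a Kappa architecture.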
KUDUOUTPUTFORMAT
Writing to Kudu from Flink
A program with parallelism = 4, TMs = 4, slots = 1 per TM
[Diagram: parallel instances 1 to 4 (TM 1 to TM 4, slot 1 each) each run a map operator followed by an output operator and write into a KUDU TABLE split into TABLET 1 to TABLET 4.]
FLINK TABLE API
Kudu working with Flink Table API
Kudu and Table API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
KuduBatchTableSource kuduTableSource = new KuduBatchTableSource(conf, new TupleTypeInfo<>(
// type information
));
tEnv.registerTableSource("Taxis", kuduTableSource);
Table result = tEnv.sql("SELECT f1, AVG(f4), COUNT(*) from Taxis group by f1")
    .as("vehicle,avgSpeed,totalMeasures");
result.writeToSink(new KuduTableSink<>(outputConf, new TupleTypeInfo<>(
BasicTypeInfo.LONG_TYPE_INFO,
BasicTypeInfo.DOUBLE_TYPE_INFO,
BasicTypeInfo.LONG_TYPE_INFO
)));
DEMO
• Ongoing contribution to Apache Bahir: https://github.com/apache/bahir-flink/pull/17
• Code samples: https://github.com/0xNacho/kudu-flink-examples/
• flink-connector-kudu: https://github.com/0xNacho/bahir-flink branch: feature/flink-connector-kudu
C0DE
Contributions are welcome!
FUTURE WORK
A first version is released, but the team keeps working
• Support more data types (currently only Tuples are supported)
• Implement missing features (e.g. KuduSource, when CDC (KUDU-2180) is
available in Kudu)
• Fix the issue with the limit() operator: this is currently an open issue in Kudu's
Java API (KUDU-16, KUDU-2093)
THAT’S ALL FOLKS. THANK YOU!

Apache Flink & Kudu: a connector to develop Kappa architectures
