Integrating Apache Flink with Apache Hive
Xuefu Zhang, Senior Staff Engineer, Hive PMC
Bowen Li, Senior Engineer
Seattle Apache Flink Meetup - 02/21/2019
Agenda
● Background
● Motivations
● Goals
● Architecture and Design
● Roadmap
● Current Progress
● Demo
● Q&A
Background
● Stream analytics users usually also have offline, batch analytics
● Many batch users want to reduce latency by moving some of their analytics to
real-time
● AI is a major driving force behind both real-time and batch analytics (training
a model offline and applying it in real time)
● ETL is still an important use case for big data
● SQL is the main tool for processing big data, streaming or batch
Background (cont’d)
● Flink has shown prevailing advantages over other solutions for
heavy-volume stream processing
1.7B events/sec · EB total · PB every day · 1T events/day
Background (cont’d)
● In Blink, we systematically explored Flink’s capabilities in batch processing
● Flink shows great potential
TPC-DS: Blink vs. Spark (the lower, the better)
Observation: the larger the data size, the more performance advantage Flink has
Performance of Blink versus Spark 2.3.1 in the TPC-DS benchmark, aggregate time for all queries together.
Presentation by Xiaowei Jiang at Flink Forward Beijing, Dec 2018.
Flink is the fastest due to its pipelined execution
Tez and Spark do not overlap 1st and 2nd stages
MapReduce is slow despite overlapping stages
A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Background (cont’d)
● Flink needs a persistent storage for its metadata
Background (cont’d)
● Hive is the de facto standard for big data/batch processing on Hadoop
● Hive is the center of the big data ecosystem with its metadata store
● Streaming users usually have a Hive deployment and need to access data/metadata managed by
Hive
● For Hive users, new requirements may arise for stream processing
Integrating Flink with Hive
Motivations
● Strengthen Flink’s lead in stream processing
● Unify a solution for both stream and batch processing
● Provide a unified SQL interface for the solution
● Enrich Flink ecosystem
● Promote Flink’s adoption
Goals
● Access all Hive metadata stored in Hive metastore
● Access all Hive data managed by Hive
● Store Flink metadata (both streaming or batch) in Hive metastore
● Compatible with Hive grammar while abiding by the SQL standard
● Support user custom objects defined in Hive so they continue to work in Flink SQL
○ UDFs
○ serdes
○ storage handlers
Goals (cont’d)
● Feature parity with Hive (partitioning, bucketing, etc)
● Data types compatibility
● Make Flink an alternative, more powerful engine in Hive (longer term)
Integrating Flink with Hive - Goals
Flink DML or Hive DML
Hive Metadata is 100% compatible
(Data types, Tables, Views, UDFs)
Hive Data is 100% accessible
Architecture
Flink Deployment
Flink Runtime
Query processing & optimization
Table API and SQL Catalog APIs
SQL Client/Zeppelin
Design
GenericInMemoryCatalog
GenericHiveMetastoreCatalog
ReadableCatalog
ReadableWritableCatalog
HiveCatalog
HiveMetastoreClient
CatalogManager
TableEnvironment
inheritance reference
SQL Client HiveCatalogBase
Hive Metastore
Catalog APIs
Design (cont’d)
BatchTableFactory
HiveTableFactory
BatchTableSource
HiveTableSource
InputFormat
HiveTableInputFormat
BatchTableSink
HiveTableSink
OutputFormat
HiveTableOutputFormat
Read
Write
Hive Data
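As a very rough illustration of the read path in this diagram - a factory creates a table source, which reads through an input format - here is a self-contained sketch; the class bodies are illustrative stand-ins, not the actual Flink implementations:

```java
import java.util.List;

// Illustrative stand-ins for the chain in the diagram above:
// HiveTableFactory -> HiveTableSource -> HiveTableInputFormat.
// These are sketches, not the actual Flink classes.
interface InputFormat<T> {
    List<T> read();
}

class HiveTableInputFormat implements InputFormat<String> {
    private final List<String> rows;  // stands in for data read from Hive files
    HiveTableInputFormat(List<String> rows) { this.rows = rows; }
    public List<String> read() { return rows; }
}

class HiveTableSource {
    private final InputFormat<String> format;
    HiveTableSource(InputFormat<String> format) { this.format = format; }
    List<String> scan() { return format.read(); }
}

class HiveTableFactory {
    // The factory wires a source to its underlying input format.
    HiveTableSource createSource(List<String> backingRows) {
        return new HiveTableSource(new HiveTableInputFormat(backingRows));
    }
}
```

The write path mirrors this with HiveTableSink wrapping HiveTableOutputFormat.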
Roadmap
● Basic integration
○ Read access to Hive
○ Table metadata only
○ Simple data types
○ Demo version
Roadmap (cont’d)
● Deep integration
○ Read/Write Hive metadata, data
○ Most data types
○ Basic DDL/DML
○ All meta objects (tables, functions, views, etc)
○ MVP version
Roadmap (cont’d)
● Complete integration
○ Complete DDL/DML
○ All data types
○ Temporary meta objects (functions, tables)
○ Support all user custom objects defined in Hive (serdes, storage handlers)
○ Usability, stability
○ First release
Roadmap (cont’d)
● Longer term
○ Optimizations
○ Feature parity
○ Regular maintenance and releases
● Hive on Flink
Current Progress and Development Plan
Bowen Li
Integrating Flink with Hive - Phases
This is a major change, work needs to be broken into steps
Phase 1: Unified Catalog APIs (FLIP-30, FLINK-11275)
Phase 2: Integrate Flink with Hive (FLINK-10556)
● for metadata thru Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Phase 3: Support SQL DDL in Flink (FLINK-10232)
Phase 1: Unified Catalogs APIs
Flink current status:
Barely any catalog support; Cannot connect to external catalogs
What we have done:
Introduced new catalog APIs, ReadableCatalog and ReadableWritableCatalog, and a framework to
connect them to Calcite, supporting
● Meta-Objects
○ Database, Table, View, Partition, Functions, TableStats, and PartitionStats
● Operations
○ Get/List/Exist/Create/Alter/Rename/Drop
Status: Done internally, ready to submit PR to community
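A minimal sketch of the read-only/read-write split described above, with a toy in-memory implementation in the spirit of GenericInMemoryCatalog; method names follow the slide, not the exact FLIP-30 signatures:

```java
import java.util.*;

// Illustrative sketch of the ReadableCatalog / ReadableWritableCatalog split;
// these are NOT the actual Flink interfaces, just the shape of the idea.
interface ReadableCatalog {
    List<String> listDatabases();
    List<String> listTables(String db);
    boolean tableExists(String db, String table);
}

interface ReadableWritableCatalog extends ReadableCatalog {
    void createTable(String db, String table);
    void dropTable(String db, String table);
}

// Toy in-memory, non-persistent implementation.
class ToyInMemoryCatalog implements ReadableWritableCatalog {
    private final Map<String, Set<String>> dbs = new HashMap<>();

    ToyInMemoryCatalog(String... databases) {
        for (String db : databases) dbs.put(db, new TreeSet<>());
    }
    public List<String> listDatabases() { return new ArrayList<>(dbs.keySet()); }
    public List<String> listTables(String db) { return new ArrayList<>(dbs.get(db)); }
    public boolean tableExists(String db, String table) {
        return dbs.containsKey(db) && dbs.get(db).contains(table);
    }
    public void createTable(String db, String table) { dbs.get(db).add(table); }
    public void dropTable(String db, String table) { dbs.get(db).remove(table); }
}
```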
Phase 1: Unified Catalogs APIs
Flink current status:
No view support; currently views are tables.
The function catalog doesn’t have a well-managed hierarchy and cannot persist UDFs.
What we have done:
● Added true view support, users can create and query views on top of tables in Flink
● Introduced a new UDF management system with proper namespace and hierarchy for Flink,
based on ReadableCatalog/ReadableWritableCatalog
Status: Mostly done internally, ready to submit PR to community
Phase 1: Unified Catalogs APIs
Flink current status:
No well-structured hierarchy to manage metadata
Endless nested Calcite schemas (think of it as a db under a db under a db under …)
What we have done:
● Introduced two-level management and reference structure: <catalog>.<db>.<meta-object>
● Added CatalogManager:
○ manages all registered catalogs in TableEnvironment
○ has a concept of a default catalog and default database to simplify queries
select * from mycatalog.mydb.myTable
=== can be simplified as ===>>>
select * from myTable
Status: Done internally, ready to submit PR to community
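The default-catalog/default-database resolution above amounts to expanding a short name into the full three-part form. A self-contained sketch (the real CatalogManager does much more; this only shows the name expansion):

```java
// Hypothetical sketch of how a CatalogManager might resolve a table
// reference against a default catalog and database, as described above.
class ToyCatalogManager {
    private final String defaultCatalog;
    private final String defaultDatabase;

    ToyCatalogManager(String defaultCatalog, String defaultDatabase) {
        this.defaultCatalog = defaultCatalog;
        this.defaultDatabase = defaultDatabase;
    }

    // Expands a 1-, 2-, or 3-part name to the full <catalog>.<db>.<meta-object> form.
    String qualify(String name) {
        String[] parts = name.split("\\.");
        switch (parts.length) {
            case 1: return defaultCatalog + "." + defaultDatabase + "." + parts[0];
            case 2: return defaultCatalog + "." + parts[0] + "." + parts[1];
            case 3: return name;
            default: throw new IllegalArgumentException("too many name parts: " + name);
        }
    }
}
```

With defaults set to mycatalog/mydb, `myTable`, `mydb.myTable`, and `mycatalog.mydb.myTable` all resolve to the same object.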
Phase 1: Unified Catalogs APIs
Flink current status:
No production-ready catalogs
What we have done:
Developed three production-ready catalogs
● GenericInMemoryCatalog
○ in-memory, non-persistent, per session, default
● HiveCatalog
○ compatible with Hive, able to read/write Hive meta-objects
● GenericHiveMetastoreCatalog
○ persists Flink streaming and batch meta-objects
Status: Done internally, ready to submit PR to community
Catalogs are pluggable, which opens opportunities for
● Catalogs for MQs
○ Kafka (Confluent Schema Registry), RabbitMQ, Pulsar, RocketMQ, etc
● Catalogs for structured data
○ RDBMSs like MySQL, etc
● Catalogs for semi-structured data
○ ElasticSearch, HBase, Cassandra, etc
● Catalogs for your other favorite data management system
○ …...
Phase 1: Unified Catalogs APIs
Phase 2: Flink-Hive Integration - Metadata - HiveCatalog
Flink can read and write Hive metadata thru HiveCatalog
Flink current status:
Flink’s batch is great, but cannot run against the most widely used data warehouse - Hive
What we have done:
Developed HiveCatalog
● Flink can read Hive meta-objects, like tables, views, functions, and table/partition stats, thru HiveCatalog
● Flink can create Hive meta-objects and write them back to Hive via HiveCatalog, such that Hive can consume them
Phase 2: Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Flink current status:
Flink’s metadata cannot be persisted anywhere
Users have to recreate metadata like tables/functions for every new session, which is very inconvenient
What we have done:
● Persist Flink’s metadata (both streaming and batch) by using Hive Metastore purely as storage
HiveCatalog vs. GenericHiveMetastoreCatalog
● HiveCatalog: for Hive batch metadata; Hive can understand it
● GenericHiveMetastoreCatalog: for any streaming and batch metadata; Hive may not understand it
Phase 2: Flink-Hive Integration - Data
Flink current status:
Has a flink-hcatalog module, but last modified 2 years ago - not really usable.
HCatalog also cannot access all Hive data
What we have done:
Connector:
○ Developed HiveTableSource, which supports reading both partitioned and non-partitioned tables
and views, with partition pruning
○ Working on HiveTableSink
Data Types:
○ Added support for all Hive simple data types
○ Working on supporting Hive complex data types (array, map, struct, etc)
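Partition pruning, mentioned above, boils down to evaluating the query's filter against each partition's key values before any file is read, so only matching partitions are ever scanned. A toy sketch (names are illustrative, not Flink APIs):

```java
import java.util.*;
import java.util.function.Predicate;

// Toy illustration of partition pruning: keep only partitions whose
// key/value spec satisfies the filter, so the rest are never scanned.
// These names are illustrative stand-ins, not Flink classes.
class HivePartition {
    final Map<String, String> spec;  // e.g. {dt=2019-02-21, region=us}
    HivePartition(Map<String, String> spec) { this.spec = spec; }
}

class PartitionPruner {
    static List<HivePartition> prune(List<HivePartition> all,
                                     Predicate<Map<String, String>> filter) {
        List<HivePartition> kept = new ArrayList<>();
        for (HivePartition p : all) {
            if (filter.test(p.spec)) kept.add(p);  // only these get read
        }
        return kept;
    }
}
```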
Phase 2: Flink-Hive Integration - Hive Compatibility
● Hive version
○ Officially support Hive 2.3.4 for now
○ We plan to support more Hive versions in the near future
● The above features are partially available in the Blink branch of Flink, released in Jan 2019
○ https://github.com/apache/flink/tree/blink
Phase 3: Support SQL DDL + DML in Flink
● In progress
● Some will be shown in the demo
Example and Demo Time!
Query your Hive data from Flink!
Table API Example
BatchTableEnvironment tEnv = ...
tEnv.registerCatalog(new HiveCatalog("myHive1", "thrift:xxx"));
tEnv.registerCatalog(new HiveCatalog("myHive2", hiveConf));
tEnv.setDefaultDatabase("myHive1", "myDb");
// Read Hive meta-objects
ReadableCatalog myHive1 = tEnv.getCatalog("myHive1");
myHive1.listDatabases();
myHive1.listTables("myDb");
ObjectPath myTablePath = new ObjectPath("myDb", "myTable");
myHive1.getTable(myTablePath);
myHive1.listPartitions(myTablePath);
// Query Hive data
tEnv.sqlQuery("select * from myTable").print();
SQL Client Example
// Register catalogs in sql-cli-defaults.yml
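The slide does not show the file contents. A hypothetical sketch of what such a registration could look like; the keys below are illustrative assumptions, as the actual YAML format was still under design at the time of this talk:

```yaml
# Hypothetical sketch of catalog registration for the SQL Client;
# key names are assumptions, not the final format.
catalogs:
  - name: myhive1
    type: hive
    hive-metastore-uri: thrift://host:9083
  - name: mygeneric
    type: generic-hive-metastore
```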
SQL Client Example (cont’)
Flink SQL> SHOW CATALOGS;
myhive1
mygeneric
Flink SQL> USE myhive1.myDb;
Flink SQL> SHOW DATABASES;
myDb
Flink SQL> SHOW TABLES;
myTable
Flink SQL> DESCRIBE myHiveTable;
...
Flink SQL> SELECT * FROM myHiveView;
...
Happy Live Demo on SQL CLI!
This tremendous amount of work could not happen without help and support
Shout out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
THANKS
https://www.meetup.com/seattle-apache-flink/