Integrating Apache Flink with Apache Hive
Xuefu Zhang, Senior Staff Engineer, Hive PMC
Bowen Li, Senior Engineer
Seattle Apache Flink Meetup - 02/21/2019
Agenda
● Background
● Motivations
● Goals
● Architecture and Design
● Roadmap
● Current Progress
● Demo
● Q&A
Background
● Stream analytics users usually also have offline, batch analytics
● Many batch users want to reduce latency by moving some of their analytics to
real-time
● AI is a major driving force behind both real-time and batch analytics (training
a model offline and applying it in real time)
● ETL is still an important use case for big data
● SQL is the main tool for processing big data, streaming or batch
Background (cont’d)
● Flink has shown prevailing advantages over other solutions for
heavy-volume stream processing
1.7B events/sec · EB total · PB every day · 1T events/day
Background (cont’d)
● In Blink, we systematically explored Flink’s capabilities in batch processing
● Flink shows great potential
TPC-DS: Blink vs. Spark (the lower, the better)
Observation: the larger the data size, the more performance advantage Flink has
Performance of Blink versus Spark 2.3.1 in the TPC-DS benchmark, aggregate time for all queries together.
Presentation by Xiaowei Jiang at Flink Forward Beijing, Dec 2018.
Flink is the fastest due to its pipelined execution
Tez and Spark do not overlap 1st and 2nd stages
MapReduce is slow despite overlapping stages
A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Background (cont’d)
● Flink needs a persistent storage for its metadata
Background (cont’d)
● Hive is the de facto standard for big data/batch processing on Hadoop
● Hive is the center of the big data ecosystem with its metadata store
● Streaming users usually have a Hive deployment and need to access data/metadata managed by
Hive
● For Hive users, new requirements may arise for stream processing
Integrating Flink with Hive
Motivations
● Strengthen Flink’s lead in stream processing
● Unify a solution for both stream and batch processing
● Provide a unified SQL interface for the solution
● Enrich Flink ecosystem
● Promote Flink’s adoption
Goals
● Access all Hive metadata stored in Hive metastore
● Access all Hive data managed by Hive
● Store Flink metadata (both streaming or batch) in Hive metastore
● Compatible with Hive grammar while abiding by the SQL standard
● Support user custom objects defined in Hive so they continue to work in Flink SQL
○ UDFs
○ serdes
○ storage handlers
Goals (cont’d)
● Feature parity with Hive (partitioning, bucketing, etc)
● Data types compatibility
● Make Flink an alternative, more powerful engine in Hive (longer term)
Integrating Flink with Hive - Goals
Flink DML or Hive DML
Hive Metadata is 100% compatible
(Data types, Tables, Views, UDFs)
Hive Data is 100% accessible
Architecture
Flink Deployment
Flink Runtime
Query processing & optimization
Table API and SQL Catalog APIs
SQL Client/Zeppelin
Design
GenericInMemoryCatalog
GenericHiveMetastoreCatalog
ReadableCatalog
ReadableWritableCatalog
HiveCatalog
HiveMetastoreClient
CatalogManager
TableEnvironment
inheritance reference
SQL Client HiveCatalogBase
Hive Metastore
Catalog APIs
Design (cont’d)
BatchTableFactory
HiveTableFactory
BatchTableSource
HiveTableSource
InputFormat
HiveTableInputFormat
BatchTableSink
HiveTableSink
OutputFormat
HiveTableOutputFormat
Read
Write
Hive Data
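As a very rough illustration of the read path in this diagram - a factory creates a table source, which reads through an input format - here is a self-contained sketch; the class bodies are illustrative stand-ins, not the actual Flink implementations:

```java
import java.util.List;

// Illustrative stand-ins for the chain in the diagram above:
// HiveTableFactory -> HiveTableSource -> HiveTableInputFormat.
// These are sketches, not the actual Flink classes.
interface InputFormat<T> {
    List<T> read();
}

class HiveTableInputFormat implements InputFormat<String> {
    private final List<String> rows;  // stands in for data read from Hive files
    HiveTableInputFormat(List<String> rows) { this.rows = rows; }
    public List<String> read() { return rows; }
}

class HiveTableSource {
    private final InputFormat<String> format;
    HiveTableSource(InputFormat<String> format) { this.format = format; }
    List<String> scan() { return format.read(); }
}

class HiveTableFactory {
    // The factory wires a source to its underlying input format.
    HiveTableSource createSource(List<String> backingRows) {
        return new HiveTableSource(new HiveTableInputFormat(backingRows));
    }
}
```

The write path mirrors this with HiveTableSink wrapping HiveTableOutputFormat.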
Roadmap
● Basic integration
○ Read access to Hive
○ Table metadata only
○ Simple data types
○ Demo version
Roadmap (cont’d)
● Deep integration
○ Read/Write Hive metadata, data
○ Most data types
○ Basic DDL/DML
○ All meta objects (tables, functions, views, etc)
○ MVP version
Roadmap (cont’d)
● Complete integration
○ Complete DDL/DML
○ All data types
○ Temporary meta objects (functions, tables)
○ Support all user custom objects defined in Hive (serdes, storage handlers)
○ Usability, stability
○ First release
Roadmap (cont’d)
● Longer term
○ Optimizations
○ Feature parity
○ Regular maintenance and releases
● Hive on Flink
Current Progress and Development Plan
Bowen Li
Integrating Flink with Hive - Phases
This is a major change, work needs to be broken into steps
Phase 1: Unified Catalog APIs (FLIP-30, FLINK-11275)
Phase 2: Integrate Flink with Hive (FLINK-10556)
● for metadata thru Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Phase 3: Support SQL DDL in Flink (FLINK-10232)
Phase 1: Unified Catalogs APIs
Flink current status:
Barely any catalog support; Cannot connect to external catalogs
What we have done:
Introduced new catalog APIs, ReadableCatalog and ReadableWritableCatalog, and a framework to
connect them to Calcite, supporting
● Meta-Objects
○ Database, Table, View, Partition, Functions, TableStats, and PartitionStats
● Operations
○ Get/List/Exist/Create/Alter/Rename/Drop
Status: Done internally, ready to submit PR to community
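A minimal sketch of the read-only/read-write split described above, with a toy in-memory implementation in the spirit of GenericInMemoryCatalog; method names follow the slide, not the exact FLIP-30 signatures:

```java
import java.util.*;

// Illustrative sketch of the ReadableCatalog / ReadableWritableCatalog split;
// these are NOT the actual Flink interfaces, just the shape of the idea.
interface ReadableCatalog {
    List<String> listDatabases();
    List<String> listTables(String db);
    boolean tableExists(String db, String table);
}

interface ReadableWritableCatalog extends ReadableCatalog {
    void createTable(String db, String table);
    void dropTable(String db, String table);
}

// Toy in-memory, non-persistent implementation.
class ToyInMemoryCatalog implements ReadableWritableCatalog {
    private final Map<String, Set<String>> dbs = new HashMap<>();

    ToyInMemoryCatalog(String... databases) {
        for (String db : databases) dbs.put(db, new TreeSet<>());
    }
    public List<String> listDatabases() { return new ArrayList<>(dbs.keySet()); }
    public List<String> listTables(String db) { return new ArrayList<>(dbs.get(db)); }
    public boolean tableExists(String db, String table) {
        return dbs.containsKey(db) && dbs.get(db).contains(table);
    }
    public void createTable(String db, String table) { dbs.get(db).add(table); }
    public void dropTable(String db, String table) { dbs.get(db).remove(table); }
}
```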
Phase 1: Unified Catalogs APIs
Flink current status:
No view support; currently views are tables.
The function catalog doesn’t have a well-managed hierarchy and cannot persist UDFs.
What we have done:
● Added true view support, users can create and query views on top of tables in Flink
● Introduced a new UDF management system with proper namespace and hierarchy for Flink,
based on ReadableCatalog/ReadableWritableCatalog
Status: Mostly done internally, ready to submit PR to community
Phase 1: Unified Catalogs APIs
Flink current status:
No well-structured hierarchy to manage metadata
Endless nested Calcite schemas (think of it as a db under a db under a db under …)
What we have done:
● Introduced two-level management and reference structure: <catalog>.<db>.<meta-object>
● Added CatalogManager:
○ manages all registered catalogs in TableEnvironment
○ has a concept of a default catalog and default database to simplify queries
select * from mycatalog.mydb.myTable
=== can be simplified as ===>>>
select * from myTable
Status: Done internally, ready to submit PR to community
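The default-catalog/default-database resolution above amounts to expanding a short name into the full three-part form. A self-contained sketch (the real CatalogManager does much more; this only shows the name expansion):

```java
// Hypothetical sketch of how a CatalogManager might resolve a table
// reference against a default catalog and database, as described above.
class ToyCatalogManager {
    private final String defaultCatalog;
    private final String defaultDatabase;

    ToyCatalogManager(String defaultCatalog, String defaultDatabase) {
        this.defaultCatalog = defaultCatalog;
        this.defaultDatabase = defaultDatabase;
    }

    // Expands a 1-, 2-, or 3-part name to the full <catalog>.<db>.<meta-object> form.
    String qualify(String name) {
        String[] parts = name.split("\\.");
        switch (parts.length) {
            case 1: return defaultCatalog + "." + defaultDatabase + "." + parts[0];
            case 2: return defaultCatalog + "." + parts[0] + "." + parts[1];
            case 3: return name;
            default: throw new IllegalArgumentException("too many name parts: " + name);
        }
    }
}
```

With defaults set to mycatalog/mydb, `myTable`, `mydb.myTable`, and `mycatalog.mydb.myTable` all resolve to the same object.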
Phase 1: Unified Catalogs APIs
Flink current status:
No production-ready catalogs
What we have done:
Developed three production-ready catalogs
● GenericInMemoryCatalog
○ in-memory, non-persistent, per session, default
● HiveCatalog
○ compatible with Hive, able to read/write Hive meta-objects
● GenericHiveMetastoreCatalog
○ persists Flink streaming and batch meta-objects
Status: Done internally, ready to submit PR to community
Catalogs are pluggable, which opens opportunities for
● Catalogs for MQs
○ Kafka (Confluent Schema Registry), RabbitMQ, Pulsar, RocketMQ, etc
● Catalogs for structured data
○ RDBMSs like MySQL, etc
● Catalogs for semi-structured data
○ ElasticSearch, HBase, Cassandra, etc
● Catalogs for your other favorite data management system
○ …...
Phase 1: Unified Catalogs APIs
Phase 2: Flink-Hive Integration - Metadata - HiveCatalog
Flink can read and write Hive metadata thru HiveCatalog
Flink current status:
Flink’s batch is great, but cannot run against the most widely used data warehouse - Hive
What we have done:
Developed HiveCatalog
● Flink can read Hive meta-objects, like tables, views, functions, and table/partition stats, thru HiveCatalog
● Flink can create Hive meta-objects and write them back to Hive via HiveCatalog, such that Hive can consume them
Phase 2: Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Flink current status:
Flink’s metadata cannot be persisted anywhere
Users have to recreate metadata like tables/functions for every new session, which is very inconvenient
What we have done:
● Persist Flink’s metadata (both streaming and batch) by using Hive Metastore purely as storage
HiveCatalog vs. GenericHiveMetastoreCatalog
● HiveCatalog: for Hive batch metadata; Hive can understand it
● GenericHiveMetastoreCatalog: for any streaming and batch metadata; Hive may not understand it
Phase 2: Flink-Hive Integration - Data
Flink current status:
Has a flink-hcatalog module, but last modified 2 years ago - not really usable.
HCatalog also cannot access all Hive data
What we have done:
Connector:
○ Developed HiveTableSource, which supports reading both partitioned and non-partitioned tables
and views, with partition pruning
○ Working on HiveTableSink
Data Types:
○ Added support for all Hive simple data types
○ Working on supporting Hive complex data types (array, map, struct, etc)
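Partition pruning, mentioned above, boils down to evaluating the query's filter against each partition's key values before any file is read, so only matching partitions are ever scanned. A toy sketch (names are illustrative, not Flink APIs):

```java
import java.util.*;
import java.util.function.Predicate;

// Toy illustration of partition pruning: keep only partitions whose
// key/value spec satisfies the filter, so the rest are never scanned.
// These names are illustrative stand-ins, not Flink classes.
class HivePartition {
    final Map<String, String> spec;  // e.g. {dt=2019-02-21, region=us}
    HivePartition(Map<String, String> spec) { this.spec = spec; }
}

class PartitionPruner {
    static List<HivePartition> prune(List<HivePartition> all,
                                     Predicate<Map<String, String>> filter) {
        List<HivePartition> kept = new ArrayList<>();
        for (HivePartition p : all) {
            if (filter.test(p.spec)) kept.add(p);  // only these get read
        }
        return kept;
    }
}
```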
Phase 2: Flink-Hive Integration - Hive Compatibility
● Hive version
○ Officially support Hive 2.3.4 for now
○ We plan to support more Hive versions in the near future
● The above features are partially available in the Blink branch of Flink, released in Jan 2019
○ https://github.com/apache/flink/tree/blink
Phase 3: Support SQL DDL + DML in Flink
● In progress
● Some will be shown in the demo
Example and Demo Time!
Query your Hive data from Flink!
Table API Example
BatchTableEnvironment tEnv = ...
tEnv.registerCatalog(new HiveCatalog("myHive1", "thrift:xxx"));
tEnv.registerCatalog(new HiveCatalog("myHive2", hiveConf));
tEnv.setDefaultDatabase("myHive1", "myDb");
// Read Hive meta-objects
ReadableCatalog myHive1 = tEnv.getCatalog("myHive1");
myHive1.listDatabases();
myHive1.listTables("myDb");
ObjectPath myTablePath = new ObjectPath("myDb", "myTable");
myHive1.getTable(myTablePath);
myHive1.listPartitions(myTablePath);
// Query Hive data
tEnv.sqlQuery("select * from myTable").print();
SQL Client Example
// Register catalogs in sql-cli-defaults.yml
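The slide does not show the file contents. A hypothetical sketch of what such a registration could look like; the keys below are illustrative assumptions, as the actual YAML format was still under design at the time of this talk:

```yaml
# Hypothetical sketch of catalog registration for the SQL Client;
# key names are assumptions, not the final format.
catalogs:
  - name: myhive1
    type: hive
    hive-metastore-uri: thrift://host:9083
  - name: mygeneric
    type: generic-hive-metastore
```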
SQL Client Example (cont’)
Flink SQL> SHOW CATALOGS;
myhive1
mygeneric
Flink SQL> USE myhive1.myDb;
Flink SQL> SHOW DATABASES;
myDb
Flink SQL> SHOW TABLES;
myTable
Flink SQL> DESCRIBE myHiveTable;
...
Flink SQL> SELECT * FROM myHiveView;
...
Happy Live Demo on SQL CLI!
This tremendous amount of work could not happen without help and support
Shout out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
THANKS
https://www.meetup.com/seattle-apache-flink/