Multi-tenant Flink as-a-service with Kafka on Hopsworks

Multi-Tenant Flink-as-a-Service on YARN
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
Slides by Jim Dowling, Theofilos Kakantousis
Berlin, 13th September 2016
www.hops.io
@hopshadoop

Polyglot Data Parallel Processing
•Stream Processing
- Beam/Flink, Spark
•ETL/Batch Processing
- Spark, MapReduce
•SQL-on-hadoop
- Hive, Presto, SparkSQL
•Distributed ML
- SparkML, FlinkML
•Deep Learning
- Distributed Tensorflow
3

Flink Standalone good enough for some
•Enterprises are polyglot due to economies of scale
•Standalone Flink works great for enterprises
- Dedicate some servers
- Dedicate some SREs
4

Polyglot Data Parallel Processing In Context
5
Data Processing
Spark, MR, Flink, Presto, Tensorflow
Storage
HDFS, MapR, S3, WAS
Resource Management
YARN, Mesos, Kubernetes
Metadata
Hive, Parquet, Authorization, Search

Flink for the Little Guy
•Flink-as-a-Service on Hops Hadoop
- Fully UI Driven, Easy to Install
•Project-Based Multi-tenancy
6
Hops

Flink-as-a-Service running on hops.site
7
SICS ICE: A datacenter research and test environment
Purpose: Increase knowledge, strengthen universities, companies and researchers

HopsFS Architecture
8
NameNodes
NDB
Leader
HDFS Client
DataNodes

Hops-YARN Architecture
9
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Heartbeats
(70-95%)
AM Reqs
(5-30%)

HopsFS Throughput (Spotify Workload)
10
NDB Setup: 8 Nodes using Xeon E5-2620 2.40GHz Processors and 10GbE.
NameNodes: Xeon E5-2620 2.40GHz Processors machines and 10GbE.

HopsFS Metadata Scaleout
11Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop

Hopsworks – Project-Based Multi-Tenancy
•A project is a collection of
- Users with Roles
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•Per-Project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style Pricing
•Sharing across Projects
- Datasets/Topics
13
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS

Hopsworks – Dynamic Roles
14
Alice@gmail.com
NSA__Alice
Authenticate
Users__Alice
Glassfish
HopsFS
HopsYARN
Projects
Secure
Impersonation
Kafka
X.509
Certificates

Look Ma, No Kerberos!
•For each project, a user is issued with a X.509
certificate, containing the project-specific userID.
•Services are also issued with X.509 certificates.
- Both user and service certs are signed with the same CA.
- Services extract the userID from RPCs to identify the caller.
•Netflix’ BLESS system is a similar model, with short-
lived certificates.

X.509 Certificate Per Project-Specific User
16
Alice@gmail.com
Authenticate
Add/Del
Users
Distributed
Database
Insert/Remove CertsProject
Mgr
Root
CA
Services
Hadoop
Spark
Kafka
etc
Cert Signing
Requests

Flink on YARN
•Two modes: detached or blocking
•Hopsworks supports detached mode
- Client started locally, then exits after the job is submitted
to YARN
- No accumulator results or exceptions from the
ExecutionEnvironment.execute()
- Can only kill YARN job, not Flink session. Cleanup issues.
•New Architecture proposal for a Flink Dispatcher

A Flink/Kafka Job on YARN with Hopsworks
18
Alice@gmail.com
1. Launch Flink Job
Distributed
Database
2. Get certs,
service endpoints
YARN Private
LocalResources
Flink/Kafka Streaming App
4. Materialize certs
3. YARN Job + config
6. Get Schema
7. Consume
Produce
5. Read Certs
Hopsworks
KafkaUtil

Flink Stream Producer in Secure Kafka
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
String topic = parameterTool.get("topic");
1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: producer using Kafka Properties
4. Distribute: X.509 certs to all hosts on the cluster
5. Download: the Schema for the Topic from the Schema Registry
6. Do this all securely
DataStream<…> messageStream = env.addSource(…);
messageStream.addSink(producer);
env.execute("Write to Kafka");
19
Developer
Operations

Flink/Kafka Stream Producer in Hopsworks
FlinkProducer producer = KafkaUtil.getFlinkProducer(topic);
DataStream<…> messageStream = env.addSource(…);
messageStream.addSink(producer);
env.execute("Write to Kafka");
20https://github.com/hopshadoop/hops-kafka-examples

Flink/Kafka Stream Consumer in Hopsworks
FlinkConsumer consumer = KafkaUtil.getFlinkConsumer(topic);
DataStream<…> messageStream = env.addSource(consumer);
RollingSink<String> rollingSink = ... // HDFS path
messageStream.addSink(rollingSink);
env.execute(“Read from Kafka, write to HDFS");
21https://github.com/hopshadoop/hops-kafka-examples

Karamel/Chef for Automated Installation
23
Google Compute Engine BareMetal

Summary
•Hopsworks provides first-class support for
Flink-as-a-Service
- Streaming or Batch Jobs
- Zeppelin Notebooks
•Hopworks simplifies secure use of Kafka in Flink on
YARN
•YARN support for Flink still a work-in-progress
25

Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Konstantin Popov, Antonios Kouzoupis.
Ermias Gebremeskel, Daniel Bekele
Alumni: Vasileios Giannokostas, Misganu Dessalegn,
Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
K “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Hops
[Hadoop For Humans]
Join us!
http://github.com/hopshadoop

Multi-tenant Flink as-a-service with Kafka on Hopsworks

More Related Content

What's hot (20)

Similar to Multi-tenant Flink as-a-service with Kafka on Hopsworks (20)

More from Jim Dowling (20)

Recently uploaded (20)

Multi-tenant Flink as-a-service with Kafka on Hopsworks

Editor's Notes