Spark Streaming-as-a-Service with Kafka and YARN
Jim Dowling
KTH Royal Institute of Technology, Stockholm
Senior Researcher, SICS
CEO, Logical Clocks AB
Spark Streaming-as-a-Service in Sweden
• SICS ICE: datacenter research environment
• Hopsworks: Spark/Flink/Kafka/Tensorflow/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– >130 active users
Hadoop is not a cool kid anymore!
Hadoop’s Evolution
2009 2016
?
Tiny Brain
(NameNode, ResourceMgr)
Huge Body (DataNodes)
Build out Hadoop’s Brain with External
Weakly Consistent MetaData Services
Google-Glass Approach to Intelligence
NameNodes
NDB
HDFS Client
DataNodes
>37X Capacity
>16X Throughput
HopsFS
Larger Brains => Bigger, Faster*
16x Performance on Spotify Workload
*Usenix FAST 2017, HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
Hopsworks
• Projects
– Datasets/Files
– Topics
– Jobs/Notebooks
Hadoop
• Clusters
• Users
• Jobs/Applications
• Files
• ACLs
• Sys Admins
• Kerberos
Larger Brains => More Intelligent*
*HMGA2 gene mutations correlated with increased intracranial volume as well as enhanced IQ.
http://newsroom.ucla.edu/releases/international-team-uncovers-new-231989
User-Friendly Concepts
http://www.ibtimes.co.uk/embargoed-8pm-25th-jan-size-matters-brain-size-relative-body-size-indicates-animals-ability-1539994
YARN Spark Streaming Support
• Apache Kafka
• ELK Stack
– Real-time Logs
• Grafana/InfluxDB
– Monitoring
Hopsworks
YARN aggregates logs on job completion
http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
Kafka Self-Service UI
Manage & Share
• Topics
• ACLs
• Avro Schemas
Logs
Elasticsearch, Logstash, Kibana (ELK Stack)
Monitoring/Alerting
InfluxDB and Grafana
metrics.properties: StreamingMetrics.streaming.lastReceivedBatch_records == 0
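The condition above reads as an alert rule: raise an alarm when the streaming app's last batch contained no records. A minimal Java sketch of evaluating such a rule over recent samples follows. The class name, the method name, and the idea of requiring several consecutive zero samples before alerting are assumptions for illustration; the slide only states the `== 0` condition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical evaluator for the "lastReceivedBatch_records == 0" rule.
final class StalledStreamAlert {

    // Returns true when the most recent `window` samples are all zero.
    // Requiring a window (rather than a single zero) is an assumption
    // made here to avoid alerting on one empty batch.
    static boolean shouldAlert(List<Long> samples, int window) {
        if (window <= 0 || samples.size() < window) {
            return false; // not enough data to decide
        }
        for (int i = samples.size() - window; i < samples.size(); i++) {
            if (samples.get(i) != 0L) {
                return false; // at least one recent batch had records
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> recent = Arrays.asList(120L, 0L, 0L, 0L);
        System.out.println(shouldAlert(recent, 3));
    }
}
```

In practice the rule would be configured in the monitoring stack itself rather than in application code; the sketch only spells out the logic.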
Zeppelin for Prototyping Streaming Apps
[https://github.com/knockdata/spark-highcharts]
Debugging Spark with Dr. Elephant
• Analyzes Spark jobs for errors and common problems using pluggable heuristics
• Doesn’t show killed jobs
• No online support for streaming apps yet
Integration as Microservices in Hopsworks
• Project-based Multi-tenancy
• Self-Service UI
• Simplifying Spark Streaming Apps
Projects in Hopsworks
[Diagram labels: Proj-All, Proj-X, Proj-42, Shared Topic, Topic, /Projs/My/Data, CompanyDB]
User roles
Data Owner
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist
- Write and Run code
Self-Service Administration – No Administrator Needed
Notebooks, Data sharing and Quotas
• Zeppelin Notebooks in HDFS, Jobs launcher UI.
• Sharing is not Copying
– Datasets/Topics
• Per-Project quotas
– Storage in HDFS
– CPU in YARN (Uber-style Pricing)
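"Sharing is not Copying" can be pictured as granting another project access to the same dataset or topic instead of duplicating it. The Java sketch below illustrates that idea only; the class and method names are hypothetical and are not the Hopsworks data model.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: one dataset or topic, owned by one project,
// made visible to other projects without creating a copy.
final class SharedResource {
    private final String name;
    private final String ownerProject;
    private final Set<String> sharedWith = new HashSet<>();

    SharedResource(String name, String ownerProject) {
        this.name = name;
        this.ownerProject = ownerProject;
    }

    // Sharing records a grant; the underlying data is untouched.
    void shareWith(String project) {
        sharedWith.add(project);
    }

    boolean isAccessibleFrom(String project) {
        return ownerProject.equals(project) || sharedWith.contains(project);
    }

    String name() {
        return name;
    }

    public static void main(String[] args) {
        SharedResource topic = new SharedResource("SharedTopic", "Proj-X");
        topic.shareWith("Proj-42");
        System.out.println(topic.isAccessibleFrom("Proj-42"));
        System.out.println(topic.isAccessibleFrom("Proj-All"));
    }
}
```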
Dynamic roles
alice@gmail.com
ProjectA
Authenticate
ProjectB
HopsFS
YARN
Kafka
SSL/TLS
Certificates
Secure
Impersonation
ProjectA__alice
ProjectB__alice
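The diagram suggests that a single login (alice@gmail.com) maps to a distinct project-specific user in each project, named with a double underscore (ProjectA__alice, ProjectB__alice). A minimal Java sketch of that naming scheme follows. The class and method names are hypothetical, and the restriction that project names neither contain "__" nor end in "_" is an assumption made here so that a qualified name can be split unambiguously.

```java
// Hypothetical helper illustrating the ProjectA__alice naming on the slide.
final class ProjectUser {
    private static final String SEPARATOR = "__";

    final String project;
    final String user;

    ProjectUser(String project, String user) {
        // Assumption: these restrictions keep parse() unambiguous.
        if (project.isEmpty() || user.isEmpty()
                || project.contains(SEPARATOR) || project.endsWith("_")) {
            throw new IllegalArgumentException("invalid project or user name");
        }
        this.project = project;
        this.user = user;
    }

    // Builds the project-specific user name, e.g. ProjectA__alice.
    String qualifiedName() {
        return project + SEPARATOR + user;
    }

    // Splits a qualified name at the first separator.
    static ProjectUser parse(String qualified) {
        int idx = qualified.indexOf(SEPARATOR);
        if (idx < 0) {
            throw new IllegalArgumentException("missing separator: " + qualified);
        }
        return new ProjectUser(
                qualified.substring(0, idx),
                qualified.substring(idx + SEPARATOR.length()));
    }

    public static void main(String[] args) {
        System.out.println(new ProjectUser("ProjectA", "alice").qualifiedName());
    }
}
```

Because each project-specific user is a separate identity, the same person holds different permissions in ProjectA and ProjectB.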
Look Ma, no Kerberos
• Each project-specific user is issued an SSL/TLS (X.509) certificate, used for both authentication and encryption.
• Services are also issued SSL/TLS certificates.
– Same root CA as user certs
Simplifying Spark Streaming Apps
• Spark Streaming Applications need to know
– Credentials
• Hadoop, Kafka, InfluxDB, Logstash
– Endpoints
• Kafka Broker, Kafka Schema Registry, ResourceManager, NameNode, InfluxDB, Logstash
• The HopsUtil API hides this complexity.
– Location/security transparent Spark applications
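To make the hidden complexity concrete, the Java sketch below shows the kind of Kafka client settings an application would otherwise assemble by hand from credentials and endpoints. It is not the HopsUtil implementation: the two class names and all example values are hypothetical. Only the property keys are standard Kafka client configuration names.

```java
import java.util.Properties;

// Hypothetical holder for values the platform would look up for the app.
final class ClientSecurityConfig {
    final String brokerEndpoint;
    final String keyStorePath;
    final String trustStorePath;
    final String password;

    ClientSecurityConfig(String brokerEndpoint, String keyStorePath,
                         String trustStorePath, String password) {
        this.brokerEndpoint = brokerEndpoint;
        this.keyStorePath = keyStorePath;
        this.trustStorePath = trustStorePath;
        this.password = password;
    }
}

// Hypothetical builder; a utility library could do this behind one call.
final class KafkaSslProperties {
    static Properties build(ClientSecurityConfig c) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", c.brokerEndpoint);
        props.setProperty("security.protocol", "SSL");
        props.setProperty("ssl.keystore.location", c.keyStorePath);
        props.setProperty("ssl.keystore.password", c.password);
        props.setProperty("ssl.key.password", c.password);
        props.setProperty("ssl.truststore.location", c.trustStorePath);
        props.setProperty("ssl.truststore.password", c.password);
        return props;
    }

    public static void main(String[] args) {
        // All values below are placeholders.
        ClientSecurityConfig c = new ClientSecurityConfig(
                "broker.example.com:9092", "keystore.jks", "truststore.jks", "secret");
        System.out.println(build(c).getProperty("bootstrap.servers"));
    }
}
```

Keeping these values out of application code is what makes the application location and security transparent: the same jar can run in another project or cluster without edits.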
Secure Streaming App with Kafka
Developer
1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
These steps are replaced by calls to the HopsUtil API
Operations
https://github.com/hopshadoop/hops-kafka-examples
Streaming Producer in Hopsworks
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
String topic = HopsUtil.getTopic(); // Optional
SparkProducer producer = HopsUtil.getSparkProducer();
Map<String, String> message = …
producer.produce(message);
Streaming Consumer in Hopsworks
JavaStreamingContext jssc =
    new JavaStreamingContext(sparkConf, Durations.seconds(2));
String topic = HopsUtil.getTopic(); // Optional
String consumerGroup = HopsUtil.getConsumerGroup(); // Optional
SparkConsumer consumer = HopsUtil.getSparkConsumer(jssc);
JavaInputDStream<ConsumerRecord<String, byte[]>> messages =
    consumer.createDirectStream();
jssc.start();
Less code to write
https://github.com/hopshadoop/hops-kafka-examples
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// brokerList and restApp are assumed to be defined elsewhere.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
props.put("schema.registry.url", restApp.restConnect);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    org.apache.kafka.common.serialization.StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
// Hard-coded keystore path and passwords
props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
props.put("ssl.keystore.password", "test1234");
props.put("ssl.key.password", "test1234");
// Hard-coded Avro schema for the topic
String userSchema =
    "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\"," +
    "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("name", "testUser");
KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
ProducerRecord<String, Object> message =
    new ProducerRecord<>("topicName", avroRecord);
producer.send(message);
Lots of Hard-Coded Endpoints Here!
SparkProducer producer = HopsUtil.getSparkProducer();
Map<String, String> message = …
producer.produce(message);
Massively Simplified Code for Secure Spark Streaming/Kafka
Distributing Certs for Spark Streaming
[Diagram components: Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark Streaming App, HopsUtil]
1. Launch Spark Job
2. Get certs, service endpoints
3. YARN Job, config
4. Materialize certs
5. Read Certs
6. Get Schema
7. Consume/Produce
8. Read ACLs for authentication
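Once the certificates have been materialized as local resources of the YARN container, reading them (step 5) comes down to opening a keystore file from local disk. The Java sketch below uses only the standard java.security.KeyStore API. The class name is hypothetical, and the JKS format is an assumption based on the .jks keystore path shown in the earlier producer example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;

// Hypothetical helper: loads a keystore that was placed on local disk
// for this application (for example by the resource manager).
final class MaterializedCerts {

    static KeyStore load(Path keyStoreFile, char[] password)
            throws IOException, GeneralSecurityException {
        KeyStore keyStore = KeyStore.getInstance("JKS"); // assumed format
        try (InputStream in = Files.newInputStream(keyStoreFile)) {
            // Verifies the store's integrity with the password while loading.
            keyStore.load(in, password);
        }
        return keyStore;
    }
}
```

A caller would pass the local file name and password handed to the job at launch; neither is specified on the slide. Deleting the file when the job ends would correspond to the "cleanup securely" step listed earlier in the deck.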
Multi-Tenant IoT Scenario
[Diagram labels: Sensor Nodes, Field Gateway, IoT Cloud Platform (Ingestion, Storage, Analysis); tenants: ACME, Evil Corp, DontBeEvil Corp]
IoT Scenario
[Diagram labels: customers ACME, DontBeEvil Corp, Evil Corp; clouds AWS, Google Cloud, Oracle Cloud]
User Apps control IoT Devices
IoT Company: Analyze Data, Data Services for Clients
Cloud-Native Analytics Solution
[Diagram labels: ACME, S3, GCS, Oracle, [Authorization], IoT Company, Spark Streaming App]
Each customer needs its own Analytics Infrastructure
Hopsworks Solution using Projects
[Diagram labels: IoT Company Project, Gateway Topic, ACME Project, ACME Topic, ACME Dataset, Data Stream, Analytics Reports]
Hopsworks Solution
[Diagram labels: ACME Project, ACME Topic, Spark Streaming App [Authorized], ACME Dataset, Spark Batch Job, ACME Analytics Reports]
Karamel/Chef for Automated Installation
Google Compute Engine
BareMetal
DEMO
Hops Roadmap
• HopsFS
– HA support for Multi-Data-Center
– Small files, 2-Level Erasure Coding
• HopsYARN
– Tensorflow with isolated GPUs
• Hopsworks
– P2P Dataset Sharing
– Jupyter, Presto, Hive
Summary
• Hops is a new distribution of Hadoop
– Tinker-friendly and open-source.
• Hopsworks provides first-class support for
Spark-Streaming-as-a-Service
– With support services like Kafka, ELK Stack,
Zeppelin, Grafana/InfluxDB.
Hops Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto
Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid,
Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas
Siskos.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan
Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali
Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt,
Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler,
Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Thank You.
We totally understand it’s going to be
America First Spark Streaming first, but
can we take this chance to say
Hopsworks second!
http://www.hops.io
@hopshadoop