Data Science with the Help of Metadata
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop
Metadata for Source Code
•Metadata for Source Code
- Enables questions like: who, when, what, why?
•Metadata for Automation
- Enables testing, quality-control, deployment.
•Metadata for Collaboration
- GitHub projects, teams
Metadata for Datasets?
•Access Control
•Data provenance
•Auditing
•Development
- Schema for the dataset
- How can I load/download this dataset?
- Quality control
Metadata can simplify development
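Hive-on-Spark (the schema comes from the Hive metastore):
# `sc` is an existing SparkContext, as in the pyspark shell.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("""
    SELECT id, count(*) AS nb_entries
    FROM my_db.log
    WHERE ts = '20160515'
    GROUP BY id
""")

SparkSQL (the schema must be declared by hand):
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sqlContext = SQLContext(sc)
# Assumption: fields are tab-separated; the slide omits the parsing step.
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True),
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("""
    SELECT log.id, count(*) AS nb_entries
    FROM log WHERE ts = '20160515'
    GROUP BY id
""")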
Hive is Metadata for HDFS files
Metadata for Files/Directories in HDFS
Add schemas using the Filesystem API
Add auditing using the FSImage API
Add access control using a Filesystem Plugin
Access Control in Hadoop
hdfs dfs -chmod -R 000 /apps/hive
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]
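The `chmod -R 000` command above clears every permission bit under /apps/hive, so plain HDFS permissions grant nothing to the owner, the group, or anyone else. As the title of the linked post suggests, access would then presumably be granted through Apache Ranger policies instead. A minimal sketch of what mode 000 means, using only Python's standard `stat` module (no HDFS call is made; `describe_mode` is a hypothetical helper):

```python
import stat

def describe_mode(octal_mode: int) -> str:
    """Render a directory's permission bits the way `ls -l` would."""
    return stat.filemode(stat.S_IFDIR | octal_mode)

locked = describe_mode(0o000)   # the mode set by `chmod 000`
typical = describe_mode(0o755)  # a common directory mode, for comparison
print(locked, typical)
```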
Metadata Totem Poles in Hadoop
How do you ensure the consistency of the metadata and the data?
Why are the Metadata Services Siloed?
HDFS v2
Components: HDFS Client, NameNode, Standby NameNode, Journal Nodes, Zookeeper, DataNodes
Max 200 GB metadata
YARN
Components: YARN Client, ResourceMgr, Standby ResourceMgr, Zookeeper, NodeManagers
Metadata on the JVM Heap Again
Hops: Distributed Metadata for Hadoop
HopsFS Architecture
Components: HDFS Client, NameNodes, Leader, NDB, DataNodes
> 12 TB
> 2.6X Throughput
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]
HopsYARN Architecture
Components: YARN Client, ResourceMgrs, Scheduler, Resource Trackers, NDB, NodeManagers
Leader Election for Failed Scheduler
Up to 10K Node Clusters
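One common way to elect a leader through a shared database is with heartbeat rows: every candidate periodically records a heartbeat, and the live candidate with the smallest ID is treated as leader. The sketch below illustrates that general technique only and is not the HopsYARN implementation. The in-memory dict stands in for a database table, and `HEARTBEAT_TIMEOUT`, `heartbeat`, and `current_leader` are hypothetical names.

```python
from typing import Dict, Optional

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical value

# Stand-in for a database table: node id -> time of the last heartbeat.
heartbeats: Dict[int, float] = {}

def heartbeat(node_id: int, now: float) -> None:
    """Record that a node was alive at time `now`."""
    heartbeats[node_id] = now

def current_leader(now: float) -> Optional[int]:
    """Return the smallest node id whose heartbeat is still fresh."""
    alive = [n for n, t in heartbeats.items() if now - t <= HEARTBEAT_TIMEOUT]
    return min(alive) if alive else None

heartbeat(1, now=0.0)
heartbeat(2, now=0.0)
leader_before = current_leader(now=1.0)  # both nodes alive

heartbeat(2, now=6.0)                    # node 1 has stopped sending heartbeats
leader_after = current_leader(now=7.0)   # node 1 has timed out
```

Passing `now` explicitly keeps the sketch deterministic; a real service would read a clock and store the rows transactionally.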
Experience Designing Metadata in Hops
Hops Metadata services
Elasticsearch
Database
[HDFS/YARN]
Kafka
Zookeeper
Metadata API
The Distributed Database is the Single Source-of-Truth for Metadata
Metadata for HDFS and YARN
Files
Directories
Containers
Provenance
Security
Quotas
Projects
Datasets
Metadata + Data in the same Database
2-phase commit (transactions)
Strong Consistency for Metadata.
Metadata Integrity maintained using 2PC and Foreign Keys.
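How foreign keys and transactions keep metadata consistent with the files it describes can be sketched with SQLite standing in for the distributed database. This is an assumption-laden illustration: SQLite is a single-node database with local transactions rather than 2PC, and the table names, column names, and path are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE file_metadata ("
    " inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,"
    " name TEXT NOT NULL,"
    " value TEXT NOT NULL)"
)

# The file and its extended metadata are inserted in one transaction.
with conn:
    conn.execute("INSERT INTO inodes VALUES (1, '/Projects/demo/log.csv')")
    conn.execute("INSERT INTO file_metadata VALUES (1, 'schema', 'ts,id,it')")

# Deleting the file removes its metadata through the foreign key cascade.
with conn:
    conn.execute("DELETE FROM inodes WHERE id = 1")
orphans = conn.execute("SELECT count(*) FROM file_metadata").fetchone()[0]

# Metadata that points at a non-existent file is rejected.
try:
    with conn:
        conn.execute("INSERT INTO file_metadata VALUES (99, 'owner', 'alice')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because the metadata lives in the same database as the file entries, there is no window in which metadata can outlive or precede the file it describes.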
Metadata in Elasticsearch
Files
Directories
Metadata
Search
Indexes
Database → Elasticsearch one-way replication
Eventual Consistency for Metadata.
Metadata Integrity maintained by Asynchronous Replication.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
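One-way asynchronous replication can be sketched as a change log that a background step drains into the search index. The sketch below is a simplified model under stated assumptions, not ePipe itself: plain dicts stand in for the database and the Elasticsearch index, and `put_metadata`, `delete_metadata`, and `replicate_once` are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict, Tuple

database: Dict[str, dict] = {}                # source of truth
search_index: Dict[str, dict] = {}            # stand-in for an Elasticsearch index
change_log: Deque[Tuple[str, str]] = deque()  # (operation, path) events

def put_metadata(path: str, doc: dict) -> None:
    database[path] = doc
    change_log.append(("upsert", path))

def delete_metadata(path: str) -> None:
    database.pop(path, None)
    change_log.append(("delete", path))

def replicate_once() -> int:
    """Apply all pending events to the index; return how many were applied."""
    applied = 0
    while change_log:
        op, path = change_log.popleft()
        if op == "upsert" and path in database:
            search_index[path] = dict(database[path])
        else:
            # Deleted, or upserted and then deleted before replication ran.
            search_index.pop(path, None)
        applied += 1
    return applied

put_metadata("/Projects/demo/log.csv", {"owner": "alice"})
stale = "/Projects/demo/log.csv" not in search_index  # the index lags the database
replicate_once()
converged = search_index == database
```

The window between the write and `replicate_once()` is the eventual-consistency gap: searches may miss recent changes, but the index converges once the log is drained.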
Metadata for Kafka
Topics
Partitions
ACLs
Zookeeper/Kafka
Database
Eventual Consistency for Metadata.
Metadata integrity maintained by custom recovery logic and polling.
Metadata API
polling
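Polling-based recovery can be sketched as a reconciliation loop that compares the topics recorded in the database with the topics that exist in Kafka and repairs the difference. The sketch assumes the database wins on disagreement, following the single source-of-truth statement earlier in the deck. Sets stand in for both systems, and `reconcile` and `poll_once` are hypothetical names, not a Kafka or Hops API.

```python
from typing import Set, Tuple

def reconcile(db_topics: Set[str], kafka_topics: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (topics to create in Kafka, topics to delete from Kafka)."""
    to_create = db_topics - kafka_topics
    to_delete = kafka_topics - db_topics
    return to_create, to_delete

def poll_once(db_topics: Set[str], kafka_topics: Set[str]) -> Tuple[Set[str], Set[str]]:
    """One polling round: bring the Kafka stand-in in line with the database."""
    to_create, to_delete = reconcile(db_topics, kafka_topics)
    kafka_topics.update(to_create)
    kafka_topics.difference_update(to_delete)
    return to_create, to_delete

db = {"clicks", "logs"}
kafka = {"logs", "orphan"}
created, deleted = poll_once(db, kafka)
```

Between polling rounds the two sides may disagree, which is why the slide describes this metadata as eventually consistent.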
Case Study: Self-Service Multi-Tenant Projects
Problem: Sensitive Data needs its own Cluster
NSA DataSet
User DataSet
Alice can copy/cross-link between data sets.
Alice has only one Kerberos identity.
Neither attribute-based access control nor dynamic roles are supported in Hadoop.
Solution: Project-Specific UserIDs
Project NSA: Alice is a member as NSA__Alice
Project Users: Alice is a member as Users__Alice
HDFS enforces access control
How can we share DataSets between Projects?
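The project-specific user IDs above follow a `Project__User` pattern, so the same person appears to the filesystem as a different user in each project. A minimal sketch of that idea follows. The `/Projects/...` paths and the `acl` dict are assumptions that stand in for HDFS permissions, and `project_user` and `can_access` are hypothetical helpers.

```python
def project_user(project: str, username: str) -> str:
    """Build the project-specific user ID, e.g. NSA__Alice."""
    return f"{project}__{username}"

# Stand-in for HDFS permissions: directory prefix -> user IDs allowed under it.
acl = {
    "/Projects/NSA": {project_user("NSA", "Alice")},
    "/Projects/Users": {project_user("Users", "Alice")},
}

def can_access(user: str, path: str) -> bool:
    """Allow access only if the user is listed for a directory containing the path."""
    return any(
        path.startswith(prefix + "/") and user in users
        for prefix, users in acl.items()
    )

nsa_alice = project_user("NSA", "Alice")
users_alice = project_user("Users", "Alice")
```

With these IDs, `can_access(users_alice, "/Projects/NSA/data.csv")` is false even though the same person is a member of both projects, which is what blocks copying between the two data sets.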
Sharing DataSets between Projects
Project NSA: Alice is a member as NSA__Alice
Project Users: Alice is a member as Users__Alice
A project owns the DataSet
Add members of Project NSA to the DataSet group
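Sharing by group membership can be sketched as follows: each DataSet has a group of project-specific user IDs, and sharing adds the target project's members to that group. The sketch assumes Project Users owns the DataSet, which the slide does not state outright. The user "Bob", the DataSet name, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Project name -> usernames of its members (hypothetical data).
projects: Dict[str, Set[str]] = {"NSA": {"Alice"}, "Users": {"Alice", "Bob"}}

@dataclass
class DataSet:
    name: str
    owner_project: str
    group: Set[str] = field(default_factory=set)  # project-specific user IDs

def project_members(project: str) -> Set[str]:
    """Project-specific user IDs for every member of a project."""
    return {f"{project}__{user}" for user in projects[project]}

def create_dataset(name: str, owner_project: str) -> DataSet:
    """A new DataSet's group starts with the owning project's members."""
    return DataSet(name, owner_project, project_members(owner_project))

def share(dataset: DataSet, target_project: str) -> None:
    """Share a DataSet by adding the target project's members to its group."""
    dataset.group |= project_members(target_project)

ds = create_dataset("user_logs", "Users")
share(ds, "NSA")
```

After sharing, `NSA__Alice` can read the DataSet through ordinary group permissions, while ownership stays with the original project.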
HopsWorks (WebApp) Enforces Dynamic Roles
Diagram labels: Alice@gmail.com, Authenticate, HopsWorks, Projects, Secure Impersonation, NSA__Alice, Users__Alice, HopsFS, HopsYARN, Kafka, X.509 Certificates
X.509 Certificate Per Project-Specific User
Diagram labels: Alice@gmail.com, Authenticate, Add/Del Users, Project Mgr, Insert/Remove Certs, Distributed Database, Cert Signing Requests, Root CA, Services (Hadoop, Spark, Kafka, etc.)
Project
•A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•A project has an owner
•A project has quotas
Diagram labels: project; dataset 1 to dataset N (HDFS); Topic 1 to Topic N (Kafka)
Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users
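The two roles can be sketched as a simple role-to-privilege table. The sketch mirrors only the privileges listed on the slide; whether a Data Owner also holds the Data Scientist privileges is not stated, so that is left out. The privilege identifiers and `is_allowed` are hypothetical names.

```python
from enum import Enum

class Role(Enum):
    DATA_OWNER = "Data Owner"
    DATA_SCIENTIST = "Data Scientist"

# Privileges exactly as listed on the slide, one identifier per bullet.
PRIVILEGES = {
    Role.DATA_OWNER: {
        "import_export_data",
        "manage_membership",
        "share_datasets_topics",
    },
    Role.DATA_SCIENTIST: {"write_and_run_code"},
}

def is_allowed(role: Role, action: str) -> bool:
    """Check whether a role carries the given privilege."""
    return action in PRIVILEGES[role]

owner_can_share = is_allowed(Role.DATA_OWNER, "share_datasets_topics")
scientist_can_share = is_allowed(Role.DATA_SCIENTIST, "share_datasets_topics")
```

Because the table is plain data attached to a project, a Data Owner can change membership without involving a cluster administrator.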
Elastic Hadoop
Each Project has a:
• YARN CPU Quota
• HDFS Storage Quota
Uber-Style Pricing to
incentivize cluster usage
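A per-project quota with load-dependent pricing could look roughly like the sketch below. Everything numeric here is an assumption for illustration: the slide gives no units, thresholds, or multipliers, and `price_multiplier`, `ProjectQuota`, and `charge_cpu` are hypothetical names.

```python
from dataclasses import dataclass

def price_multiplier(utilization: float) -> float:
    """Hypothetical pricing: cheaper on an idle cluster, dearer on a busy one."""
    if utilization < 0.5:
        return 0.5
    if utilization < 0.9:
        return 1.0
    return 2.0

@dataclass
class ProjectQuota:
    yarn_cpu_seconds: float  # remaining YARN CPU quota (assumed unit)
    hdfs_bytes: int          # HDFS storage quota (assumed unit)

    def charge_cpu(self, seconds: float, utilization: float) -> float:
        """Deduct the priced CPU time from the quota, or fail if it is exhausted."""
        cost = seconds * price_multiplier(utilization)
        if cost > self.yarn_cpu_seconds:
            raise ValueError("YARN CPU quota exceeded")
        self.yarn_cpu_seconds -= cost
        return cost

quota = ProjectQuota(yarn_cpu_seconds=100.0, hdfs_bytes=10 * 1024**3)
busy_cost = quota.charge_cpu(seconds=10.0, utilization=0.95)  # busy cluster
idle_cost = quota.charge_cpu(seconds=10.0, utilization=0.10)  # idle cluster
```

The same ten seconds of CPU cost four times as much on the busy cluster, which nudges projects toward running jobs when the cluster is idle.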
Sharing DataSets/Topics between Projects
The same as Sharing Folders in Dropbox
Added Multi-Tenancy to Zeppelin
www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned up to 3,000-4,000 servers, 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
Status and Upcoming
•Automated installation support using Vagrant/Chef or Karamel/Chef
•First official release of Hopsworks coming soon
•Globally shared datasets with peer-to-peer technology, backed by our data center.
•Support for Apache Beam
Summing Up
Metadata services have the potential to make your life easier as a Data Scientist
Most Hadoop Metadata services are proprietary and require an administrator-in-the-loop
Hops provides an open, tinker-friendly platform for building consistent metadata
Hopsworks shows how you can leverage metadata to build a self-service project-based model for Hadoop/Spark/Flink applications
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Join us!
http://github.com/hopshadoop
www.hops.io
@hopshadoop
