© Cloudera, Inc. All rights reserved.
Hadoop Encryption
Wei-Chiu Chuang, Cloudera
Why Encryption
• Information leaks affect tens to hundreds of millions of people
• Personally identifiable information (PII)
• Credit cards, SSNs, account logins
• Encryption would have prevented some of these leaks
• Encryption is a regulatory requirement for many business sectors
• Finance (PCI DSS)
• Government (Data Protection Directive)
• Healthcare (DPD, HIPAA)
Related Technologies
Data In-Motion Encryption
• SSL/TLS
• Hadoop Data Transfer Encryption
• Hadoop RPC Encryption
Data At-Rest Encryption
• At Linux volume level
• Transparent Encryption at HDFS level
• HBase Column Family level
• Parquet Column level Encryption Xinli
HDFS Transparent Encryption: In a Nutshell
[Diagram: HDFS namespace tree (/, /data, /tmp, /data/1, /data/f2) showing an encryption zone, its encryption zone key, and a per-file Data Encryption Key]
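As a rough illustration of the namespace picture above, the sketch below resolves which encryption zone (if any) a path belongs to by walking up its parent directories. It is a toy model, not HDFS code: the `ZONES` table, the choice of `/data` as the zone root, the key name `data_key`, and the function `find_zone` are all hypothetical.

```python
import posixpath

# Hypothetical zone table: zone root directory -> encryption zone key name.
# Treating /data as the zone root is an assumption made for this example.
ZONES = {"/data": "data_key"}


def find_zone(path, zones):
    """Return (zone_root, key_name) for the nearest enclosing zone, or None."""
    current = posixpath.normpath(path)
    while True:
        if current in zones:
            return current, zones[current]
        if current == "/":
            return None  # reached the root without finding a zone
        current = posixpath.dirname(current)


print(find_zone("/data/f2", ZONES))  # path inside the assumed zone
print(find_zone("/tmp/x", ZONES))    # path outside any zone
```

Walking parent directories rather than comparing string prefixes avoids treating a sibling such as /database as part of /data.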
HDFS Transparent Encryption: In a Nutshell
[Diagram: Client, NameNode (NN), DataNode (DN), and Key Management Server (KMS), with an "in EZ?" check on the file path]
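The interaction between the four components can be sketched as a small simulation. This is a toy model under stated assumptions: the classes `ToyKMS` and `ToyNameNode`, the helper functions, and the dict standing in for a DataNode are hypothetical, and `toy_stream_xor` is a SHA-256 based stand-in for a real cipher (HDFS uses AES, which the Python standard library does not provide). The point is only who holds which key: the KMS keeps the zone key, the NameNode stores the encrypted DEK (EDEK), and the DataNode sees ciphertext only.

```python
import hashlib
import os


def toy_stream_xor(key, iv, data):
    """Toy keystream cipher (SHA-256 in counter mode). NOT AES; illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + iv + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class ToyKMS:
    """Holds encryption zone keys; they never leave this object."""

    def __init__(self):
        self._zone_keys = {}

    def create_key(self, name):
        self._zone_keys[name] = os.urandom(16)

    def generate_edek(self, key_name):
        """Create a fresh DEK and hand it out only in encrypted form (EDEK)."""
        dek, iv = os.urandom(16), os.urandom(16)
        material = toy_stream_xor(self._zone_keys[key_name], iv, dek)
        return {"key_name": key_name, "iv": iv, "material": material}

    def decrypt_edek(self, edek):
        zone_key = self._zone_keys[edek["key_name"]]
        return toy_stream_xor(zone_key, edek["iv"], edek["material"])


class ToyNameNode:
    """Stores one EDEK per file as metadata; never sees a plaintext DEK."""

    def __init__(self, kms, zone_key_name):
        self.kms = kms
        self.zone_key_name = zone_key_name
        self.edeks = {}

    def create_file(self, path):
        self.edeks[path] = self.kms.generate_edek(self.zone_key_name)
        return self.edeks[path]


def client_write(nn, kms, datanode, path, plaintext):
    edek = nn.create_file(path)      # NameNode returns the file's EDEK
    dek = kms.decrypt_edek(edek)     # client asks the KMS for the DEK
    datanode[path] = toy_stream_xor(dek, edek["iv"], plaintext)


def client_read(nn, kms, datanode, path):
    edek = nn.edeks[path]
    dek = kms.decrypt_edek(edek)
    return toy_stream_xor(dek, edek["iv"], datanode[path])


kms = ToyKMS()
kms.create_key("data_key")
nn = ToyNameNode(kms, "data_key")
dn = {}  # stand-in for DataNode block storage
client_write(nn, kms, dn, "/data/f2", b"forty bytes of example plaintext payload")
print(client_read(nn, kms, dn, "/data/f2"))
```

Encryption and decryption happen on the client, which is why every file create or open in an encryption zone involves a KMS call.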
Features
• Minor performance impact on HDFS reads and writes
• OpenSSL and AES-NI acceleration
• ~7.5% overhead for reads, ~0% for writes
• Key ACLs
• Warm-up/Caching (*)
• Key rollover
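To make the warm-up/caching idea concrete, here is a minimal sketch of a pre-filled queue of encrypted keys that is topped up once it drops to a low watermark. The class `EdekCache`, its parameters, and the size and watermark values are illustrative assumptions, and the refill here is synchronous for simplicity, whereas a real implementation would likely refill in the background.

```python
from collections import deque


class EdekCache:
    """Pre-filled queue of EDEKs for one key, refilled at a low watermark."""

    def __init__(self, fetch_batch, size=10, low_watermark=0.3):
        # fetch_batch(n) is a hypothetical callable that returns n EDEKs
        # in a single KMS round trip.
        self._fetch_batch = fetch_batch
        self._size = size
        self._low = int(size * low_watermark)
        self._queue = deque()

    def warm_up(self):
        """Fill the queue before the first request arrives."""
        self._refill()

    def _refill(self):
        missing = self._size - len(self._queue)
        if missing > 0:
            self._queue.extend(self._fetch_batch(missing))

    def take(self):
        """Return one EDEK, topping the queue up first if it is running low."""
        if len(self._queue) <= self._low:
            self._refill()
        return self._queue.popleft()


round_trips = []


def fake_kms_batch(n):
    round_trips.append(n)  # record how many keys each round trip fetched
    return [f"edek-{len(round_trips)}-{i}" for i in range(n)]


cache = EdekCache(fake_kms_batch)
cache.warm_up()
for _ in range(8):
    cache.take()
print(round_trips)  # batch sizes requested from the fake KMS
```

Batching keeps most file creates off the KMS request path, at the cost of holding unused key material in memory, which is the trade-off some security-sensitive environments reject.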
Dev History
• First released in Hadoop 2.6.0 / CDH5.3 in December 2014
• Many, many bug fixes and enhancements
• Functional bugs, failure handling bugs, scale bugs
• Stable after Hadoop 2.8 / CDH5.11-ish
Lessons Learned
Scale-out is not easy to deploy
Security
Endurance, scale tests are essential
Too little emphasis on KMS as a performance bottleneck
FileSystem#getDelegationToken() API/integration
High throughput REST API layer is hard
Status Quo
Among Cloudera’s customers (pre-merger):
• 14% Data Transfer Encryption
• 16% Data at Rest Encryption
• 19% RPC Encryption
• 44% Kerberized
Largest at-rest encryption cluster: ~1,000 nodes, > 50PB
Troubleshooting
Performance anomaly
• openssl-devel library
• Entropy
• rng-tools
• Secure Random
• hadoop.security.secure.random.impl = org.apache.hadoop.crypto.random.OpensslSecureRandom
Proxy user configuration
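For the entropy item above, a quick host-level check can be sketched as follows. It reads the Linux kernel's entropy estimate from /proc; the function names and the threshold of 500 bits are illustrative assumptions, not an official recommendation. Newer kernels may report a small fixed value in this file, in which case the check is not meaningful.

```python
from pathlib import Path

ENTROPY_FILE = "/proc/sys/kernel/random/entropy_avail"  # Linux only


def read_entropy(path=ENTROPY_FILE):
    """Return the kernel's available-entropy estimate in bits, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def entropy_is_low(bits, threshold=500):
    """threshold is an illustrative cut-off chosen for this sketch."""
    return bits is not None and bits < threshold


bits = read_entropy()
if bits is None:
    print("entropy estimate not available on this host")
elif entropy_is_low(bits):
    print(f"entropy looks low ({bits} bits); consider rng-tools")
else:
    print(f"entropy estimate: {bits} bits")
```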
Bad Practices
• No KMS HA
• KMS enabled, RPC encryption not enabled
• KMS enabled, but no Kerberos
• KMS w/o SSL
• Data transfer encryption is enabled, but using an unoptimized crypto algorithm
• 3DES or RC4 instead of AES with AES-NI acceleration
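The bad practices above can be turned into a simple configuration check. The sketch below takes a plain dict of Hadoop properties and returns warnings. The function `find_bad_practices` and its heuristics are assumptions made for illustration: for example, it treats a single host in the KMS provider URI as "no HA", which would misfire if HA is provided by a load balancer behind one hostname. The property names are standard Hadoop configuration keys; the defaults used when a key is absent reflect my understanding of stock Hadoop and should be checked against your distribution.

```python
def find_bad_practices(conf):
    """conf: dict of Hadoop property name -> string value. Returns a list of warnings."""
    warnings = []

    provider = conf.get("hadoop.security.key.provider.path", "")
    if provider.startswith("kms://"):
        # Expected shape: kms://<scheme>@<host1>;<host2>:<port>/kms
        scheme, _, rest = provider[len("kms://"):].partition("@")
        hosts = rest.split("/", 1)[0].rsplit(":", 1)[0]
        if ";" not in hosts:
            warnings.append("No KMS HA: only one KMS host in the provider URI")
        if scheme != "https":
            warnings.append("KMS without SSL")
        if conf.get("hadoop.rpc.protection", "authentication") != "privacy":
            warnings.append("KMS enabled, RPC encryption not enabled")
        if conf.get("hadoop.security.authentication", "simple") != "kerberos":
            warnings.append("KMS enabled, but no Kerberos")

    if conf.get("dfs.encrypt.data.transfer", "false") == "true":
        if not conf.get("dfs.encrypt.data.transfer.cipher.suites"):
            warnings.append(
                "Data transfer encryption without an AES cipher suite "
                "(falls back to 3DES or RC4)"
            )
    return warnings


example = {
    "hadoop.security.key.provider.path": "kms://http@kms01.example.com:9600/kms",
    "dfs.encrypt.data.transfer": "true",
}
for warning in find_bad_practices(example):
    print(warning)
```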
Challenges
KMS Low Throughput
• NN can sustain > 100 thousand RPC ops/second
• namespace ops, block reports
• KMS: at most a few thousand RPC ops/second
• create, append, read, reencrypt
• 3-4 KMS servers not uncommon
• Jetty
• SSL Handshake
• Impala/Parquet with wide tables (> 100 columns)
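A back-of-the-envelope sizing calculation shows why several KMS servers end up being needed. The sketch assumes each KMS call costs roughly the same and that load spreads evenly across servers; the function name, the target utilization, and the example numbers are hypothetical and only loosely based on the "few thousand ops/second" figure above.

```python
import math


def kms_servers_needed(kms_calls_per_sec, per_server_ops_per_sec, target_utilization=0.7):
    """Estimate how many KMS servers are needed to absorb a given call rate."""
    if per_server_ops_per_sec <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    usable = per_server_ops_per_sec * target_utilization
    return max(1, math.ceil(kms_calls_per_sec / usable))


# Example: 9,000 KMS calls/second against servers that each handle ~3,000 ops/second.
print(kms_servers_needed(9000, 3000, target_utilization=1.0))
print(kms_servers_needed(9000, 3000, target_utilization=0.5))
```

Workloads that open many files per query multiply the call rate, which may be one reason wide-table scans are listed as a stress case.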
Future
Pluggable KMS ACL Framework (HADOOP-14951)
WebHDFS At Rest Encryption Support (HDFS-12355)
NFS Gateway At Rest Encryption Support (HDFS-13521)
Performance Improvements (HADOOP-15743, HADOOP-15811)
KMS Benchmark Tool (HADOOP-15967)
KMS over Hadoop RPC?
Current KMS Transport Layer
[Diagram: KMSClient and NameNode each use an HTTP client to reach the KMS (Jetty) over REST API/HTTP]
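To show what one call over the current REST transport looks like, the sketch below builds (but does not send) a request asking the KMS to decrypt an encrypted key. The URL path and JSON field names follow the public Hadoop KMS REST documentation as I understand it; treat them, the URL-safe base64 variant, the host name, and the `build_decrypt_request` helper as assumptions to verify against your Hadoop version. Authentication (Kerberos or delegation tokens) and TLS setup are omitted.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request


def _b64(raw):
    # Assumption: URL-safe base64 for binary fields.
    return base64.urlsafe_b64encode(raw).decode("ascii")


def build_decrypt_request(base_url, key_name, version_name, iv, edek_material):
    """Build a POST request asking the KMS to decrypt one EDEK."""
    url = f"{base_url}/v1/keyversion/{quote(version_name, safe='')}/_eek?eek_op=decrypt"
    body = {"name": key_name, "iv": _b64(iv), "material": _b64(edek_material)}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_decrypt_request(
    "https://kms01.example.com:9600/kms",  # hypothetical KMS endpoint
    key_name="data_key",
    version_name="data_key@0",
    iv=b"\x00" * 16,
    edek_material=b"\x01" * 16,
)
print(req.get_method(), req.full_url)
```

Each such call carries HTTP and JSON overhead, and a new connection also pays for an SSL handshake, which is part of the motivation for considering Hadoop RPC as the transport.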
KMS over Hadoop RPC
Benefit of KMS over Hadoop RPC:
• Proven performance
• Code reuse
[Diagram: KMSClient and NameNode connected to the KMS over Hadoop RPC]

Hadoop Meetup Jan 2019 - Hadoop Encryption

Editor's Notes

  • #7: Caching improves performance, but some users' environments prohibit caching due to security concerns.
  • #9: KMS was designed to be horizontally scalable. However, because Cloudera recommends two KMS instances in HA plus two Key Trustee Servers for production workloads, the cost of HA is high.